Annotate Clinvar With ExAC

The ExAC paper notes that some of the variants in ClinVar that are classified as pathogenic (or likely pathogenic) are actually in high enough (>1%) allele frequency in ExAC to indicate that it is unlikely that these are really pathogenic.

Here, we will use vcfanno to annotate the clinvar VCF with the allele frequencies for ExAC so that we can find variants that are indicated as pathogenic and rare in ExAC.

The ExAC reports the alternate counts and the total number of chromosomes (AN*) and the alternate allele counts (AC*) so, to we will annotate with those and then use postannotation in vcfanno to get the AF as AC/AN. We will use an already decomposed and normalized version of ExAC (but vcfanno will match on any of the alternate alleles if multiple are present for a given variant). The [[annotation]] section in the config file will look like this:

ExAC Config

[[annotation]]
file="ExAC.r0.3.sites.vep.tidy.vcf.gz"
fields = ["AC_Adj", "AN_Adj", "AC_AFR", "AN_AFR", "AC_AMR", "AN_AMR", "AC_EAS", "AN_EAS", "AC_FIN", "AN_FIN", "AC_NFE", "AN_NFE", "AC_OTH", "AN_OTH", "AC_SAS", "AN_SAS"]
names = ["ac_exac_all", "an_exac_all", "ac_exac_afr", "an_exac_afr", "ac_exac_amr", "an_exac_amr", "ac_exac_eas", "an_exac_eas", "ac_exac_fin", "an_exac_fin", "ac_exac_nfe", "an_exac_nfe", "ac_exac_oth", "an_exac_oth", "ac_exac_sas", "an_exac_sas"]
ops=["self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self", "self"]

Note that we can have as many of these sections as we want with vcfanno, but here we are only interested in annotating clinvar with the single ExAC file. The fields section indicates which fields to pull from the ExAC VCF. The names section indicates how those fields will be named as they are added to the clinvar VCF. Since we intend to match on REF and ALT, there will only be 1 match so the op is just "self" for all fields.

Because we want to know the allele frequency, we will need to divide AC by AN. This is done in a [[postannotation]] section that looks like this:

PostAnnotation

[[postannotation]]
fields=["ac_exac_all", "an_exac_all"]
name="af_exac_all"
op="div2"
type="Float"

We need one of these section for each population, which is onerous, but simple enough to generate with a small script. Note that the op div is provided by vcfanno, but we could have written this as a custom op in lua as:

function div(a, b)
    if(a == 0){ return 0.0; }
    return string.format("%.9f", a / b)
}

and then use:

op="lua:div(ac_exac_all[1], an_exac_all)"

in the [[postannotation]]. Note that we need to use [1] for ac_exac_all because it has Number=A in the header of the VCF so it might be a table. Here we take only the first value. We could instead modify the lua:div() function to return a comma-delimited list if len(a) > 1.

These postannotation sections are executed in the order they are specified so we can specify a final section that takes the maximum of all of the allele frequencies. This is informative as a truly pathogenic variant should have a low allele frequency in all populations. Here is the section to take the maximum AF of all the populations which we've already calculated:

[[postannotation]]
fields=["af_exac_all", "af_exac_afr", "af_exac_amr", "af_exac_eas", "af_exac_nfe", "af_exac_oth", "af_exac_sas"]
op="max"
name="max_aaf_all"
type="Float"

Flag Common Pathogenic

Finally, we can flag variants that have a max_aaf_all above some cutoff and are labelled as pathogenic.

[[postannotation]]
fields=["clinvar_sig", "max_aaf_all"]
op="lua:check_clinvar_aaf(clinvar_sig, max_aaf_all, 0.005)"
name="common_pathogenic"
type="Flag"

Note that we use 0.005 as the allele frequency cutoff. For any variant that was not present in ExAC, the max_aaf_all field will be absent from the INFO field and so this will not be called.

If we've saved this in a file called exac-af.conf then the vcfanno command looks like:

vcfanno -lua clinvar_exac.lua -p 4 -base-path $EXAC_DIR clinvar_exac.conf $CLINVAR_VCF > $CLINVAR_ANNOTATED_VCF

This command finishes in about 2 minutes on a good laptop with a core i7 processor.

An example INFO field from the clinvar file after annotation looks like this:

RS=17855739;RSPOS=5831840;RV;dbSNPBuildID=123;SSR=0;SAO=1;VP=0x050060000a05150136110100;GENEINFO=FUT6:2528;WGT=1;VC=SNV;PM;NSM;REF;ASP;VLD;G5;GNO;KGPhase1;KGPhase3;LSD;OM;CLNALLE=1;CLNHGVS=NC_000019.9:g.5831840C>T;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=136836.0001;CLNSIG=5;CLNDSDB=MedGen:OMIM;CLNDSDBID=C3151219:613852;CLNDBN=Fucosyltransferase_6_deficiency;CLNREVSTAT=single;CLNACC=RCV000017626.26;CAF=0.8393,0.1607;COMMON=1;ac_exac_all=10114;an_exac_all=121354;ac_exac_afr=3093;an_exac_afr=10402;ac_exac_amr=449;an_exac_amr=11572;ac_exac_eas=867;an_exac_eas=8638;ac_exac_fin=210;an_exac_fin=6612;ac_exac_nfe=2836;an_exac_nfe=66712;ac_exac_oth=62;an_exac_oth=906;ac_exac_sas=2597;an_exac_sas=16512;af_exac_all=0.0833;af_exac_afr=0.2973;af_exac_amr=0.0388;af_exac_eas=0.1004;af_exac_nfe=0.0425;af_exac_oth=0.0684;af_exac_sas=0.1573;max_aaf_all=0.2973;clinvar_sig=pathogenic;common_pathogenic

(NOTE that clinvar has applied the COMMON=1 tag here indicating a high AF in 1kg.)

With our ExAC fields appearing at the end:

ac_exac_all=10114;an_exac_all=121354;ac_exac_afr=3093;an_exac_afr=10402;ac_exac_amr=449;an_exac_amr=11572;ac_exac_eas=867;an_exac_eas=8638;ac_exac_fin=210;an_exac_fin=6612;ac_exac_nfe=2836;an_exac_nfe=66712;ac_exac_oth=62;an_exac_oth=906;ac_exac_sas=2597;an_exac_sas=16512;af_exac_all=0.0833;af_exac_afr=0.2973;af_exac_amr=0.0388;af_exac_eas=0.1004;af_exac_nfe=0.0425;af_exac_oth=0.0684;af_exac_sas=0.1573;max_aaf_all=0.2973;clinvar_sig=pathogenic;common_pathogenic

So this variant was classified as pathogenic, but has a max_aaf_all of 0.2973 and so it received the 'common_pathogenic' flag as did 566 other clinvar variants.

While the config file used to generate this final dataset was fairly involved, each step is very simple and it shows the power in vcfanno. However, note that for most analyses, it will be sufficient to specify a config file that pulls the fields of interest.

Supporting Files

The full config and lua files used to run this analysis are available here.