Day 3: Variant Annotation with VEP
Presentation
Ensembl VEP
VEP is a powerful and flexible tool for annotating, analyzing, and prioritizing genomic variants. It is particularly useful in cancer genomics, where we need to understand the potential functional impact of somatic mutations for research and clinical decisions.
With VEP we can start to address questions such as:
- How damaging is a specific somatic mutation according to reference databases?
- Is the mutation already known in this particular cancer type?
- What is the predicted functional impact of this mutation in the tumour context?
- Are there known associations between this mutation and specific drugs or drug classes?
- How common the mutation is in various populations?
Why do we need a tool like VEP?
In a typical cancer exome or panel we may obtain thousands of somatic SNVs and indels per sample. These variants:
- Are not analyzed one by one by hand.
- Need to be standardized (same genome build, same transcript model, consistent consequence terms).
- Need to be prioritized so that we can focus on potentially actionable or biologically relevant changes.
A tool like VEP allows us to:
- Run high‑throughput annotation of all variants in a VCF.
- Attach consistent functional predictions and database links (ClinVar, COSMIC, etc.).
- Reduce a long list of raw calls to a shorter list of interpretable candidate variants for downstream review.
In clinical oncology, somatic variants are classified using the AMP/ASCO/CAP framework (Li et al., 2017) — a four-tier system based on their clinical relevance:
| Tier | Category | Description |
|---|---|---|
| I | Strong clinical significance | FDA-approved therapy or included in professional guidelines for this tumor type |
| II | Potential clinical significance | FDA-approved therapy for a different tumor type, or investigational (clinical trials, small studies) |
| III | Unknown clinical significance | Not observed in cancer databases, or no convincing published evidence of cancer association |
| IV | Benign or likely benign | Observed at significant frequency in population databases (e.g. gnomAD), with no published cancer association |
This is different from the ACMG/AMP 5-tier system used for germline variants. In this course, the annotation and filtering steps you will perform (impact prediction, ClinVar lookup, population frequency filtering, REVEL/AlphaMissense scores) are the building blocks for this kind of clinical tiering — though formal classification requires additional clinical context beyond bioinformatics alone.
Reference: Li MM, et al. Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer. J Mol Diagn (2017) 19(1):4-23. doi:10.1016/j.jmoldx.2016.10.002
Key Concepts
Install VEP
This course uses VEP v115 (2025). Some features and default behaviors may differ in other versions. Always check the release notes for your version.
Setup
All the environment is prepared for you, you just need to create a symbolic link to access the data from your project directory:
cd ~/project
ln -s /data .If you want to install VEP locally or on your compute cluster, you can follow the instructions below.
# Installing Conda #https://docs.anaconda.com/miniconda/
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all
# Using Conda
conda create -n vep samtools python=3.10 ensembl-vep=115.0
# conda install -c bioconda ensembl-vep=115.0 # if installing in current env
# Activate VEP's env
conda activate vep
# PLEASE DO NOT RUN During the course, only if you follow locally (~60 min, > 25GB)
# Install all the data locally
#vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT (this command will download in $HOME/.vep the cache folder)
#vep_install -a cf -s homo_sapiens -y GRCh38 -c $HOME/project/data/vep --CONVERT (this specify the cache folder)
# Plugins installation
# vep_install -a p --PLUGINS allInstead of conda you could try installing mamba. If you are on Unix-based system this is simple mamba. . In case you are following locally, you can environment with mamba activate ngs-tools.
Transcript Selection
VEP can report consequences for all transcripts overlapping a variant. However, this often results in multiple annotations per variant. Several approaches are available to handle this:
--pick: Selects one consequence per variant based on a predefined hierarchy. Since VEP v106+, the hierarchy prioritises MANE Select > MANE Plus Clinical > Canonical > TSL 1 > APPRIS principal isoforms > protein coding biotype > severity. See the VEP documentation for the full selection order.--pick_allele: Selects one consequence per variant allele--pick_allele_gene: Selects one consequence per variant allele per gene--per_gene: Selects one consequence per gene--flag_pick: Flags the selected consequence while retaining others
MANE Select (Matched Annotation from NCBI and EMBL-EBI) transcripts represent the consensus between Ensembl/GENCODE and RefSeq for each protein-coding gene. Since VEP v106, MANE Select is the top priority in --pick selection for human data. When reporting variants clinically, always prefer the MANE Select transcript over older “canonical” designations.
Consequence Priority
VEP ranks consequences by severity: Figure 1: Image from the Ensembl VEP website.
| Consequence | Impact | Description |
|---|---|---|
| Transcript ablation | HIGH | Complete deletion of a gene transcript |
| Splice acceptor variant | HIGH | Variant at the 3’ end of an intron that disrupts RNA splicing |
| Splice donor variant | HIGH | Variant at the 5’ end of an intron that disrupts RNA splicing |
| Stop gained | HIGH | Variant that introduces a premature stop codon |
| Frameshift variant | HIGH | Insertion/deletion causing reading frame disruption |
| Stop lost | HIGH | Variant removes stop codon, extending protein length |
| Start lost | HIGH | Loss of start codon prevents proper translation initiation |
| Transcript amplification | HIGH | Increased copy number of transcript regions |
| Inframe insertion | MODERATE | Insertion of bases in multiples of 3, no frameshift |
| Inframe deletion | MODERATE | Deletion of bases in multiples of 3, no frameshift |
| Missense variant | MODERATE | Single nucleotide change causing amino acid substitution |
| Protein altering variant | MODERATE | Variant changes protein sequence |
Practical Exercises
In case that you do not have the Mutect2 call and/or the CNVkit call, you can obtain those from here. In this folder there are also a README.md and some folders containing preprocessed data for the additional resources that we will use with Ensembl-VEP. The script that is used to download all the databases resources mentioned in the exercises is in the course repo’s scripts folder.
If you are following along the course you should download and extract the data into your $HOME/project/ folder.
--cache option
By default VEP will install the cache and plugins into your home folder (~/.vep/) and if that is the case, then you can leave it like --cache. - If you are following along the course, you need to specify the cache location as --cache ${HOME}/project/data/.vep (which accesses /data/.vep via the symbolic link you created).
Exercise 1
Objective: Annotate a small VCF file and understand the basic output format.
Input Data:
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr7 55242465 rs121913529 A T . PASS .
chr17 7577121 rs28934578 C T . PASS .
chr13 32936646 rs28897743 C T . PASS .
chr17 7674894 rs28934574 G A . PASS .
chr13 32914438 rs80357090 T C . PASS .
chr17 41245466 rs28897686 G A . PASS .
chr3 178936091 rs63751289 G A . PASS .
chr10 89624230 rs61751507 G A . PASS .
chr11 108098576 rs386833395 G A . PASS .
chr4 55141055 rs1800744 A G . PASS .
chr1 11190510 rs6603781 G A . PASS .
chr7 140453136 rs121913530 A T . PASS .
chr5 112175770 rs121913343 C T . PASS .
chr17 43094464 rs28897672 A C . PASS .
Task: Annotate these variants using VEP and identify which genes are affected.
Hint: You can generate your own VCF (pasting the command below in your terminal) or you can download it from here Download the vcf file
This script creates a local VCF file, based on the VCF v4.2 standards.
echo -e '##fileformat=VCFv4.2
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr7\t55242465\trs121913529\tA\tT\t.\tPASS\t.
chr17\t7577121\trs28934578\tC\tT\t.\tPASS\t.
chr13\t32936646\trs28897743\tC\tT\t.\tPASS\t.
chr17\t7674894\trs28934574\tG\tA\t.\tPASS\t.
chr13\t32914438\trs80357090\tT\tC\t.\tPASS\t.
chr17\t41245466\trs28897686\tG\tA\t.\tPASS\t.
chr3\t178936091\trs63751289\tG\tA\t.\tPASS\t.
chr10\t89624230\trs61751507\tG\tA\t.\tPASS\t.
chr11\t108098576\trs386833395\tG\tA\t.\tPASS\t.
chr4\t55141055\trs1800744\tA\tG\t.\tPASS\t.
chr1\t11190510\trs6603781\tG\tA\t.\tPASS\t.
chr7\t140453136\trs121913530\tA\tT\t.\tPASS\t.
chr17\t43094464\trs28897672\tA\tC\t.\tPASS\t.
chr5\t112175770\trs121913343\tC\tT\t.\tPASS\t.' > cancer_variants.vcfVEP can be provided with a VCF file as an input (or vcf.gz + index) file or other formats, see documentation here
vep -i cancer_variants.vcf \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline --format vcf \
--symbol \
--force_overwrite \
--domains \
--sift b \
--polyphen b \
--output_file output.txt#!/usr/bin/env bash
# Make sure you have 14 variants
grep -v '^#' cancer_variants.vcf | wc -l # it should show 14
# Run VEP ~< 5min
vep -i cancer_variants.vcf \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--offline \
--force_overwrite \
--assembly GRCh38 \
--format vcf \
--symbol \
--sift b \
--polyphen b \
--pick \
--domains \
--output_file output.txtSome of the variants affect:
## VEP command-line: vep --assembly GRCh38 --cache /data/.vep --database 0 --force_overwrite --format vcf --input_file cancer_variants.vcf --offline --output_file output.txt --pick --symbol
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
rs121913529 chr7:55242465 T ENSG00000280890 ENST00000626532 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=-1;SYMBOL=ELDR;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:49511
rs28934578 chr17:7577121 T ENSG00000161960 ENST00000293831 Transcript missense_variant 597 580 194 D/Y Gac/Tac - IMPACT=MODERATE;STRAND=1;SYMBOL=EIF4A1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:3282
rs28897743 chr13:32936646 T - - - intergenic_variant - - - - - - IMPACT=MODIFIER
rs28934574 chr17:7674894 A ENSG00000141510 ENST00000269305 Transcript stop_gained 779 637 213 R/* Cga/Tga - IMPACT=HIGH;STRAND=-1;SYMBOL=TP53;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:11998
rs28897686 chr17:41245466 A ENSG00000241595 ENST00000334109 Transcript upstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=4221;STRAND=1;SYMBOL=KRTAP9-4;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18902
rs28934574 from this example is a TP53 stop gained mutation commonly found in cancer.
You can try to run without --pick and --everything see what happens.
Here you can see the html summary file generated:
After running the command you will also notice a output.txt_summary.html that will show some statistics on the annotation process and the results visually (see solution above). If you tried running --everything or -e this is a shortcut for the following options:
--sift b, --polyphen b, --ccds, --hgvs, --symbol, --numbers, --domains, --regulatory, --canonical, --protein, --biotype, --af, --af_1kg, --af_esp, --af_gnomade, --af_gnomadg, --max_af, --pubmed, --uniprot, --mane, --tsl, --appris, --variant_class, --gene_phenotype, --mirna
This an example of the highput from the TP53 mutation
| Field | Value | Description |
|---|---|---|
| Uploaded Variation | rs28934574 | dbSNP identifier |
| Location | chr17:7674894 | Genomic coordinates (GRCh38) |
| Allele | A | Alternative allele |
| Gene | ENSG00000141510 | Ensembl gene ID for TP53 |
| Transcript | ENST00000455263 | Ensembl transcript ID |
| Feature Type | Transcript | Type of genomic feature affected |
| Consequence | stop_gained | Creates premature stop codon |
| cDNA Position | 770 | Position in the cDNA |
| CDS Position | 637 | Position in the coding sequence |
| Protein Position | 213 | Position in the protein |
| Amino Acids | R/* | Change from Arginine to stop codon |
| Codons | Cga/Tga | Codon change |
| Impact | HIGH | Severity of variant impact |
| Symbol | TP53 | HGNC gene symbol |
| Biotype | protein_coding | Type of transcript |
| CCDS | CCDS45605.1 | Consensus CDS identifier |
| UniProt | P04637.305 | UniProt protein identifier |
| HGVSc | ENST00000455263.6:c.637C>T | Coding sequence change in HGVS notation |
| HGVSp | ENSP00000398846.2:p.Arg213Ter | Protein change in HGVS notation |
| Clinical Significance | pathogenic | Clinical interpretation |
| gnomAD AF | 1.368e-06 | Allele frequency in gnomAD |
| Maximum AF | 1.656e-05 | Maximum allele frequency observed |
You can have a look at the documentation to learn more about VEP’s many options here
SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping) are both prediction tools specifically designed to assess the functional impact of missense variants—those that result in amino acid substitutions
- SIFT: Based on sequence homology and physical properties of amino acids
- PolyPhen: Combines sequence-based features with structural information
Mutect2 results annotation
The previous section demonstrated how to run VEP and how different arguments significantly impact the output. Now, let’s apply this to a real dataset. We will take the table of somatic variants identified from the WES of chr6 and chr17 and use VEP to enrich it with functional annotations.
Extract from the resulting Mutect2 somatic call
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr17 745040 . G T . PASS
chr17 1492466 . C T . PASS
chr17 2364457 . T G . PASS
chr17 3433669 . G T . PASS
chr17 4545725 . T C . PASS
chr17 4934441 . G T . PASS
chr17 5007047 . C T . PASS
chr17 7560755 . G A . PASS
chr17 7588500 . G C . PASS
chr6 325357 . G A . PASS
chr6 325783 . G C . PASS
chr6 325901 . G A . PASS
chr6 1916336 . G A . PASS
chr6 2674999 . A C . PASS
chr6 2834013 . C G . PASS
chr6 2840548 . C T . PASS
chr6 3723708 . C G . PASS
Using VEP to annotate Mutect2 variants
The somatic variant call can be annotated with this command
PROJECT="${HOME}/project"
VARIANTDIR="${HOME}/project/variants"
VEPDBDIR="${PROJECT}/data/VEP_dbs"
vep -i "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline \
--format vcf \
--symbol \
--force_overwrite \
--everything \
--output_file "$VARIANTDIR"/chr6ch17_mutect2_annotated.txtWe are using as input a vcf.gz file (which has been indexed), and we are providing the --format vcf command, this will output by default a tsv file
Exercise 2
What if the output of this file should be formatted as vcf instead? You can check VEP’s documentation here
hint: use --vcf
With the option --output_file you can use vcf. After you can compress the VCF bgzip <output_file.vcf> and then you can index it tabix -p vcf <output_file.vcf.gz>. If you see VEP’s documentation there is an option to skip bgzip and instead incorporating it directly in VEP, which will be doing the compression for you --compress_output bgzip in the call. Afterwards you still need to index the file.
Interesting results: We will obtain a file with many annotations.
Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
chr6_1930220_GA/TT chr6:1930220-1930221 TT ENSG00000112699 ENST00000380815 Transcript stop_gained 836-837 653-654 218 F/* tTC/tAA - IMPACT=HIGH;STRAND=-1;SYMBOL=GMDS;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4369
| Impact | Gene | Location | Consequence | Change | Details |
|---|---|---|---|---|---|
| HIGH | GMDS | chr6:1930220 | stop_gained | F→* | Creates premature stop codon |
| Impact | Gene | Location | Consequence | Change | Details |
|---|---|---|---|---|---|
| HIGH | SERPINB1 | chr6:2834013 | splice_acceptor_variant | C>G | Affects splice site |
| MODERATE | OR1E2 | chr17:3433669 | missense_variant | P→H | Position 58 |
| MODERATE | PLD2 | chr17:4818362 | inframe_deletion | VDRILK→V | 15bp deletion |
| LOW | GEMIN4 | chr17:745040 | synonymous_variant | L→L | No amino acid change |
| MODIFIER | DUSP22 | chr6:325357 | intron_variant | G>A | Within intron |
Can you run the same command as before adding --fork 2, do you note any difference?
VARIANTDIR="${HOME}/project/variants"
vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline \
--format vcf \
--symbol \
--force_overwrite \
--everything \
--fork 2 \
--output_file "$VARIANTDIR/chr6ch17_mutect2_annotated.txt"Filtering VEP output
Our output file chr6ch17_mutect2_annotated.txt has about 160 lines for roughly 20 variants. This is expected: VEP writes one row per transcript per variant, so a single mutation affecting five transcripts produces five rows. Before drawing any conclusions we need to reduce this to the variants that are most likely to be functionally relevant.
filter_vep is a companion script bundled with VEP, so no separate installation is needed. It reads the tab-delimited output and applies logic expressions against any field in the Extra column. The operators is and match test for exact or partial string matches; combining conditions with or is permissive (keeps a variant if any criterion is met), while and is stringent (all criteria must hold).
VARIANTDIR="${HOME}/project/variants"
filter_vep \
-i "$VARIANTDIR"/chr6ch17_mutect2_annotated.txt \
-o "$VARIANTDIR"/chr6ch17_mutect2_annotated_filtered.txt \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) or (SIFT match deleterious or PolyPhen match probably_damaging)"in this line --filter "(IMPACT is HIGH or IMPACT is MODERATE) or (SIFT match deleterious or PolyPhen match probably_damaging)" consider that there is a difference in saying or or and as the latter is much more stringent and it would look only for variants whose prediction is in agreement. The latter (and) is generally preferred, because it reduces false positives. If you try do you see the difference?
Integrating external databases
The VEP cache already provides a broad annotation layer, but for clinical and cancer genomics work it is often necessary to cross-reference variants against curated external databases. VEP supports this through the --custom flag, which allows you to attach any tabix-indexed VCF or BED file to the annotation run.
In Exercise 3 we integrate ClinVar, a freely accessible archive maintained by NCBI that collects interpretations of variants submitted by clinical laboratories and research groups worldwide. ClinVar adds three fields to the output: the aggregate clinical significance (CLNSIG), the review status indicating the level of evidence (CLNREVSTAT), and the associated disease name (CLNDN). This allows us to cross-check our somatic calls against known germline pathogenic variants and assess whether any of our findings have prior clinical relevance.
Exercise 3
Objective: Add cancer-specific annotations using ClinVar and filter for potentially pathogenic variants.
Task: Enhance the previous analysis by:
- Adding ClinVar annotations
- Filtering for HIGH/MODERATE impact variants
- Creating a summary of mutation types
You can download from here
# Download latest ClinVar VCF
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
# Download ClinVar VCF
VARIANTDIR="${HOME}/project/variants"
RESOURCEDIR="${HOME}/project/data/VEP_dbs"
# Run VEP with cancer-specific annotations (ClinVar), this takes a while⏰
vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
--cache ${HOME}/project/data/.vep \
--assembly GRCh38 \
--offline \
--format vcf \
--force_overwrite \
--fork 2 \
--everything \
--custom "$RESOURCEDIR"/clinvar/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
--output_file "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar.txt"# Filter results
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar.txt" \
-o "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) or \
(SIFT match deleterious or PolyPhen match probably_damaging) or \
(ClinVar_CLNSIG match pathogenic)"# Generate summary (using simple grep/awk commands)
echo "=== Mutation Type Summary ===" > "$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt
grep -v "#" "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt" | cut -f7 | sort | uniq -c >> "$VARIANTDIR/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt"# Summary for ClinVar clinical significance
echo "=== ClinVar Clinical Significance ===" >> "$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt
grep -v "^#" "$VARIANTDIR"/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt | \
grep -oP 'ClinVar_CLNSIG=[^;|]+' | cut -d'=' -f2 | sort | uniq -c | sort -rn >> \
"$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txtThe possible values for ClinVar_CLNSIG include:
- Pathogenic - Strong evidence the variant causes disease
- Likely_pathogenic - Evidence suggests it is probably disease-causing
- Uncertain_significance (VUS) - Not enough evidence to classify either way
- Likely_benign - Evidence suggests it is probably not disease-causing
- Benign - Strong evidence the variant is not disease-causing
- Conflicting_interpretations_of_pathogenicity - Different submitters disagree
- Not_provided - No interpretation was provided
- Drug_response - Affects response to medications
- Other - Other interpretations
What happens if you increase the stringency of your filter_vep step?
The clinvar data provided us with more information coming from curated databases, and for some variants we can have multiple levels of consensus (however this is not a guarantee of pathogenicity), but there are many other additional databases that one can try. However not all the databases have the same licensing, therefore each of the integrated datasets should be checked. For example COSMIC databases can be accessed only after registration, onkoKB as well.
Please do not forget to cite the data that you use and it is important to keep note of the version used. This is also an essential information for your future you!
Advanced Topics
Using VEP Plugins
VEP’s functionality can be extended through plugins. Here are some essential plugins for cancer analysis:
- REVEL: Rare Exome Variant Ensemble Learner
REVEL paper: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants
REVEL
You can download here the data (~6.5GB): https://sites.google.com/site/revelgenomics/downloads or https://zenodo.org/records/7072866
unzip revel-v1.3_all_chromosomes.zip
cat revel_with_transcript_ids | tr "," "\t" > tabbed_revel.tsv
sed '1s/.*/#&/' tabbed_revel.tsv > new_tabbed_revel.tsv
bgzip new_tabbed_revel.tsvPrepare for GRCh38
zcat new_tabbed_revel.tsv.gz | head -n1 > h
zgrep -h -v ^#chr new_tabbed_revel.tsv.gz | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat h - | bgzip -c > new_tabbed_revel_grch38.tsv.gz
tabix -f -s 1 -b 3 -e 3 new_tabbed_revel_grch38.tsv.gzUsage:
--plugin REVEL,file=/path/to/revel/data.tsv.gz- AlphaMissense: Deep learning-based predictor
AlphaMissense’s Paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense
AlphaMissense was archived on GitHub around May 2025, but it is still a gold standard, with its 216 millions predicted variants effect
AlphaMissense
Prepare for GRCh38
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gzRun it with VEP
--plugin AlphaMissense,file=/full/path/to/file.tsv.gz- ClinVar ClinVar: A curated database with mutation->disease data
ClinVar Custom Annotation: ClinVar provides curated information about relationships between variants and human health.
Download ClinVar for GRCh38:
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbiUsage with VEP:
--custom /path/to/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDNKey fields extracted:
CLNSIG: Clinical significance (Pathogenic, Benign, VUS, etc.)CLNREVSTAT: Review status (confidence level)CLNDN: Condition/disease name
Clinical Significance Values:
- Pathogenic - Strong evidence the variant causes disease
- Likely_pathogenic - Evidence suggests disease-causing
- Uncertain_significance (VUS) - Insufficient evidence
- Likely_benign - Evidence suggests not disease-causing
- Benign - Strong evidence variant is not disease-causing
All three resources (REVEL, AlphaMissense, and ClinVar) can be downloaded and prepared automatically using our setup script: Download the setup script
bash 01_download_all_VEP_dbs.shThis will create the following structure:
/data/VEP_dbs/
├── clinvar/
│ ├── clinvar.vcf.gz
│ └── clinvar.vcf.gz.tbi
├── alphamissense/
│ ├── AlphaMissense_hg38.tsv.gz
│ └── AlphaMissense_hg38.tsv.gz.tbi
└── revel/
├── new_tabbed_revel_grch38.tsv.gz
└── new_tabbed_revel_grch38.tsv.gz.tbi
We have seen how to run VEP and how to filter the output. Now we can make another call on the Mutect2 output to annotate the variants using a combination of some plugins and some custom databases.
- REVEL: An ensemble method that integrates multiple prediction tools
- AlphaMissense: Uses deep learning trained on evolutionary data
- ClinVar: Provides clinical interpretations from expert curation
Complex VEP annotation
Here we learn how to use additional data to include additional information to the annotation
EXTRA: Exercise 4
Objective: Learn how to comprehensively annotate cancer variants using VEP’s advanced features, plugins, and filtering capabilities to identify potentially pathogenic variants.
Task: Enhance the previous analysis by:
- Adding ClinVar annotations
- Adding REVEL annotations
- Adding AlphaMissense
- Adding population frequencies
--af_gnomade(gnomAD data is included in VEP cache) - Adding plugins (
--plugin SpliceRegion,--plugin AlphaMissense,--plugin REVEL) - Filtering for HIGH/MODERATE impact variants
VARIANTDIR="${HOME}/project/variants"
RESOURCEDIR="${HOME}/project/data/VEP_dbs"
vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
--cache ${HOME}/project/data/.vep \
--assembly GRCh38 \
--offline \
--format vcf \
--fork 2 \
--everything \
--af_gnomade \
--force_overwrite \
--plugin SpliceRegion \
--plugin REVEL,file="$RESOURCEDIR/revel/new_tabbed_revel_grch38.tsv.gz" \
--plugin AlphaMissense,file="$RESOURCEDIR/alphamissense/AlphaMissense_hg38.tsv.gz" \
--custom "$RESOURCEDIR/clinvar/clinvar.vcf.gz",ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
--output_file "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt"
# Filter the data -> Moderate (discovery)
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt" \
-o "$VARIANTDIR/moderate_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) and \
(REVEL >= 0.5 or \
am_class = 'likely_pathogenic' or \
SIFT = 'deleterious' or \
PolyPhen = 'probably_damaging')"
# How many events do you get?
grep -v "^#" "$VARIANTDIR"/moderate_filtered.txt | wc -l #19
# Filter the data -> Stringent (high confidence)
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt" \
-o "$VARIANTDIR/stringent_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) and \
(REVEL >= 0.75 or \
am_class = 'likely_pathogenic' or \
(SIFT = 'deleterious' and PolyPhen = 'probably_damaging'))"
# How many events do you get?
grep -v "^#" "$VARIANTDIR"/stringent_filtered.txt | wc -l #14You could also include and (gnomADe_AF < 0.001 or not gnomADe_AF) to filter likely germline variants. For somatic analysis, a threshold of 0.1% (0.001) is standard — variants above this frequency in gnomAD are almost certainly germline. The less stringent threshold of 1% (0.01) is more appropriate for germline pathogenicity assessment.
exact vs overlap in VEP –custom
- exact: Only matches if all coordinates are identical
- overlap: Matches if regions overlap
For example you can use overlap for indels, exact for strict SNV matching.
A glimpse on the results
Variant Details
chr6_10801908_C/G chr6:10801908 G ENSG00000111837 ENST00000313243 Transcript missense_variant 1198 815 272 R/P cGa/cCa COSV57568710 IMPACT=MODERATE;STRAND=-1;VARIANT_CLASS=SNV;SYMBOL=MAK;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:6816;BIOTYPE=protein_coding;TSL=5;APPRIS=A1;CCDS=CCDS4516.1;ENSP=ENSP00000313021;SWISSPROT=P20794.202;TREMBL=A0A140VK28.35;UNIPARC=UPI0000001BCD;UNIPROT_ISOFORM=P20794-1;GENE_PHENO=1;SIFT=deleterious(0);PolyPhen=probably_damaging(1);EXON=8/14;DOMAINS=Gene3D:1.10.510.10,AFDB-ENSP_mappings:AF-P20794-F1,Pfam:PF00069,PROSITE_profiles:PS50011,PANTHER:PTHR24055,SMART:SM00220,Superfamily:SSF56112,CDD:cd07830;HGVSc=ENST00000313243.6:c.815G>C;HGVSp=ENSP00000313021.2:p.Arg272Pro;SOMATIC=1;PHENO=1;REVEL=0.884;am_class=likely_pathogenic;am_pathogenicity=0.9946
chr6_10801908_C/G chr6:10801908 G ENSG00000111837 ENST00000354489 Transcript missense_variant 1081 815 272 R/P cGa/cCa COSV57568710 IMPACT=MODERATE;STRAND=-1;VARIANT_CLASS=SNV;SYMBOL=MAK;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:6816;BIOTYPE=protein_coding;CANONICAL=YES;MANE=MANE_Select;MANE_SELECT=NM_001242957.3;TSL=5;APPRIS=P4;CCDS=CCDS75399.1;ENSP=ENSP00000346484;SWISSPROT=P20794.202;UNIPARC=UPI000217CBBA;UNIPROT_ISOFORM=P20794-2;GENE_PHENO=1;SIFT=deleterious(0);PolyPhen=probably_damaging(0.999);EXON=8/15;DOMAINS=Gene3D:1.10.510.10,AFDB-ENSP_mappings:AF-P20794-F1,Pfam:PF00069,PROSITE_profiles:PS50011,PANTHER:PTHR24055,SMART:SM00220,Superfamily:SSF56112,CDD:cd07830;HGVSc=ENST00000354489.7:c.815G>C;HGVSp=ENSP00000346484.3:p.Arg272Pro;SOMATIC=1;PHENO=1;REVEL=0.884;am_class=likely_pathogenic;am_pathogenicity=0.9946
| Category | Information | Description |
|---|---|---|
| Basic Information | ||
| Genomic Location | chr6:10801908 | GRCh38 coordinates |
| Variant Type | SNV (C>G) | Single nucleotide variant |
| Gene | MAK | Male Germ Cell-Associated Kinase |
| Impact | MODERATE | VEP impact classification |
Transcript Effects
Transcript 1 (ENST00000313243)
| Feature | Value | Note |
|---|---|---|
| Consequence | missense_variant | Changes amino acid |
| Protein Change | p.Arg272Pro | Arginine to Proline |
| Transcript Support | TSL=5, APPRIS=A1 | Low transcript support level |
| SIFT | deleterious(0) | Maximum deleteriousness score |
| PolyPhen | probably_damaging(1) | Maximum damaging score |
| Affected Domain | PF00069 (Protein kinase domain) | Functional domain impact |
Transcript 2 (ENST00000354489)
| Feature | Value | Note |
|---|---|---|
| Consequence | missense_variant | Changes amino acid |
| Protein Change | p.Arg272Pro | Arginine to Proline |
| Transcript Status | CANONICAL, MANE_Select | Principal transcript |
| SIFT | deleterious(0) | Maximum deleteriousness score |
| PolyPhen | probably_damaging(0.999) | Near maximum damaging score |
| Affected Domain | PF00069 (Protein kinase domain) | Functional domain impact |
Prediction Tools
- REVEL score: 0.884 (calibrated thresholds from Pejaver et al. 2022, AJHG: >=0.773 supports Pathogenic/Moderate, >=0.932 supports Pathogenic/Strong)
- AlphaMissense:
- Class: likely_pathogenic
- Pathogenicity score: 0.9946 (very high confidence)
- SpliceRegion: No splice site effects detected
- ClinVar: No clinical data found in ClinVar database
- SIFT: SIFT=deleterious(0)
- PolyPhen: PolyPhen=probably_damaging(1)
- IMPACT: MODERATE
Additional Annotations
| Feature | Value | Description |
|---|---|---|
| COSMIC ID | COSV57568710 | Catalogued in COSMIC database |
| Somatic Status | SOMATIC=1 | Confirmed somatic variant |
| Phenotype | PHENO=1 | Associated with phenotype |
| UniProt | P20794.202 | Protein database reference |
At this point we obtained a good annotated set of somatic variants. This paves the way for further analysis. For example you can add more databases to your annotation call, such as COSMIC, cancerhotspots, onkoKB etc. You can also try other plugins, depending on your biological question(s). Those can add additional data and possibly additional filtering options.
- What happens if you turn it to
exactinstead ofoverlap? - Add other databases
- Try additional plugins
- Find interesting genes that could be drivers in your samples
- Consider population frequency of the mutation(s) (common variants are less likely to be pathogenic)
- Consider possible drug-gene interactions
- Check in the literature
- Visualize the results
CNV annotation is more challenging than SNV annotation because:
- Tools like REVEL/AlphaMissense are designed for SNVs
- CNV breakpoints can be imprecise
- Functional impact affects multiple genes
For those interested in CNV annotation, expanded materials are available in the advanced resources section.
SpliceAI (Jaganathan et al., Cell 2019): A deep learning tool that predicts cryptic splice effects up to 50 bp from exon boundaries — a category of splice-altering variants that the SpliceRegion plugin misses entirely. Precomputed scores are available as a VEP plugin (~30 GB). Given that TNBC shows high rates of splice-disrupting mutations in tumour suppressors, SpliceAI is a valuable addition for clinical-grade annotation pipelines. See the VEP plugin documentation.
Note on SIFT/PolyPhen in somatic analysis: Both SIFT and PolyPhen were trained and validated on germline variants. For somatic missense mutations in cancer, ensemble methods like REVEL and AlphaMissense are better calibrated. Consider SIFT/PolyPhen as supporting evidence rather than primary predictors when interpreting cancer variants.
Next steps:
- VAF Analysis — Explore the Variant Allele Frequency distribution of your somatic calls. Connect what Mutect2 reported to tumor purity, clonality, and the biology discussed in Day 2.
- R Analysis & Visualization — Visualize and explore the VEP annotation results in R.