Day 3: Variant Annotation with VEP

Presentation

Ensembl VEP

VEP is a powerful and flexible tool for annotating, analyzing, and prioritizing genomic variants. It is particularly useful in cancer genomics, where we need to understand the potential functional impact of somatic mutations for research and clinical decisions.

With VEP we can start to address questions such as:

How damaging is a specific somatic mutation according to reference databases?
Is the mutation already known in this particular cancer type?
What is the predicted functional impact of this mutation in the tumour context?
Are there known associations between this mutation and specific drugs or drug classes?
How common the mutation is in various populations?

Why do we need a tool like VEP?

In a typical cancer exome or panel we may obtain thousands of somatic SNVs and indels per sample. These variants:

Are not analyzed one by one by hand.
Need to be standardized (same genome build, same transcript model, consistent consequence terms).
Need to be prioritized so that we can focus on potentially actionable or biologically relevant changes.

A tool like VEP allows us to:

Run high‑throughput annotation of all variants in a VCF.
Attach consistent functional predictions and database links (ClinVar, COSMIC, etc.).
Reduce a long list of raw calls to a shorter list of interpretable candidate variants for downstream review.

Somatic variant classification in clinical practice

In clinical oncology, somatic variants are classified using the AMP/ASCO/CAP framework (Li et al., 2017) — a four-tier system based on their clinical relevance:

Tier	Category	Description
I	Strong clinical significance	FDA-approved therapy or included in professional guidelines for this tumor type
II	Potential clinical significance	FDA-approved therapy for a different tumor type, or investigational (clinical trials, small studies)
III	Unknown clinical significance	Not observed in cancer databases, or no convincing published evidence of cancer association
IV	Benign or likely benign	Observed at significant frequency in population databases (e.g. gnomAD), with no published cancer association

This is different from the ACMG/AMP 5-tier system used for germline variants. In this course, the annotation and filtering steps you will perform (impact prediction, ClinVar lookup, population frequency filtering, REVEL/AlphaMissense scores) are the building blocks for this kind of clinical tiering — though formal classification requires additional clinical context beyond bioinformatics alone.

Reference: Li MM, et al. Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer. J Mol Diagn (2017) 19(1):4-23. doi:10.1016/j.jmoldx.2016.10.002

Key Concepts

Install VEP

VEP Version

This course uses VEP v115 (2025). Some features and default behaviors may differ in other versions. Always check the release notes for your version.

Setup

All the environment is prepared for you, you just need to create a symbolic link to access the data from your project directory:

cd ~/project
ln -s /data .

If you want to install VEP locally or on your compute cluster, you can follow the instructions below.

# Installing Conda #https://docs.anaconda.com/miniconda/
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all

# Using Conda
conda create -n vep samtools python=3.10 ensembl-vep=115.0
# conda install -c bioconda ensembl-vep=115.0 # if installing in current env

# Activate VEP's env
conda activate vep

# PLEASE DO NOT RUN During the course, only if you follow locally (~60 min, > 25GB)
# Install all the data locally
#vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT (this command will download in $HOME/.vep the cache folder)
#vep_install -a cf -s homo_sapiens -y GRCh38 -c $HOME/project/data/vep --CONVERT (this specify the cache folder)

# Plugins installation
# vep_install -a p --PLUGINS all

Instead of conda you could try installing mamba. If you are on Unix-based system this is simple mamba. . In case you are following locally, you can environment with mamba activate ngs-tools.

Automated download script section below.

Transcript Selection

VEP can report consequences for all transcripts overlapping a variant. However, this often results in multiple annotations per variant. Several approaches are available to handle this:

--pick: Selects one consequence per variant based on a predefined hierarchy. Since VEP v106+, the hierarchy prioritises MANE Select > MANE Plus Clinical > Canonical > TSL 1 > APPRIS principal isoforms > protein coding biotype > severity. See the VEP documentation for the full selection order.
--pick_allele: Selects one consequence per variant allele
--pick_allele_gene: Selects one consequence per variant allele per gene
--per_gene: Selects one consequence per gene
--flag_pick: Flags the selected consequence while retaining others

MANE Select Transcripts

MANE Select (Matched Annotation from NCBI and EMBL-EBI) transcripts represent the consensus between Ensembl/GENCODE and RefSeq for each protein-coding gene. Since VEP v106, MANE Select is the top priority in --pick selection for human data. When reporting variants clinically, always prefer the MANE Select transcript over older “canonical” designations.

Consequence Priority

VEP ranks consequences by severity: Cancer somatic mut Figure 1: Image from the Ensembl VEP website.

Consequence	Impact	Description
Transcript ablation	HIGH	Complete deletion of a gene transcript
Splice acceptor variant	HIGH	Variant at the 3’ end of an intron that disrupts RNA splicing
Splice donor variant	HIGH	Variant at the 5’ end of an intron that disrupts RNA splicing
Stop gained	HIGH	Variant that introduces a premature stop codon
Frameshift variant	HIGH	Insertion/deletion causing reading frame disruption
Stop lost	HIGH	Variant removes stop codon, extending protein length
Start lost	HIGH	Loss of start codon prevents proper translation initiation
Transcript amplification	HIGH	Increased copy number of transcript regions
Inframe insertion	MODERATE	Insertion of bases in multiples of 3, no frameshift
Inframe deletion	MODERATE	Deletion of bases in multiples of 3, no frameshift
Missense variant	MODERATE	Single nucleotide change causing amino acid substitution
Protein altering variant	MODERATE	Variant changes protein sequence

Practical Exercises

In case that you do not have the Mutect2 call and/or the CNVkit call, you can obtain those from here. In this folder there are also a README.md and some folders containing preprocessed data for the additional resources that we will use with Ensembl-VEP. The script that is used to download all the databases resources mentioned in the exercises is in the course repo’s scripts folder.

If you are following along the course you should download and extract the data into your $HOME/project/ folder.

--cache option

By default VEP will install the cache and plugins into your home folder (~/.vep/) and if that is the case, then you can leave it like --cache. - If you are following along the course, you need to specify the cache location as --cache ${HOME}/project/data/.vep (which accesses /data/.vep via the symbolic link you created).

Exercise 1

Basic VEP Analysis

Objective: Annotate a small VCF file and understand the basic output format.

Input Data:

##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
chr7    55242465    rs121913529    A    T    .    PASS    .
chr17    7577121    rs28934578    C    T    .    PASS    .
chr13    32936646    rs28897743    C    T    .    PASS    .
chr17    7674894    rs28934574    G    A    .    PASS    .
chr13    32914438    rs80357090    T    C    .    PASS    .
chr17    41245466    rs28897686    G    A    .    PASS    .
chr3    178936091    rs63751289    G    A    .    PASS    .
chr10    89624230    rs61751507    G    A    .    PASS    .
chr11    108098576    rs386833395    G    A    .    PASS    .
chr4    55141055    rs1800744    A    G    .    PASS    .
chr1    11190510    rs6603781    G    A    .    PASS    .
chr7    140453136    rs121913530    A    T    .    PASS    .
chr5    112175770    rs121913343    C    T    .    PASS    .
chr17   43094464    rs28897672    A   C   .   PASS    .

Task: Annotate these variants using VEP and identify which genes are affected.

Hint: You can generate your own VCF (pasting the command below in your terminal) or you can download it from here Download the vcf file

Example: make your own VCF from terminal

This script creates a local VCF file, based on the VCF v4.2 standards.

echo -e '##fileformat=VCFv4.2
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr7\t55242465\trs121913529\tA\tT\t.\tPASS\t.
chr17\t7577121\trs28934578\tC\tT\t.\tPASS\t.
chr13\t32936646\trs28897743\tC\tT\t.\tPASS\t.
chr17\t7674894\trs28934574\tG\tA\t.\tPASS\t.
chr13\t32914438\trs80357090\tT\tC\t.\tPASS\t.
chr17\t41245466\trs28897686\tG\tA\t.\tPASS\t.
chr3\t178936091\trs63751289\tG\tA\t.\tPASS\t.
chr10\t89624230\trs61751507\tG\tA\t.\tPASS\t.
chr11\t108098576\trs386833395\tG\tA\t.\tPASS\t.
chr4\t55141055\trs1800744\tA\tG\t.\tPASS\t.
chr1\t11190510\trs6603781\tG\tA\t.\tPASS\t.
chr7\t140453136\trs121913530\tA\tT\t.\tPASS\t.
chr17\t43094464\trs28897672\tA\tC\t.\tPASS\t.
chr5\t112175770\trs121913343\tC\tT\t.\tPASS\t.' > cancer_variants.vcf

Command

VEP can be provided with a VCF file as an input (or vcf.gz + index) file or other formats, see documentation here

vep -i cancer_variants.vcf \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline --format vcf \
--symbol \
--force_overwrite \
--domains \
--sift b \
--polyphen b \
--output_file output.txt

Solution

#!/usr/bin/env bash

# Make sure you have 14 variants
grep -v '^#' cancer_variants.vcf | wc -l  # it should show 14

# Run VEP ~< 5min
vep -i cancer_variants.vcf \
    --cache ${HOME}/project/data/.vep \
    --species homo_sapiens \
    --offline \
    --force_overwrite \
    --assembly GRCh38 \
    --format vcf \
    --symbol \
    --sift b \
    --polyphen b \
    --pick \
    --domains \
    --output_file output.txt

Some of the variants affect:

## VEP command-line: vep --assembly GRCh38 --cache /data/.vep --database 0 --force_overwrite --format vcf --input_file cancer_variants.vcf --offline --output_file output.txt --pick --symbol
#Uploaded_variation Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
rs121913529 chr7:55242465   T   ENSG00000280890 ENST00000626532 Transcript  intron_variant,non_coding_transcript_variant    -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=ELDR;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:49511
rs28934578  chr17:7577121   T   ENSG00000161960 ENST00000293831 Transcript  missense_variant    597 580 194 D/Y Gac/Tac -   IMPACT=MODERATE;STRAND=1;SYMBOL=EIF4A1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:3282
rs28897743  chr13:32936646  T   -   -   -   intergenic_variant  -   -   -   -   -   -   IMPACT=MODIFIER
rs28934574  chr17:7674894   A   ENSG00000141510 ENST00000269305 Transcript  stop_gained 779 637 213 R/* Cga/Tga -   IMPACT=HIGH;STRAND=-1;SYMBOL=TP53;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:11998
rs28897686  chr17:41245466  A   ENSG00000241595 ENST00000334109 Transcript  upstream_gene_variant   -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=4221;STRAND=1;SYMBOL=KRTAP9-4;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18902

rs28934574 from this example is a TP53 stop gained mutation commonly found in cancer.

You can try to run without --pick and --everything see what happens.

Here you can see the html summary file generated:

VEP report

After running the command you will also notice a output.txt_summary.html that will show some statistics on the annotation process and the results visually (see solution above). If you tried running --everything or -e this is a shortcut for the following options:

-e or --everything

--sift b, --polyphen b, --ccds, --hgvs, --symbol, --numbers, --domains, --regulatory, --canonical, --protein, --biotype, --af, --af_1kg, --af_esp, --af_gnomade, --af_gnomadg, --max_af, --pubmed, --uniprot, --mane, --tsl, --appris, --variant_class, --gene_phenotype, --mirna

This an example of the highput from the TP53 mutation

Field	Value	Description
Uploaded Variation	rs28934574	dbSNP identifier
Location	chr17:7674894	Genomic coordinates (GRCh38)
Allele	A	Alternative allele
Gene	ENSG00000141510	Ensembl gene ID for TP53
Transcript	ENST00000455263	Ensembl transcript ID
Feature Type	Transcript	Type of genomic feature affected
Consequence	stop_gained	Creates premature stop codon
cDNA Position	770	Position in the cDNA
CDS Position	637	Position in the coding sequence
Protein Position	213	Position in the protein
Amino Acids	R/*	Change from Arginine to stop codon
Codons	Cga/Tga	Codon change
Impact	HIGH	Severity of variant impact
Symbol	TP53	HGNC gene symbol
Biotype	protein_coding	Type of transcript
CCDS	CCDS45605.1	Consensus CDS identifier
UniProt	P04637.305	UniProt protein identifier
HGVSc	ENST00000455263.6:c.637C>T	Coding sequence change in HGVS notation
HGVSp	ENSP00000398846.2:p.Arg213Ter	Protein change in HGVS notation
Clinical Significance	pathogenic	Clinical interpretation
gnomAD AF	1.368e-06	Allele frequency in gnomAD
Maximum AF	1.656e-05	Maximum allele frequency observed

You can have a look at the documentation to learn more about VEP’s many options here

Damage prediction tools

SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping) are both prediction tools specifically designed to assess the functional impact of missense variants—those that result in amino acid substitutions

SIFT: Based on sequence homology and physical properties of amino acids
PolyPhen: Combines sequence-based features with structural information

Mutect2 results annotation

The previous section demonstrated how to run VEP and how different arguments significantly impact the output. Now, let’s apply this to a real dataset. We will take the table of somatic variants identified from the WES of chr6 and chr17 and use VEP to enrich it with functional annotations.

Extract from the resulting Mutect2 somatic call

##fileformat=VCFv4.2
#CHROM  POS     ID      REF  ALT   QUAL  FILTER  INFO
chr17   745040    .   G   T   .   PASS
chr17   1492466   .   C   T   .   PASS
chr17   2364457   .   T   G   .   PASS
chr17   3433669   .   G   T   .   PASS
chr17   4545725   .   T   C   .   PASS
chr17   4934441   .   G   T   .   PASS
chr17   5007047   .   C   T   .   PASS
chr17   7560755   .   G   A   .   PASS
chr17   7588500   .   G   C   .   PASS
chr6    325357    .   G   A   .   PASS
chr6    325783    .   G   C   .   PASS
chr6    325901    .   G   A   .   PASS
chr6    1916336   .   G   A   .   PASS
chr6    2674999   .   A   C   .   PASS
chr6    2834013   .   C   G   .   PASS
chr6    2840548   .   C   T   .   PASS
chr6    3723708   .   C   G   .   PASS

Using VEP to annotate Mutect2 variants

The somatic variant call can be annotated with this command

PROJECT="${HOME}/project"
VARIANTDIR="${HOME}/project/variants"
VEPDBDIR="${PROJECT}/data/VEP_dbs"

vep -i "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline \
--format vcf \
--symbol \
--force_overwrite \
--everything \
--output_file "$VARIANTDIR"/chr6ch17_mutect2_annotated.txt

Note

We are using as input a vcf.gz file (which has been indexed), and we are providing the --format vcf command, this will output by default a tsv file

Exercise 2

Output to VCF

What if the output of this file should be formatted as vcf instead? You can check VEP’s documentation here
hint: use --vcf

Solution

With the option --output_file you can use vcf. After you can compress the VCF bgzip <output_file.vcf> and then you can index it tabix -p vcf <output_file.vcf.gz>. If you see VEP’s documentation there is an option to skip bgzip and instead incorporating it directly in VEP, which will be doing the compression for you --compress_output bgzip in the call. Afterwards you still need to index the file.

Interesting results: We will obtain a file with many annotations.

Uploaded_variation  Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
chr6_1930220_GA/TT  chr6:1930220-1930221    TT  ENSG00000112699 ENST00000380815 Transcript  stop_gained 836-837 653-654 218 F/* tTC/tAA -   IMPACT=HIGH;STRAND=-1;SYMBOL=GMDS;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4369

Impact	Gene	Location	Consequence	Change	Details
HIGH	GMDS	chr6:1930220	stop_gained	F→*	Creates premature stop codon

And other interesting variants

Impact	Gene	Location	Consequence	Change	Details
HIGH	SERPINB1	chr6:2834013	splice_acceptor_variant	C>G	Affects splice site
MODERATE	OR1E2	chr17:3433669	missense_variant	P→H	Position 58
MODERATE	PLD2	chr17:4818362	inframe_deletion	VDRILK→V	15bp deletion
LOW	GEMIN4	chr17:745040	synonymous_variant	L→L	No amino acid change
MODIFIER	DUSP22	chr6:325357	intron_variant	G>A	Within intron

Performance Trick

Can you run the same command as before adding --fork 2, do you note any difference?

VARIANTDIR="${HOME}/project/variants"

vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
--cache ${HOME}/project/data/.vep \
--species homo_sapiens \
--assembly GRCh38 \
--offline \
--format vcf \
--symbol \
--force_overwrite \
--everything \
--fork 2 \
--output_file "$VARIANTDIR/chr6ch17_mutect2_annotated.txt"

Filtering VEP output

Our output file chr6ch17_mutect2_annotated.txt has about 160 lines for roughly 20 variants. This is expected: VEP writes one row per transcript per variant, so a single mutation affecting five transcripts produces five rows. Before drawing any conclusions we need to reduce this to the variants that are most likely to be functionally relevant.

filter_vep is a companion script bundled with VEP, so no separate installation is needed. It reads the tab-delimited output and applies logic expressions against any field in the Extra column. The operators is and match test for exact or partial string matches; combining conditions with or is permissive (keeps a variant if any criterion is met), while and is stringent (all criteria must hold).

VARIANTDIR="${HOME}/project/variants"

filter_vep \
-i "$VARIANTDIR"/chr6ch17_mutect2_annotated.txt \
-o "$VARIANTDIR"/chr6ch17_mutect2_annotated_filtered.txt \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) or (SIFT match deleterious or PolyPhen match probably_damaging)"

Caution

in this line --filter "(IMPACT is HIGH or IMPACT is MODERATE) or (SIFT match deleterious or PolyPhen match probably_damaging)" consider that there is a difference in saying or or and as the latter is much more stringent and it would look only for variants whose prediction is in agreement. The latter (and) is generally preferred, because it reduces false positives. If you try do you see the difference?

Integrating external databases

The VEP cache already provides a broad annotation layer, but for clinical and cancer genomics work it is often necessary to cross-reference variants against curated external databases. VEP supports this through the --custom flag, which allows you to attach any tabix-indexed VCF or BED file to the annotation run.

In Exercise 3 we integrate ClinVar, a freely accessible archive maintained by NCBI that collects interpretations of variants submitted by clinical laboratories and research groups worldwide. ClinVar adds three fields to the output: the aggregate clinical significance (CLNSIG), the review status indicating the level of evidence (CLNREVSTAT), and the associated disease name (CLNDN). This allows us to cross-check our somatic calls against known germline pathogenic variants and assess whether any of our findings have prior clinical relevance.

Exercise 3

Cancer-Specific Annotation

Objective: Add cancer-specific annotations using ClinVar and filter for potentially pathogenic variants.

Task: Enhance the previous analysis by:

Adding ClinVar annotations
Filtering for HIGH/MODERATE impact variants
Creating a summary of mutation types

You can download from here

# Download latest ClinVar VCF
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi

Solution

# Download ClinVar VCF

VARIANTDIR="${HOME}/project/variants"
RESOURCEDIR="${HOME}/project/data/VEP_dbs"

# Run VEP with cancer-specific annotations (ClinVar), this takes a while⏰
vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
    --cache ${HOME}/project/data/.vep \
    --assembly GRCh38 \
    --offline \
    --format vcf \
    --force_overwrite \
    --fork 2 \
    --everything \
    --custom "$RESOURCEDIR"/clinvar/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
    --output_file "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar.txt"

# Filter results
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar.txt" \
-o "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) or \
(SIFT match deleterious or PolyPhen match probably_damaging) or \
(ClinVar_CLNSIG match pathogenic)"

# Generate summary (using simple grep/awk commands)
echo "=== Mutation Type Summary ===" > "$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt

grep -v "#" "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt" | cut -f7 | sort | uniq -c >> "$VARIANTDIR/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt"

# Summary for ClinVar clinical significance
echo "=== ClinVar Clinical Significance ===" >> "$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt

grep -v "^#" "$VARIANTDIR"/chr6ch17_mutect2_annotated_with_clinVar_filtered.txt | \
  grep -oP 'ClinVar_CLNSIG=[^;|]+' | cut -d'=' -f2 | sort | uniq -c | sort -rn >> \
  "$VARIANTDIR"/summary_chr6ch17_mutect2_annotated_with_clinVar_filtered.txt

The possible values for ClinVar_CLNSIG include:

Pathogenic - Strong evidence the variant causes disease
Likely_pathogenic - Evidence suggests it is probably disease-causing
Uncertain_significance (VUS) - Not enough evidence to classify either way
Likely_benign - Evidence suggests it is probably not disease-causing
Benign - Strong evidence the variant is not disease-causing
Conflicting_interpretations_of_pathogenicity - Different submitters disagree
Not_provided - No interpretation was provided
Drug_response - Affects response to medications
Other - Other interpretations

What happens if you increase the stringency of your filter_vep step?

The clinvar data provided us with more information coming from curated databases, and for some variants we can have multiple levels of consensus (however this is not a guarantee of pathogenicity), but there are many other additional databases that one can try. However not all the databases have the same licensing, therefore each of the integrated datasets should be checked. For example COSMIC databases can be accessed only after registration, onkoKB as well.

Note

Please do not forget to cite the data that you use and it is important to keep note of the version used. This is also an essential information for your future you!

Advanced Topics

Using VEP Plugins

VEP’s functionality can be extended through plugins. Here are some essential plugins for cancer analysis:

REVEL: Rare Exome Variant Ensemble Learner

REVEL paper: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants

REVEL

You can download here the data (~6.5GB): https://sites.google.com/site/revelgenomics/downloads or https://zenodo.org/records/7072866

unzip revel-v1.3_all_chromosomes.zip
cat revel_with_transcript_ids | tr "," "\t" > tabbed_revel.tsv
sed '1s/.*/#&/' tabbed_revel.tsv > new_tabbed_revel.tsv
bgzip new_tabbed_revel.tsv

Prepare for GRCh38

zcat new_tabbed_revel.tsv.gz | head -n1 > h
zgrep -h -v ^#chr new_tabbed_revel.tsv.gz | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat h - | bgzip -c > new_tabbed_revel_grch38.tsv.gz
tabix -f -s 1 -b 3 -e 3 new_tabbed_revel_grch38.tsv.gz

Usage:

--plugin REVEL,file=/path/to/revel/data.tsv.gz

AlphaMissense: Deep learning-based predictor

AlphaMissense’s Paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense

AlphaMissense was archived on GitHub around May 2025, but it is still a gold standard, with its 216 millions predicted variants effect

AlphaMissense

Download link

Prepare for GRCh38

wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz

Run it with VEP

--plugin AlphaMissense,file=/full/path/to/file.tsv.gz

ClinVar ClinVar: A curated database with mutation->disease data

ClinVar Custom Annotation: ClinVar provides curated information about relationships between variants and human health.

Clinvar

Download ClinVar for GRCh38:

wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi

Usage with VEP:

--custom /path/to/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN

Key fields extracted:

CLNSIG: Clinical significance (Pathogenic, Benign, VUS, etc.)
CLNREVSTAT: Review status (confidence level)
CLNDN: Condition/disease name

Clinical Significance Values:

Pathogenic - Strong evidence the variant causes disease
Likely_pathogenic - Evidence suggests disease-causing
Uncertain_significance (VUS) - Insufficient evidence
Likely_benign - Evidence suggests not disease-causing
Benign - Strong evidence variant is not disease-causing

Automated Download Script

All three resources (REVEL, AlphaMissense, and ClinVar) can be downloaded and prepared automatically using our setup script: Download the setup script

bash 01_download_all_VEP_dbs.sh

This will create the following structure:

/data/VEP_dbs/
├── clinvar/
│   ├── clinvar.vcf.gz
│   └── clinvar.vcf.gz.tbi
├── alphamissense/
│   ├── AlphaMissense_hg38.tsv.gz
│   └── AlphaMissense_hg38.tsv.gz.tbi
└── revel/
    ├── new_tabbed_revel_grch38.tsv.gz
    └── new_tabbed_revel_grch38.tsv.gz.tbi

We have seen how to run VEP and how to filter the output. Now we can make another call on the Mutect2 output to annotate the variants using a combination of some plugins and some custom databases.

Damage prediction tools

REVEL: An ensemble method that integrates multiple prediction tools
AlphaMissense: Uses deep learning trained on evolutionary data
ClinVar: Provides clinical interpretations from expert curation

Complex VEP annotation

Here we learn how to use additional data to include additional information to the annotation

EXTRA: Exercise 4

Complex VEP run

Objective: Learn how to comprehensively annotate cancer variants using VEP’s advanced features, plugins, and filtering capabilities to identify potentially pathogenic variants.

Task: Enhance the previous analysis by:

Adding ClinVar annotations
Adding REVEL annotations
Adding AlphaMissense
Adding population frequencies --af_gnomade (gnomAD data is included in VEP cache)
Adding plugins (--plugin SpliceRegion, --plugin AlphaMissense, --plugin REVEL)
Filtering for HIGH/MODERATE impact variants

Solution

VARIANTDIR="${HOME}/project/variants"
RESOURCEDIR="${HOME}/project/data/VEP_dbs"

vep -i "$VARIANTDIR/somatic.filtered.PASS.vcf.gz" \
    --cache ${HOME}/project/data/.vep \
    --assembly GRCh38 \
    --offline \
    --format vcf \
    --fork 2 \
    --everything \
    --af_gnomade \
    --force_overwrite \
    --plugin SpliceRegion \
    --plugin REVEL,file="$RESOURCEDIR/revel/new_tabbed_revel_grch38.tsv.gz" \
    --plugin AlphaMissense,file="$RESOURCEDIR/alphamissense/AlphaMissense_hg38.tsv.gz" \
    --custom "$RESOURCEDIR/clinvar/clinvar.vcf.gz",ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN \
    --output_file "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt"

# Filter the data -> Moderate (discovery)
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt" \
-o "$VARIANTDIR/moderate_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) and \
         (REVEL >= 0.5 or \
          am_class = 'likely_pathogenic' or \
          SIFT = 'deleterious' or \
          PolyPhen = 'probably_damaging')"

# How many events do you get?
grep -v "^#" "$VARIANTDIR"/moderate_filtered.txt | wc -l #19

# Filter the data -> Stringent (high confidence)
filter_vep \
-i "$VARIANTDIR/chr6ch17_mutect2_annotated_with_clinVar_and_plugins.txt" \
-o "$VARIANTDIR/stringent_filtered.txt" \
--force_overwrite \
--filter "(IMPACT is HIGH or IMPACT is MODERATE) and \
  (REVEL >= 0.75 or \
  am_class = 'likely_pathogenic' or \
  (SIFT = 'deleterious' and PolyPhen = 'probably_damaging'))"

# How many events do you get?
grep -v "^#" "$VARIANTDIR"/stringent_filtered.txt | wc -l #14

You could also include and (gnomADe_AF < 0.001 or not gnomADe_AF) to filter likely germline variants. For somatic analysis, a threshold of 0.1% (0.001) is standard — variants above this frequency in gnomAD are almost certainly germline. The less stringent threshold of 1% (0.01) is more appropriate for germline pathogenicity assessment.

exact vs overlap in VEP –custom

exact: Only matches if all coordinates are identical
overlap: Matches if regions overlap

For example you can use overlap for indels, exact for strict SNV matching.

A glimpse on the results

Variant Details

chr6_10801908_C/G   chr6:10801908   G   ENSG00000111837 ENST00000313243 Transcript  missense_variant    1198    815 272 R/P cGa/cCa COSV57568710    IMPACT=MODERATE;STRAND=-1;VARIANT_CLASS=SNV;SYMBOL=MAK;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:6816;BIOTYPE=protein_coding;TSL=5;APPRIS=A1;CCDS=CCDS4516.1;ENSP=ENSP00000313021;SWISSPROT=P20794.202;TREMBL=A0A140VK28.35;UNIPARC=UPI0000001BCD;UNIPROT_ISOFORM=P20794-1;GENE_PHENO=1;SIFT=deleterious(0);PolyPhen=probably_damaging(1);EXON=8/14;DOMAINS=Gene3D:1.10.510.10,AFDB-ENSP_mappings:AF-P20794-F1,Pfam:PF00069,PROSITE_profiles:PS50011,PANTHER:PTHR24055,SMART:SM00220,Superfamily:SSF56112,CDD:cd07830;HGVSc=ENST00000313243.6:c.815G>C;HGVSp=ENSP00000313021.2:p.Arg272Pro;SOMATIC=1;PHENO=1;REVEL=0.884;am_class=likely_pathogenic;am_pathogenicity=0.9946
chr6_10801908_C/G   chr6:10801908   G   ENSG00000111837 ENST00000354489 Transcript  missense_variant    1081    815 272 R/P cGa/cCa COSV57568710    IMPACT=MODERATE;STRAND=-1;VARIANT_CLASS=SNV;SYMBOL=MAK;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:6816;BIOTYPE=protein_coding;CANONICAL=YES;MANE=MANE_Select;MANE_SELECT=NM_001242957.3;TSL=5;APPRIS=P4;CCDS=CCDS75399.1;ENSP=ENSP00000346484;SWISSPROT=P20794.202;UNIPARC=UPI000217CBBA;UNIPROT_ISOFORM=P20794-2;GENE_PHENO=1;SIFT=deleterious(0);PolyPhen=probably_damaging(0.999);EXON=8/15;DOMAINS=Gene3D:1.10.510.10,AFDB-ENSP_mappings:AF-P20794-F1,Pfam:PF00069,PROSITE_profiles:PS50011,PANTHER:PTHR24055,SMART:SM00220,Superfamily:SSF56112,CDD:cd07830;HGVSc=ENST00000354489.7:c.815G>C;HGVSp=ENSP00000346484.3:p.Arg272Pro;SOMATIC=1;PHENO=1;REVEL=0.884;am_class=likely_pathogenic;am_pathogenicity=0.9946

Category	Information	Description
Basic Information
Genomic Location	chr6:10801908	GRCh38 coordinates
Variant Type	SNV (C>G)	Single nucleotide variant
Gene	MAK	Male Germ Cell-Associated Kinase
Impact	MODERATE	VEP impact classification

Transcript Effects

Transcript 1 (ENST00000313243)

Feature	Value	Note
Consequence	missense_variant	Changes amino acid
Protein Change	p.Arg272Pro	Arginine to Proline
Transcript Support	TSL=5, APPRIS=A1	Low transcript support level
SIFT	deleterious(0)	Maximum deleteriousness score
PolyPhen	probably_damaging(1)	Maximum damaging score
Affected Domain	PF00069 (Protein kinase domain)	Functional domain impact

Transcript 2 (ENST00000354489)

Feature	Value	Note
Consequence	missense_variant	Changes amino acid
Protein Change	p.Arg272Pro	Arginine to Proline
Transcript Status	CANONICAL, MANE_Select	Principal transcript
SIFT	deleterious(0)	Maximum deleteriousness score
PolyPhen	probably_damaging(0.999)	Near maximum damaging score
Affected Domain	PF00069 (Protein kinase domain)	Functional domain impact

Prediction Tools

REVEL score: 0.884 (calibrated thresholds from Pejaver et al. 2022, AJHG: >=0.773 supports Pathogenic/Moderate, >=0.932 supports Pathogenic/Strong)
AlphaMissense:
- Class: likely_pathogenic
- Pathogenicity score: 0.9946 (very high confidence)
SpliceRegion: No splice site effects detected
ClinVar: No clinical data found in ClinVar database
SIFT: SIFT=deleterious(0)
PolyPhen: PolyPhen=probably_damaging(1)
IMPACT: MODERATE

Additional Annotations

Feature	Value	Description
COSMIC ID	COSV57568710	Catalogued in COSMIC database
Somatic Status	SOMATIC=1	Confirmed somatic variant
Phenotype	PHENO=1	Associated with phenotype
UniProt	P20794.202	Protein database reference

Note

At this point we obtained a good annotated set of somatic variants. This paves the way for further analysis. For example you can add more databases to your annotation call, such as COSMIC, cancerhotspots, onkoKB etc. You can also try other plugins, depending on your biological question(s). Those can add additional data and possibly additional filtering options.

What happens if you turn it to exact instead of overlap?
Add other databases
Try additional plugins
Find interesting genes that could be drivers in your samples
Consider population frequency of the mutation(s) (common variants are less likely to be pathogenic)
Consider possible drug-gene interactions
Check in the literature
Visualize the results

EXTRA: Example of CNVs annotation with VEP

CNV annotation is more challenging than SNV annotation because:

Tools like REVEL/AlphaMissense are designed for SNVs
CNV breakpoints can be imprecise
Functional impact affects multiple genes

For those interested in CNV annotation, expanded materials are available in the advanced resources section.

Additional VEP Plugins Worth Exploring

SpliceAI (Jaganathan et al., Cell 2019): A deep learning tool that predicts cryptic splice effects up to 50 bp from exon boundaries — a category of splice-altering variants that the SpliceRegion plugin misses entirely. Precomputed scores are available as a VEP plugin (~30 GB). Given that TNBC shows high rates of splice-disrupting mutations in tumour suppressors, SpliceAI is a valuable addition for clinical-grade annotation pipelines. See the VEP plugin documentation.

Note on SIFT/PolyPhen in somatic analysis: Both SIFT and PolyPhen were trained and validated on germline variants. For somatic missense mutations in cancer, ensemble methods like REVEL and AlphaMissense are better calibrated. Consider SIFT/PolyPhen as supporting evidence rather than primary predictors when interpreting cancer variants.

Next steps:

VAF Analysis — Explore the Variant Allele Frequency distribution of your somatic calls. Connect what Mutect2 reported to tumor purity, clonality, and the biology discussed in Day 2.
R Analysis & Visualization — Visualize and explore the VEP annotation results in R.