Somatic Variant Calling with GATK4

From Mutect2 to filtering and copy number analysis

Dr. Flavio Lombardo

2026-04-28

Learning Objectives

After this session, you will be able to:

Explain why somatic calling requires different approaches than germline calling
Describe how Mutect2 detects somatic variants
Understand the role of Panel of Normals and gnomAD in filtering
Interpret FilterMutectCalls output and filter categories
Understand the principles of CNV detection with CNVkit

Recap: The Somatic Calling Challenge

Germline calling assumes:

Heterozygous variants at ~50% VAF
Homozygous variants at ~100% VAF
Consistent across all cells

Somatic calling must handle:

Variable tumor purity (30–80%)
Subclonal populations (VAF <5%)
Copy number alterations shifting expected VAF
Technical artifacts mimicking real variants
FFPE damage (C>T deamination)

The key question:

Is this variant real and somatic, or is it an artifact or germline?

Strategy: multiple layers of evidence

Statistical model (Mutect2 likelihood)
Matched normal comparison
Population databases (gnomAD)
Technical artifact databases (PoN)
Read orientation model (LearnReadOrientationModel)
Post-call filtering (FilterMutectCalls)

Mutect2: How It Works

Mutect2 is a somatic variant caller from GATK4 that uses:

Local haplotype assembly — reassembles reads in active regions (same engine as HaplotypeCaller)
Somatic genotyping model — does NOT assume Hardy-Weinberg equilibrium or fixed ploidy
Tumor-normal joint analysis — models the normal as a filter, not as a diploid reference
Log-odds scoring (TLOD/NLOD) — TLOD: evidence for the variant in the tumor; NLOD: evidence against the variant being present in the normal (germline filter)

Key difference from HaplotypeCaller:

No fixed allele frequency expectation
Models sequencing errors, germline contamination, and artifacts simultaneously
Outputs unfiltered calls with rich annotations for downstream filtering

Mutect2 command structure:

gatk Mutect2 \
  -R reference.fa \
  --intervals targets.interval_list \
  -I tumor.bam \
  -I normal.bam \
  -normal <normal_sample_name> \
  --germline-resource gnomad.vcf.gz \
  --panel-of-normals pon.vcf.gz \
  --f1r2-tar-gz f1r2.tar.gz \
  -O somatic.vcf.gz
# --f1r2-tar-gz: collects read orientation stats
# used by LearnReadOrientationModel in the next step

The -normal flag takes the SM tag from the read group, not the file name!

Hardy-Weinberg equilibrium predicts allele frequencies in a stable, randomly mating population (het variants at ~50%, hom at ~100%). Tumors violate this: purity, copy number changes, and subclonal evolution mean any VAF is possible.

Panel of Normals (PoN)

What is a PoN?

A VCF of sites seen in normal samples that represent:

Systematic sequencing artifacts
Alignment artifacts (e.g., mapping errors)
Recurrent technical noise specific to a platform/protocol

It is NOT a germline filter — it captures technical artifacts that appear as false somatic calls.

How Mutect2 uses it:

Sites present in the PoN receive a panel_of_normals filter tag, indicating they are likely artifacts rather than true somatic variants.

Best practices:

GATK recommends ≥40 normal samples sequenced on the same platform/protocol (fewer can still provide value)
Use gatk CreateSomaticPanelOfNormals
Larger PoN = better artifact capture

When you don’t have enough normals:

Use a public PoN (e.g., 1000 Genomes PoN from GATK resource bundle)
Accept that some platform-specific artifacts may not be captured
In our course: we use the 1000 Genomes PoN (1000g_pon.hg38.subset.vcf.gz)

A PoN built from your own data will always outperform a public one — same capture kit, same sequencer, same library prep.

Germline Resource: gnomAD

Purpose:

Provide population allele frequencies so Mutect2 can estimate the prior probability that a variant is germline.

How it works:

Variants with high population frequency are more likely germline
Mutect2 incorporates this as a Bayesian prior
Helps distinguish: “Is this 50% VAF variant a clonal somatic mutation or a heterozygous germline SNP?”

gnomAD v4.0 at a glance:

~730,000 individuals (v4.0 release)
Exomes + genomes
Multi-ancestry representation

In practice:

--germline-resource \
  af-only-gnomad.hg38.vcf.gz

We use the af-only version — it contains only the AF field, which is all Mutect2 needs. The full gnomAD VCF is much larger.

PoN vs gnomAD — Different Roles

gnomAD: Filters biological germline variants (universal, any sequencing platform)
PoN: Filters technical artifacts (platform-specific, protocol-specific)

Both are needed for high-quality somatic calling.

FilterMutectCalls: Post-Call Filtering

What it does:

Evaluates Mutect2’s raw calls and applies a probabilistic model to classify each variant. It takes the orientation bias artifact priors from LearnReadOrientationModel via --orientation-bias-artifact-priors, which corrects for OxoG and FFPE deamination artifacts before assigning filter tags.

Two types of filtering:

Likelihood-based — evaluates evidence strength:
- weak_evidence: the variant does not meet the minimum tumor log-odds (TLOD) threshold
Hard filters — individual metric thresholds:
- strand_bias: reads only from one direction
- base_qual: poor base quality at variant position
- map_qual: poor mapping quality
- clustered_events: multiple variants clustered together

Common filter tags:

Filter	Meaning
`PASS`	Passed all filters
`weak_evidence`	Below likelihood threshold
`normal_artifact`	Artifact seen in normal
`panel_of_normals`	Present in PoN
`germline`	Likely germline variant
`strand_bias`	One-directional support
`orientation`	Read orientation bias (OxoG/FFPE)
`slippage`	Short tandem repeat artifact
`contamination`	Cross-sample contamination

Filtering commands:

# Step 1: learn orientation bias model
gatk LearnReadOrientationModel \
  -I f1r2.tar.gz \
  -O read-orientation-model.tar.gz

# Step 2: filter calls
gatk FilterMutectCalls \
  -V somatic.vcf.gz \
  -R reference.fa \
  --stats somatic.vcf.gz.stats \
  --orientation-bias-artifact-priors \
    read-orientation-model.tar.gz \
  -O somatic.filtered.vcf.gz

# Step 3: extract PASS variants
bcftools view -f PASS \
  somatic.filtered.vcf.gz \
  -Oz -o somatic.PASS.vcf.gz

Note

somatic.vcf.gz.stats is written automatically by Mutect2 alongside the VCF. FilterMutectCalls uses it to calibrate genome-wide filters.

Interpreting Filter Results

Typical output from our HCC1395 dataset:

129 PASS
 22 normal_artifact;strand_bias
 21 panel_of_normals
 18 weak_evidence
 16 strand_bias;weak_evidence
 14 normal_artifact;slippage;weak_evidence
 14 normal_artifact
 13 orientation;weak_evidence
 13 clustered_events;normal_artifact;strand_bias
 12 germline;multiallelic;normal_artifact;panel_of_normals

Key observations:

~47% of calls pass all filters (129/272)
Variants can have multiple filter tags (e.g., normal_artifact;strand_bias)
PASS variants typically have: tumor VAF >15%, normal VAF <3%, high TLOD
weak_evidence variants: tumor VAF <5%, near detection limit
germline variants: similar VAF in tumor and normal

PASS does not mean validated! Clinical decisions require orthogonal validation (Sanger, ddPCR, amplicon-seq).

Extracting VAF from Mutect2 Output

Two AF fields — don’t confuse them:

Field	Location	Meaning
`INFO/AF`	INFO column	Allele frequency estimated from the data (not gnomAD — gnomAD AF is used as a prior internally and not written to the output VCF)
`FORMAT/AF`	Per-sample column	Tumor allele fraction (what we want)

Extract VAF for every variant:

VARIANTDIR=~/project/variants
bcftools query \
  -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%AF;]\n' \
  "$VARIANTDIR"/somatic.filtered.vcf.gz | head -30

Output:

chr6  106265   map_qual;strand_bias;weak_evidence  3.99    normal=0.007;tumor=0.01;
chr6  325357   PASS                               719.91   normal=0.001;tumor=0.161;
chr6  325783   PASS                              1292.31   normal=0.001;tumor=0.224;
chr6  1742560  panel_of_normals                   235.6    normal=0.02;tumor=0.41;
chr6  5103987  germline;...;panel_of_normals       13.57   normal=0.825;tumor=0.834;
chr6  9900366  panel_of_normals                   525.07   normal=0.024;tumor=0.992;
chr6  13977507 weak_evidence                        3.4    normal=0.012;tumor=0.027;

PASS → tumor VAF >0.15, normal VAF <0.06

weak_evidence → tumor VAF <0.05, near detection limit

germline → similar VAF in tumor and normal

Discover the FORMAT fields in your VCF:

VARIANTDIR=~/project/variants
bcftools view -h "$VARIANTDIR"/somatic.filtered.vcf.gz \
  | grep "^##FORMAT"

Look for: ##FORMAT=<ID=AF,...,Description="Allele fractions of alternate alleles in the tumor"> — FORMAT/AF is the per-sample tumor VAF.

Copy Number Variation in Cancer

Why CNVs matter:

Tumor suppressor loss: TP53 loss-of-function (predominantly point mutations, with frequent 17p LOH) in ~80–85% of TNBC
Oncogene amplification: ERBB2/HER2, MYC, EGFR
Treatment selection: HER2 amplification → trastuzumab
Drug resistance: gene amplification can confer resistance

Detection principle (WES):

Count reads in target regions (bins)
Normalize for GC content and mappability
Compare tumor vs normal read depth ratio
Segment regions with similar ratios
Call gains and losses

Log2 ratio = log2(tumor depth / normal depth)

Interpretation:

Log2 ratio	Copy number	Meaning
0	2	Normal diploid
+0.58	~3	Single copy gain
+1.0	~4	Duplication
-1.0	1	Single copy loss
<< -1	~0	Homozygous deletion

Formula:

\[\text{Copy number} = 2^{\text{log2 ratio}} \times 2\]

Note: This assumes 100% tumour purity and diploid baseline. In practice, purity and ploidy correction is required for absolute copy number estimation.

CNVkit: Practical CNV Detection

Why CNVkit for WES data?

Uses both on-target and off-target reads for coverage
Off-target reads provide additional coverage signal between capture regions
Better segmentation than on-target-only approaches

The batch command does it all:

cnvkit.py batch tumor.bam \
  --normal normal.bam \
  --targets targets.bed \
  --fasta reference.fa \
  --annotate refFlat.txt \
  --output-dir cnvkit/ \
  --scatter --diagram

Key outputs:

File	Description
`.cnr`	Bin-level log2 ratios
`.cns`	Segmented regions
`-scatter.png`	Scatter plot visualization
`-diagram.pdf`	Chromosome diagram

Reading the scatter plot:

Gray dots: individual bin measurements
Orange lines: segmented regions (called CNVs)
Y-axis: log2 ratio (0 = diploid)
X-axis: genomic position

Reference

Talevich et al. (2016) “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing” PLOS Computational Biology

Today’s Workflow

flowchart LR
    A[Tumor BAM] --> C[Mutect2]
    B[Normal BAM] --> C
    D[PoN] --> C
    E[gnomAD] --> C
    C --> J[LearnReadOrientationModel]
    C --> F[FilterMutectCalls]
    J --> F
    F --> G[PASS VCF]
    A --> H[CNVkit]
    B --> H
    H --> I[CNV calls + plots]

    classDef exercise fill:#ffeb3b,stroke:#333,stroke-width:2px;
    class C,J,F,G,H,I exercise;

Exercises today:

Run Mutect2 on tumor-normal pair
Apply FilterMutectCalls and explore the VCF
Examine variant allele frequencies across filter categories
Run CNVkit and interpret copy number scatter plots

All exercise scripts and data are in your ~/project/ directory. Detailed instructions are on the course website.

Essential Reading

Key methodological references:

Benjamin et al. (2019) “Calling Somatic SNVs and Indels with Mutect2” bioRxiv (preprint) doi:10.1101/861054 — the Mutect2 methods paper describing the statistical model and filtering approach
Talevich et al. (2016) “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing” PLOS Computational Biology 12(4): e1004873
Karczewski et al. (2020) “The mutational constraint spectrum quantified from variation in 141,456 humans” Nature 581, 434–443 — the gnomAD v2.1.1 paper (v4.0 release notes at gnomad.broadinstitute.org)

GATK Best Practices

The GATK somatic short variant discovery documentation is an excellent reference for understanding each step of the pipeline.

Exercises

Let’s put this into practice!

Giphy

Head to the Day 2 exercises on the course website.