Somatic Variant Calling with GATK4

From Mutect2 to filtering and copy number analysis

Dr. Flavio Lombardo

2026-04-28

Learning Objectives

After this session, you will be able to:

  • Explain why somatic calling requires different approaches than germline calling
  • Describe how Mutect2 detects somatic variants
  • Understand the role of Panel of Normals and gnomAD in filtering
  • Interpret FilterMutectCalls output and filter categories
  • Understand the principles of CNV detection with CNVkit

Recap: The Somatic Calling Challenge

Germline calling assumes:

  • Heterozygous variants at ~50% VAF
  • Homozygous variants at ~100% VAF
  • Consistent across all cells

Somatic calling must handle:

  • Variable tumor purity (30–80%)
  • Subclonal populations (VAF <5%)
  • Copy number alterations shifting expected VAF
  • Technical artifacts mimicking real variants
  • FFPE damage (C>T deamination)

The key question:

Is this variant real and somatic, or is it an artifact or germline?

Strategy: multiple layers of evidence

  1. Statistical model (Mutect2 likelihood)
  2. Matched normal comparison
  3. Population databases (gnomAD)
  4. Technical artifact databases (PoN)
  5. Read orientation model (LearnReadOrientationModel)
  6. Post-call filtering (FilterMutectCalls)

Mutect2: How It Works

Mutect2 is a somatic variant caller from GATK4 that uses:

  1. Local haplotype assembly — reassembles reads in active regions (same engine as HaplotypeCaller)
  2. Somatic genotyping model — does NOT assume Hardy-Weinberg equilibrium or fixed ploidy
  3. Tumor-normal joint analysis — models the normal as a filter, not as a diploid reference
  4. Log-odds scoring (TLOD/NLOD) — TLOD: evidence for the variant in the tumor; NLOD: evidence against the variant being present in the normal (germline filter)

Key difference from HaplotypeCaller:

  • No fixed allele frequency expectation
  • Models sequencing errors, germline contamination, and artifacts simultaneously
  • Outputs unfiltered calls with rich annotations for downstream filtering

Mutect2 command structure:

gatk Mutect2 \
  -R reference.fa \
  --intervals targets.interval_list \
  -I tumor.bam \
  -I normal.bam \
  -normal <normal_sample_name> \
  --germline-resource gnomad.vcf.gz \
  --panel-of-normals pon.vcf.gz \
  --f1r2-tar-gz f1r2.tar.gz \
  -O somatic.vcf.gz
# --f1r2-tar-gz: collects read orientation stats
# used by LearnReadOrientationModel in the next step

The -normal flag takes the SM tag from the read group, not the file name!

Hardy-Weinberg equilibrium predicts allele frequencies in a stable, randomly mating population (het variants at ~50%, hom at ~100%). Tumors violate this: purity, copy number changes, and subclonal evolution mean any VAF is possible.

Panel of Normals (PoN)

What is a PoN?

A VCF of sites seen in normal samples that represent:

  • Systematic sequencing artifacts
  • Alignment artifacts (e.g., mapping errors)
  • Recurrent technical noise specific to a platform/protocol

It is NOT a germline filter — it captures technical artifacts that appear as false somatic calls.

How Mutect2 uses it:

Sites present in the PoN receive a panel_of_normals filter tag, indicating they are likely artifacts rather than true somatic variants.

Best practices:

  • GATK recommends ≥40 normal samples sequenced on the same platform/protocol (fewer can still provide value)
  • Use gatk CreateSomaticPanelOfNormals
  • Larger PoN = better artifact capture

When you don’t have enough normals:

  • Use a public PoN (e.g., 1000 Genomes PoN from GATK resource bundle)
  • Accept that some platform-specific artifacts may not be captured
  • In our course: we use the 1000 Genomes PoN (1000g_pon.hg38.subset.vcf.gz)

A PoN built from your own data will always outperform a public one — same capture kit, same sequencer, same library prep.

Germline Resource: gnomAD

Purpose:

Provide population allele frequencies so Mutect2 can estimate the prior probability that a variant is germline.

How it works:

  • Variants with high population frequency are more likely germline
  • Mutect2 incorporates this as a Bayesian prior
  • Helps distinguish: “Is this 50% VAF variant a clonal somatic mutation or a heterozygous germline SNP?”

gnomAD v4.0 at a glance:

  • ~730,000 individuals (v4.0 release)
  • Exomes + genomes
  • Multi-ancestry representation

In practice:

--germline-resource \
  af-only-gnomad.hg38.vcf.gz

We use the af-only version — it contains only the AF field, which is all Mutect2 needs. The full gnomAD VCF is much larger.

PoN vs gnomAD — Different Roles

  • gnomAD: Filters biological germline variants (universal, any sequencing platform)
  • PoN: Filters technical artifacts (platform-specific, protocol-specific)

Both are needed for high-quality somatic calling.

FilterMutectCalls: Post-Call Filtering

What it does:

Evaluates Mutect2’s raw calls and applies a probabilistic model to classify each variant. It takes the orientation bias artifact priors from LearnReadOrientationModel via --orientation-bias-artifact-priors, which corrects for OxoG and FFPE deamination artifacts before assigning filter tags.

Two types of filtering:

  1. Likelihood-based — evaluates evidence strength:
    • weak_evidence: the variant does not meet the minimum tumor log-odds (TLOD) threshold
  2. Hard filters — individual metric thresholds:
    • strand_bias: reads only from one direction
    • base_qual: poor base quality at variant position
    • map_qual: poor mapping quality
    • clustered_events: multiple variants clustered together

Common filter tags:

Filter Meaning
PASS Passed all filters
weak_evidence Below likelihood threshold
normal_artifact Artifact seen in normal
panel_of_normals Present in PoN
germline Likely germline variant
strand_bias One-directional support
orientation Read orientation bias (OxoG/FFPE)
slippage Short tandem repeat artifact
contamination Cross-sample contamination

Filtering commands:

# Step 1: learn orientation bias model
gatk LearnReadOrientationModel \
  -I f1r2.tar.gz \
  -O read-orientation-model.tar.gz

# Step 2: filter calls
gatk FilterMutectCalls \
  -V somatic.vcf.gz \
  -R reference.fa \
  --stats somatic.vcf.gz.stats \
  --orientation-bias-artifact-priors \
    read-orientation-model.tar.gz \
  -O somatic.filtered.vcf.gz

# Step 3: extract PASS variants
bcftools view -f PASS \
  somatic.filtered.vcf.gz \
  -Oz -o somatic.PASS.vcf.gz

Note

somatic.vcf.gz.stats is written automatically by Mutect2 alongside the VCF. FilterMutectCalls uses it to calibrate genome-wide filters.

Interpreting Filter Results

Typical output from our HCC1395 dataset:

129 PASS
 22 normal_artifact;strand_bias
 21 panel_of_normals
 18 weak_evidence
 16 strand_bias;weak_evidence
 14 normal_artifact;slippage;weak_evidence
 14 normal_artifact
 13 orientation;weak_evidence
 13 clustered_events;normal_artifact;strand_bias
 12 germline;multiallelic;normal_artifact;panel_of_normals

Key observations:

  • ~47% of calls pass all filters (129/272)
  • Variants can have multiple filter tags (e.g., normal_artifact;strand_bias)
  • PASS variants typically have: tumor VAF >15%, normal VAF <3%, high TLOD
  • weak_evidence variants: tumor VAF <5%, near detection limit
  • germline variants: similar VAF in tumor and normal

PASS does not mean validated! Clinical decisions require orthogonal validation (Sanger, ddPCR, amplicon-seq).

Extracting VAF from Mutect2 Output

Two AF fields — don’t confuse them:

Field Location Meaning
INFO/AF INFO column Allele frequency estimated from the data (not gnomAD — gnomAD AF is used as a prior internally and not written to the output VCF)
FORMAT/AF Per-sample column Tumor allele fraction (what we want)

Extract VAF for every variant:

VARIANTDIR=~/project/variants
bcftools query \
  -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%AF;]\n' \
  "$VARIANTDIR"/somatic.filtered.vcf.gz | head -30

Output:

chr6  106265   map_qual;strand_bias;weak_evidence  3.99    normal=0.007;tumor=0.01;
chr6  325357   PASS                               719.91   normal=0.001;tumor=0.161;
chr6  325783   PASS                              1292.31   normal=0.001;tumor=0.224;
chr6  1742560  panel_of_normals                   235.6    normal=0.02;tumor=0.41;
chr6  5103987  germline;...;panel_of_normals       13.57   normal=0.825;tumor=0.834;
chr6  9900366  panel_of_normals                   525.07   normal=0.024;tumor=0.992;
chr6  13977507 weak_evidence                        3.4    normal=0.012;tumor=0.027;

PASS → tumor VAF >0.15, normal VAF <0.06

weak_evidence → tumor VAF <0.05, near detection limit

germline → similar VAF in tumor and normal

Discover the FORMAT fields in your VCF:

VARIANTDIR=~/project/variants
bcftools view -h "$VARIANTDIR"/somatic.filtered.vcf.gz \
  | grep "^##FORMAT"

Look for: ##FORMAT=<ID=AF,...,Description="Allele fractions of alternate alleles in the tumor">FORMAT/AF is the per-sample tumor VAF.

Copy Number Variation in Cancer

Why CNVs matter:

  • Tumor suppressor loss: TP53 loss-of-function (predominantly point mutations, with frequent 17p LOH) in ~80–85% of TNBC
  • Oncogene amplification: ERBB2/HER2, MYC, EGFR
  • Treatment selection: HER2 amplification → trastuzumab
  • Drug resistance: gene amplification can confer resistance

Detection principle (WES):

  1. Count reads in target regions (bins)
  2. Normalize for GC content and mappability
  3. Compare tumor vs normal read depth ratio
  4. Segment regions with similar ratios
  5. Call gains and losses

Log2 ratio = log2(tumor depth / normal depth)

Interpretation:

Log2 ratio Copy number Meaning
0 2 Normal diploid
+0.58 ~3 Single copy gain
+1.0 ~4 Duplication
-1.0 1 Single copy loss
<< -1 ~0 Homozygous deletion

Formula:

\[\text{Copy number} = 2^{\text{log2 ratio}} \times 2\]

Note: This assumes 100% tumour purity and diploid baseline. In practice, purity and ploidy correction is required for absolute copy number estimation.

CNVkit: Practical CNV Detection

Why CNVkit for WES data?

  • Uses both on-target and off-target reads for coverage
  • Off-target reads provide additional coverage signal between capture regions
  • Better segmentation than on-target-only approaches

The batch command does it all:

cnvkit.py batch tumor.bam \
  --normal normal.bam \
  --targets targets.bed \
  --fasta reference.fa \
  --annotate refFlat.txt \
  --output-dir cnvkit/ \
  --scatter --diagram

Key outputs:

File Description
.cnr Bin-level log2 ratios
.cns Segmented regions
-scatter.png Scatter plot visualization
-diagram.pdf Chromosome diagram

Reading the scatter plot:

  • Gray dots: individual bin measurements
  • Orange lines: segmented regions (called CNVs)
  • Y-axis: log2 ratio (0 = diploid)
  • X-axis: genomic position

Reference

Talevich et al. (2016) “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing” PLOS Computational Biology

Today’s Workflow

flowchart LR
    A[Tumor BAM] --> C[Mutect2]
    B[Normal BAM] --> C
    D[PoN] --> C
    E[gnomAD] --> C
    C --> J[LearnReadOrientationModel]
    C --> F[FilterMutectCalls]
    J --> F
    F --> G[PASS VCF]
    A --> H[CNVkit]
    B --> H
    H --> I[CNV calls + plots]

    classDef exercise fill:#ffeb3b,stroke:#333,stroke-width:2px;
    class C,J,F,G,H,I exercise;

Exercises today:

  1. Run Mutect2 on tumor-normal pair
  2. Apply FilterMutectCalls and explore the VCF
  3. Examine variant allele frequencies across filter categories
  4. Run CNVkit and interpret copy number scatter plots

All exercise scripts and data are in your ~/project/ directory. Detailed instructions are on the course website.

Essential Reading

Key methodological references:

  • Benjamin et al. (2019) “Calling Somatic SNVs and Indels with Mutect2” bioRxiv (preprint) doi:10.1101/861054 — the Mutect2 methods paper describing the statistical model and filtering approach

  • Talevich et al. (2016) “CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing” PLOS Computational Biology 12(4): e1004873

  • Karczewski et al. (2020) “The mutational constraint spectrum quantified from variation in 141,456 humans” Nature 581, 434–443 — the gnomAD v2.1.1 paper (v4.0 release notes at gnomad.broadinstitute.org)

GATK Best Practices

The GATK somatic short variant discovery documentation is an excellent reference for understanding each step of the pipeline.

Exercises

Let’s put this into practice!

Giphy

Head to the Day 2 exercises on the course website.