Main

Recent advances in next-generation sequencing (NGS) technology now provide the first cost-effective approach to large-scale resequencing of human samples for medical and population genetics. Projects such as the 1000 Genomes Project1 (1KG), The Cancer Genome Atlas and numerous large medically focused exome sequencing projects2 are underway in an attempt to elucidate the full spectrum of human genetic diversity1 and the complete genetic architecture of human disease. The ability to examine the entire genome in an unbiased way will make possible comprehensive searches for standing variation in common disease and mutations underlying linkages in Mendelian disease3, as well as spontaneously arising variation for which no gene-mapping shortcuts are available (for example, somatic mutations in cancer4,5,6 and de novo mutations7 (Conrad, D.F. et al. unpublished data) in autism and schizophrenia).

Many capabilities are required to obtain a complete and accurate record of the variation in a genome from NGS data. Mapping reads to the reference genome8,9,10,11 is a first critical computational challenge whose cost necessitates aligning each read independently, which guarantees that many reads spanning indels will be misaligned. The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base12, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context13,14,15. These misaligned reads and inaccurate quality scores propagate into SNP discovery and genotyping, a general problem that becomes acute in projects that combine data from multiple sequencing technologies generated by many centers using rapidly evolving experimental processing pipelines, such as the 1000 Genomes Project.

Given well-mapped, aligned and calibrated reads, resolving even simple SNPs, let alone more complex variation such as multi-nucleotide substitutions, insertions and deletions, inversions, rearrangements and copy number variation, requires sensitive and specific statistical models8,9,10,11,15,16,17,18,19,20,21,22,23,24,25. Separating true variation from machine artifacts as a result of the high rate and context-specific nature of sequencing errors is the outstanding challenge in NGS analysis. Previous approaches have relied on filtering SNP calls that have characteristics outside of their normal ranges, such as those occurring at sites with too much coverage17,19, or by requiring non-reference bases to occur on at least three reads in both synthesis orientations20. Though effective, such hard filters are frustratingly difficult to develop, require parameterization for each new dataset and are necessarily either restrictive (high specificity, as in the 1000 Genomes Project) or tolerant (high sensitivity, used in Mendelian disease studies, with concomitantly more false positives). Moreover, all of these challenges must be addressed within the context of a proliferation of sequencing technology platforms and study designs (for example, whole-genome shotgun, exome capture sequencing and multiple samples sequenced at shallow coverage), a point not tackled in previous work.

Here we present a single framework and the associated tools capable of discovering high-quality variation and genotyping individual samples using diverse sequencing machines and experimental designs (Fig. 1). We present several new methods addressing the challenges listed above in local realignment, base quality recalibration, multi-sample SNP calling and adaptive error modeling, which we apply to three prototypical NGS datasets (Table 1). In each dataset, we included CEPH individual NA12878 to show the consistency of results for this individual across all three datasets.

Figure 1: Framework for variation discovery and genotyping from next-generation DNA sequencing.

See text for a detailed description.

Table 1 Next-generation DNA sequencing datasets analyzed

Results

Below we describe a three-part conceptual framework (Fig. 1).

• Phase 1: raw read data with platform-dependent biases were transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin and aligned consistently with respect to one another. Mapping algorithms placed reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM reference file format24. Next, molecular duplicates were eliminated (Supplementary Note), initial alignments were refined by local realignment and then an empirically accurate per-base error model was determined.

• Phase 2: the analysis-ready SAM/BAM files were analyzed to discover all sites with statistical evidence for an alternate allele present among the samples including SNPs, short indels and copy number variations (CNVs). CNV discovery and genotyping methods, though part of this conceptual framework, are described elsewhere25.

• Phase 3: technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium (LD), and family and population structure were integrated with the raw variant calls from phase 2 to separate true polymorphic sites from machine artifacts, and at these sites, high-quality genotypes were determined for all samples.

All components after initial mapping and duplicate marking were instantiated in the Genome Analysis Toolkit (GATK)26.

Applying the analysis pipeline to HiSeq

Of the 2.83 billion non-N bases in the autosomal regions and chromosome X of the human reference genome, 2.72 billion bases (96%) had sufficient coverage to call variants in the 101-bp paired-end HiSeq data (Table 1). Even though the HiSeq reads were aligned with the gap-enabled BWA10, more than 15% of the reads that span known homozygous indels in NA12878 were misaligned (Supplementary Table 1). Realignment corrected 6.6 million of 2.4 billion total reads in 950,000 regions covering 21 Mb in the HiSeq data, eliminating 1.8 million loci with substantial accumulation of mismatching bases (Supplementary Table 2). The initial data-processing steps (phase 1) eliminated 300,000 SNP calls, which is more than one fifth of the raw new calls, with quality metrics consistent with more than 90% of these SNPs being false positives (Table 2).

Table 2 Raw, recalibrated and imputed SNP calls for the HiSeq, exome and 61-sample low-pass datasets

The initial 4.2 million confidently called non-reference sites included 99.7% and 99.5% of the HapMap3 and 1KG Trio sites, respectively, genotyped as non-reference in NA12878; at these variant sites, the sequencing and genotyping calls were concordant 99.9% of the time (Table 2). Variant quality score recalibration of these initial calls identified a tranche of SNPs with estimated false discovery rate (FDR) of <1%, containing 3.2 million known variants and 362,000 new variants, a 90% dbSNP rate, and transition/transversion (Ti/Tv) ratios of 2.15 and 2.05, respectively, consistent with our genome-wide expectations (Online Methods). Although the variant recalibrator removed 595,000 total variants with a Ti/Tv ratio of 1.2, it retained 99% of the HapMap3 and 97.3% of the 1KG Trio non-reference sites. The discordant sites have 100 times higher genotype discrepancy rates, suggesting that the sites themselves may be problematic. Almost all of the variants in the 1% tranche are already present in the even higher stringency 0.1% FDR tranche, and analysis of the 10% FDR tranche suggests that some more variants could be obtained, but at the cost of many more false positives.

Applying the analysis pipeline to 28-Mb exome capture

The data processing tools eliminated 450 new call sites from the raw call set, representing more than 20% of all the new calls, with a Ti/Tv of 0.30—fully consistent with all being false positives—while adding several sites present in HapMap3 and the 1KG Trio. The raw whole-exome data-call set, at 150× coverage (Table 1), includes >99% of both the HapMap3 and 1KG Trio non-reference sites within the 28-Mb exome target region, with >99.8% genotype concordance at these sites. As with the HiSeq data, even with recalibration and local realignment, the Ti/Tv ratio of the new sites in the initial SNP calls indicates that more than 50% of these calls are false positives. Variant quality score recalibration, using only 5,400 SNPs for training, identified a high-quality subset of calls that captured >98% of the HapMap3 and 1KG Trio sites in the target regions. The value of the tranches was more pronounced in the whole exome (Fig. 4d), where 900 of the 1,039 new calls come from tranches with FDRs under 1%, despite needing to reach into the 10% FDR tranche to include most true positive SNPs.

Figure 2: Integrative genomics viewer (IGV) visualization of alignments in region chr.1: 1,510,530–1,510,589 from the Trio NA12878 Illumina reads from the 1000 Genomes Project (a) and NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment (b).

Reads are depicted as arrows oriented by increasing machine cycle; highlighted bases indicate mismatches to the reference: green, A; orange, G; red, T; dashes, deleted bases. A coverage histogram per base is shown above the reads. Both the 4-bp indel (rs34877486) and the C/T polymorphism (rs2878874) are present in dbSNP, as are the artifactual A/G polymorphisms (rs28782535 and rs28783181) resulting from the mis-modeled indel, indicating that these sites are common misalignment errors.

Figure 3: Raw (pink) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 of Illumina/GA (a), Roche/454 (b) and Life/SOLiD (c) lanes from the 1000 Genomes Project and Illumina/HiSeq (d).

For each technology, the top panel shows reported base quality scores compared to the empirical estimates (Online Methods); the middle panel shows the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; and the bottom panel shows the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.

Figure 4: Results of variant quality recalibration on HiSeq, exome and low-pass data sets.

(a) Relationship in the HiSeq call set between strand bias and quality by depth for genomic locations in HapMap3 (red) and dbSNP (orange) used for training the variant quality score recalibrator (left), and (b) the same annotations applied to differentiate likely true positive (green) from false positive (purple) new SNPs. (c–e) Quality tranches in the recalibrated HiSeq (c), exome (d) and low-pass CEU (e) calls, beginning with (top) the highest quality but smallest call set, with an estimated false positive rate among new SNP calls of <1/1,000, to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls, for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche's true- and false-positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.

The HiSeq whole-genome shotgun (WGS) and exome capture datasets differed drastically in their sequencing protocols (WGS versus hybrid capture), the sequencing machines (HiSeq versus Genome Analyzer) and the initial alignment tools (BWA10 versus MAQ9). Nevertheless, the exome call set is remarkably consistent with the subset of calls from HiSeq that overlap the target regions of the hybrid capture protocol. Ninety-four percent of the HiSeq calls were also called in the final exome set sliced at 10% FDR (data not shown), and at these sites, the non-reference discrepancy rate was extremely low (<0.4%). Mapping differences between the aligners used for the HiSeq (BWA) and exome (MAQ) datasets accounted for the vast majority of these discordant calls, with the remainder of the differences being because of limited coverage in the exome and only a small minority of sites being because of differential SNP calling or variant quality score recalibration. Overall, despite the technical differences in the capture and sequencing protocols of the HiSeq and exome datasets, the data processing pipeline presented here uncovered a remarkably consistent set of SNPs in exomes with excellent genotyping accuracy.

Applying the analysis pipeline to low-pass (4×) sequencing

Multi-sample low-pass resequencing poses a major challenge for variant discovery and genotyping because there is so little evidence at any particular locus in the genome for any given sample (Table 1). Consequently, it is in precisely this situation, where there is little signal from true SNPs, that our data processing tools are most valuable, as can be seen from the progression of call sets in Table 2. Local realignment and base quality recalibration eliminated 650,000 false-positive SNPs among 13 million sites, 4 times more sites than in the HiSeq dataset, with an aggregate Ti/Tv of 0.7. The initial low-pass CEU set includes over 13 million called sites among all individuals, of which nearly 7 million are new. NA12878 herself has 2.9 million variants, of which 430,000 are new. The 4× average coverage limits the sensitivity and concordance of this call set, with only 84% and 80% of HapMap3 and 1KG Trio sites, respectively, assigned a non-reference genotype in the NA12878 sample, both with a 20% non-reference discrepancy (NRD) rate.

The variant quality recalibrator identified, from the 13 million potential variants, 6 million known and 1.5 million new sites in tranches with 0.1% to 10% FDR. Figure 5a highlights several key features of the data: the allele frequency distribution of these calls closely matched the population genetics expectation, and the vast majority of HapMap3 and 1000 Genomes Project official CEU call sites were recovered, with the proportion nearing 100% for more common variant sites. Although we selected a 0.1% FDR tranche for analysis here, which contains the bulk of HapMap3, 1KG Trio and HiSeq sites, there are another 700,000 true sites that can be found in the 1% and 10% FDR tranches, albeit among many more false positives. This highest-quality tranche includes nearly all variants observed more than five times in the samples and 1.4 million new variants, with the SNPs in the tranches at 1% and 10% FDR generally occupying the lower alternate allele frequency range (Fig. 5b). The overall picture is clear: calling multiple samples simultaneously, even with only a handful of reads spanning a SNP for any given sample, enables one to detect the vast majority of common variant sites present in the cohort with a high degree of sensitivity.

Figure 5: Variation discovered among 60 individuals from the CEPH population from the 1000 Genomes Project pilot phase plus low-pass NA12878.

(a) Discovered SNPs by non-reference allele count in the 61-sample CEPH cohort, colored by known (light blue) and new (dark blue) variation, along with non-reference sensitivity to CEU HapMap3 and 1000 Genomes Project low-pass variants. (b) Quality and certainty of discovered SNPs by non-reference allele count. The histogram depicts the certainty of called variation broken out into 0.1%, 1% and 10% new FDR tranches. The Ti/Tv ratio is shown for known and new variation for each allele count, aggregating the new calls with allele count >74 because of their limited numbers. (c,d) Genotyping accuracy for NA12878 from reads alone (blue squares) and following genotype-likelihood–based imputation (pink circles), called in the 61-sample call set, as assessed by the NRD rate relative to HiSeq genotypes as a function of allele count (c) and sequencing depth (d).

Although the bulk properties of the 61-sample call set were good, we expected the low-pass 4× design to limit variation discovery and genotyping in each sample relative to deep resequencing. In the 61-sample call set, we discovered 80% of the non-reference sites in NA12878 according to the HapMap3, 1KG Trio and HiSeq call sets (Table 2). The 20% of the missed variant sites from these three datasets had little to no coverage in the NA12878 sample in the low-pass data and, therefore, could not be assigned a genotype using only the NGS data, a general limitation of the low-pass sequencing strategy (Table 2 and Fig. 5c,d). The multi-sample discovery design, however, affords us the opportunity to apply imputation to refine and recover genotypes at sites with little or no sequencing data. Applying genotype-likelihood–based imputation with Beagle27 to the 61-sample call set recovered an additional 15–20% of the non-reference sites in NA12878 that had insufficient coverage in the sequencing data (Table 2) and vastly improved genotyping accuracy (Fig. 5c,d).

We further characterized the quality of our low-pass call set as a function of the number of samples included during the discovery process in addition to NA12878 herself. Increasing the number of samples in the cohort rapidly improved both the sensitivity and specificity of the call set. As evidence mounts with more samples that a particular site is polymorphic, our confidence in the call increases and the site is more likely to be called (Fig. 6a). Distinguishing true positive variants from sequencing and data processing artifacts is more difficult with few samples and, consequently, low aggregated coverage; adding more reads allows the error covariates to identify sites as errors using the variant recalibrator (Fig. 6b,c).

Figure 6: Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples.

(a) Receiver operating characteristic (ROC) curves for SNP calls relating specificity and sensitivity to discover non-reference sites from the NA12878 HiSeq call set. The maximum callable sensitivity, 66%, is the percent of sites from the HiSeq call set where at least one read carries the alternate allele in the low-pass data for NA12878; it reflects both differences in the sequencing technologies (36–76-bp GAII for the low-pass NA12878 sample compared to 101-bp HiSeq) as well as the vagaries of sampling at 4× coverage. Because most of these missed sites are common and are consequently called in the other samples, imputation recovers 50% of these sites. (b,c) Increasing power to identify strand-biased, likely false positive SNP calls with additional samples. Histograms of the strand bias annotation at raw variant calls discovered in the low-pass CEU data using NA12878 at 4× combined with one other CEU individual (b) and with 60 other individuals (c) stratified into sites present (green) and not (purple) in the 1000 Genomes Project CEU trio.

The combination of multi-sample SNP calling, variant quality recalibration using error covariates and imputation allows one to achieve a high-quality call set, both in aggregate and per sample, with very little data. The aggregated 61-sample set at 4× coverage includes only four times as much sequencing data as the HiSeq data, yet we discovered 3.2 million polymorphic sites in NA12878, which includes 97%, 91% and 87% of the variants in the HapMap3, 1000 Genomes Project Trio and HiSeq call sets, respectively, while also finding 5 million additional variants among the 60 other samples.

Hard filtering versus variant quality score recalibration

Supplementary Table 3 lists the quality of call sets derived using our previous filtering approaches on all three datasets relative to the adaptive recalibrator described here. In all cases, the adaptive approach outperformed the manually optimized hard filtering previously developed for this calling system for the 1000 Genomes Project pilot data. This highlights two important points: first, that a principled integration of all covariates (which may have a complex correlation structure) should and does outperform single manually defined thresholds on covariates independently, with the added benefit of not requiring human intervention; and second, that an accurate ranking of discovered putative variants by the probability that each represents a true site permits the definition of tranches for specificity or sensitivity (Fig. 4c–e) as appropriate to the needs of the specific project. Although the most permissive tranche includes almost all sites that have any chance of being true polymorphisms—critical for projects looking for single large-effect mutations—the vast majority of true polymorphisms are present in the highest quality tranche of data (data not shown).

Comparison of this calling pipeline to Crossbow

To calibrate the additional value of the tools described here, we contrasted our results with SNPs called on our raw NA12878 exome data using Crossbow28, a package combining Bowtie, a gapless read mapping tool based on the Burrows-Wheeler transformation29, and SoapSNP for SNP detection15. We chose to perform this analysis on the exome data because its wide range of read depths and complex error modes make SNP calling a challenge, especially given the small number of new variants (~1,000 per sample) expected in this 28-Mb target. In Supplementary Table 4, the high-level results of the GATK and Crossbow calling pipelines are compared and contrasted. Key metrics such as the number of new SNP calls, their Ti/Tv ratio, the number of calls not seen in either the 1000 Genomes Project trio or the HiSeq data and the high nonsense and read-through rates indicate that the Crossbow call set has lower specificity than the GATK pipeline. This was true even after we applied an aggressive P value threshold (P < 0.01) for the base quality rank sum test15 to filter false-positive variants, which reduced sensitivity to the HM3, 1000 Genomes Project and HiSeq call sets by >3%. The intersection set between GATK and Crossbow is more specific but less sensitive than the calls unique to each pipeline (Table 1), a clear sign that despite the advances presented here, much work remains in perfecting calling in datasets like single-sample exome capture. The value of the data processing and error modeling presented here is nevertheless clear: applying local realignment and base quality score recalibration (using publicly available, easy-to-use modules in the GATK) is likely to improve the results of the Crossbow pipeline as well.

Discussion

The inaccuracy and covariation patterns differ strikingly between sequencing technologies (Fig. 3) and, if uncorrected, can propagate into downstream analyses. Accurately recalibrated base quality scores eliminate these sequencer-specific biases (Fig. 3) and enable integration of data generated from multiple systems. Although recalibration was developed for early NGS datasets like those from the 1000 Genomes Project pilot, its impact is still substantial even for data emerging today on newer sequencers like the HiSeq 2000. Together with local realignment, these two data processing methods eliminated millions of mostly false positive variants while preserving nearly all true variable sites, such as those in HapMap3 and the 1KG Trio (Table 2). In single-sample datasets, such as HiSeq and exome, without realignment and recalibration these false variants account for more than a fifth of all of the new calls.

Even with very deep coverage, the naïve Bayesian model for SNP calling results in an initial call set with a surprisingly large number of false-positive calls. Although we expected 3.3 million known and 330,000 new non-reference sites in a single European sample sequenced genome wide, the initial HiSeq call set contains 3.5 million known and 800,000 new calls. The excessive number of variable sites, and the low Ti/Tv ratio among the new calls in particular, implies that 600,000 of these variants are likely errors resulting from stochastic and systematic sequencing and alignment errors. The same calculations suggest that a similar fraction of the initial exome calls are likely false positives, and that more than 80% of the initial new low-pass SNP calls are likely errors. The adaptive error modeling developed here enabled us to identify these false-positive variants based on their dissimilarity to known variants, despite error rates of 50–80% among the new variants.

In each step of the pipeline, the improvements derive from the correction of systematic errors made in base calling or read mapping. By characterizing the specific NGS machine error processes and capturing our certainty, or lack thereof, that a putative variant is truly present in the sample or population, we delivered not a single concrete call set but a continuum from confident to less reliable variant calls for use as appropriate to the specific needs of downstream analysis. Mendelian disease projects can select a more sensitive set of calls with a higher error rate to avoid missing that single, high-impact variant, whereas community resource projects like the 1000 Genomes Project can place a high premium on specificity.

The division between SNP discovery and preliminary genotyping and genotype refinement (columns 2 and 3 of Fig. 1) avoids embedding in the discovery phase assumptions about population structure, sample relationships and the LD relationships between variants. Consequently, our calling approach applies equally well to population samples in Hardy-Weinberg equilibrium as to mother-father-child trios or inbreeding families affected by Mendelian disorders. Critically, our framework produces highly sensitive and specific variation calls without the use of LD and so can be applied in situations where LD information is unavailable or weak (many organisms) or where its use would confound analytic goals, such as studying LD patterns themselves or comparing Neanderthals and modern humans30. Where appropriate, however, imputation can be applied to great value, as we demonstrated in the 61-sample CEU low-pass call set.

The analysis results presented here clearly indicate that even with our best current approaches we are still far from obtaining a complete and accurate picture of genetic variation of all types in even a single sample. Even with the HiSeq 101-bp paired-end reads, nearly 4% (100 Mb) of the potentially callable genome is considered poorly mapped (Supplementary Note), and analysis of variants within these regions requires care. Nearly two thirds of the differences between the HiSeq and exome call sets can be attributed to different read mappings between BWA and MAQ.

The challenge of obtaining accurate variant calls from NGS data is substantial. We have developed an integrated analysis framework for data processing and variation discovery from NGS data that achieves consistent and accurate results across a wide array of experimental designs, including diverse sequencing machines and distinct sequencing approaches. Using data generated both at the Broad Institute and throughout the 1000 Genomes Project, we have shown that the introduction of improved calibration of base quality scores, local realignment to accommodate indels, the simultaneous evaluation of multiple samples from a population and, finally, an assessment of the likelihood that an identified variable site is a true biological DNA variant greatly improves the sensitivity and specificity of variant discovery from NGS data. The impending arrival of yet more NGS technologies makes modular, extensible frameworks like ours, which produce high-quality variant and genotype calls despite the distinct error modes of multiple technologies and across many experimental designs, all the more important.

Methods

Evaluating the quality of SNP calls.

Number of SNP calls and allele frequency. The number of calls and their allele frequency distribution for multi-sample calling should follow relatively closely the infinite-sites neutral expectation for N individuals (2N chromosomes) for small N:

E[no. of variant sites] = Lθ Σi=1…2N−1 (1/i)

where L is the number of confidently called bases and θ is the population-specific heterozygosity, with a genome-wide value of 0.8 × 10−3 for CEPH individuals (H. Li, unpublished data). A surplus of variants, especially heterozygous variants for single samples or lower-frequency variants for populations, is a strong indicator of false positives.
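As a quick plausibility check (our illustration, not part of the paper's methods), this expectation can be evaluated directly; the input figures below are the callable-genome size and heterozygosity quoted above:

```python
def expected_snps(L, theta, n_individuals):
    """Infinite-sites neutral expectation for the number of variant sites
    discovered among N diploid individuals (2N chromosomes)."""
    n_chrom = 2 * n_individuals
    harmonic = sum(1.0 / i for i in range(1, n_chrom))  # sum of 1/i for i = 1..2N-1
    return L * theta * harmonic

# Illustrative values: ~2.72e9 confidently called bases, theta = 0.8e-3, N = 61
print(f"{expected_snps(2.72e9, 8e-4, 61):.3g}")  # ~1.17e7, near the ~13 million sites reported
```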

dbSNP rate.

Most variants are already catalogued in the dbSNP database of human variation. For a single European sample, 90% of their true variants will appear in dbSNP build 129 (Supplementary Table 5), a figure that will reach 99% following the completion of the 1000 Genomes Project (Supplementary Fig. 1). For population-level SNP calls, the aggregate dbSNP rate for the call set decreases as more rare variants, which are less often present in dbSNP, are discovered. Nevertheless, the per-sample dbSNP rate should remain consistent across individuals. Note that presence in dbSNP is not an absolute confirmation that a variant is true (for example, see Fig. 2 and Fig. 4), but because dbSNP build 129 contains 11.6 million SNP entries (only 0.4% of all genomic positions), relative differences between call sets with high dbSNP rates can be reasonably interpreted as quality differences.

Non-reference sensitivity and non-reference discrepancy (NRD) rate.

For single samples, comparison with non-reference genotype calls from microarray chips, such as HapMap3 (1.3–1.5 million sites), provides a good initial assessment of variant discovery sensitivity. With sufficient coverage, >99% of non-reference sites can generally be discovered. The NRD rate reports the percent of discordant genotype calls at commonly called non-reference sites on the chip and should reach <1% with sufficient coverage. Mathematical definitions of these terms are:

non-reference sensitivity = (no. of chip non-reference sites called non-reference) / (no. of chip non-reference sites)

NRD = (no. of sites with discordant genotypes) / (no. of commonly called sites where either the chip or the sequencing genotype is non-reference)
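A minimal Python sketch of these two metrics; the 'ref'/'het'/'hom' genotype encoding is a hypothetical simplification for illustration, not from the paper:

```python
def nrs_and_nrd(chip, seq):
    """Non-reference sensitivity (NRS) and non-reference discrepancy (NRD) rate.

    chip, seq: dicts mapping site -> genotype in {'ref', 'het', 'hom'}
    (a hypothetical encoding for illustration).
    """
    shared = [s for s in chip if s in seq]

    # NRS: fraction of chip non-reference sites also called non-reference from sequencing
    chip_nonref = [s for s in shared if chip[s] != "ref"]
    found = sum(1 for s in chip_nonref if seq[s] != "ref")
    nrs = found / len(chip_nonref) if chip_nonref else float("nan")

    # NRD: discordant genotypes among sites where either call is non-reference
    comparable = [s for s in shared if chip[s] != "ref" or seq[s] != "ref"]
    discordant = sum(1 for s in comparable if chip[s] != seq[s])
    nrd = discordant / len(comparable) if comparable else float("nan")
    return nrs, nrd
```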

Transition/transversion ratio (Ti/Tv).

The Ti/Tv ratio is a critical metric for assessing the specificity of new SNP calls. Inter-species comparisons34 and previous sequencing projects (Supplementary Table 6) agree on a Ti/Tv ratio of 2.0–2.1 for genome-wide datasets and 3.0–3.3 for exonic variation35. The expected values of the Ti/Tv for known and new variants genome wide are 2.10 and 2.07, respectively, and in the exome target are 3.5 and 3.0, respectively. The lower Ti/Tv ratio currently seen at new sites relative to known sites reflects a combination of residual false positives lowering the Ti/Tv, a relative deficit of transitions due to sequencing context bias and an apparently lower Ti/Tv ratio among lower-frequency variants. These uncertainties should limit the interpretation of minor differences in Ti/Tv ratios (<0.05), especially across sequencing technologies and datasets.

The Ti/Tv ratio for randomly assigned 'variation', such as results from systematic sequencing errors, alignment artifacts and data processing failures, will be 0.5, as there are two possible transversion mutations for each transition. Given an expected Ti/Tv ratio (Ti/Tvexpected, as above) and an observed Ti/Tv ratio from a call set (Ti/Tvobserved), an estimate of the fraction of false positive variants in the call set can be obtained by:

fraction of false positives ≈ (Ti/Tvexpected − Ti/Tvobserved) / (Ti/Tvexpected − 0.5)

which should be bounded above by 100% (for observed Ti/Tv ratios below 0.5) and below by a minimum false-positive rate (here assumed to be 0.1%) when the observed Ti/Tv exceeds the expected value.
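The estimator and its bounds reduce to a one-line clamped ratio; a minimal sketch, with the 0.1% floor from the text as the default:

```python
def fp_fraction(titv_observed, titv_expected, min_fp=0.001):
    """Estimate the false-positive fraction of a call set from its Ti/Tv ratio.

    Treats the observed ratio as a mixture of true calls (Ti/Tv = expected)
    and random artifacts (Ti/Tv = 0.5), clamped to [min_fp, 1.0].
    """
    frac = (titv_expected - titv_observed) / (titv_expected - 0.5)
    return min(1.0, max(min_fp, frac))

# e.g. new genome-wide calls with an observed Ti/Tv of 1.2 against an expected 2.07
print(fp_fraction(1.2, 2.07))  # ~0.55: roughly half of such calls would be errors
```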

Local multiple sequence realignment.

We developed a local realignment algorithm that provides a consistent alignment among all reads spanning an indel. The algorithm begins by identifying regions for realignment where (i) at least one read contains an indel, (ii) there exists a cluster of mismatching bases or (iii) an already known indel segregates at the site (for example, from dbSNP). At each region, haplotypes are constructed from the reference sequence by incorporating any known indels at the site, indels in reads spanning the site or from Smith-Waterman36 alignment of all reads that do not perfectly match the reference sequence. For each haplotype Hi, reads are aligned without gaps to Hi and scored according to:

L(Hi) = Πj L(Rj | Hi),  with  L(Rj | Hi) = Πk (1 − ɛj,k) for bases matching Hi and ɛj,k for mismatching bases

where Rj is the jth read, k is the offset in the gapless alignment of Rj and Hi and ɛj,k is the error rate corresponding to the declared quality score for the kth base of read Rj. The haplotype Hi that maximizes L(Hi) is selected as the best alternative haplotype. Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0 depending on whichever maximizes L(Rj | H). The reads are realigned if the log odds ratio of the two-haplotype model is better than the single reference haplotype by at least five log units:

log10 [ Πj max(L(Rj | Hi), L(Rj | H0)) / Πj L(Rj | H0) ] > 5
This discretization reflects a tradeoff between accuracy and efficient calculation of the full statistical quantities. Note that this algorithm operates on all reads across all individuals simultaneously, which ensures consistency in the inferred haplotypes among all individuals, a critical property for reliable indel calling and contrastive analyses such as somatic SNP and indel calling. The realigned reads are written to a SAM/BAM file for further analysis. The reads around a homozygous deletion, before and after local realignment, for Genome Analyzer reads from the 1000 Genomes Project and HiSeq, are shown in Figure 2.
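To make the scoring and the five-log-unit acceptance rule concrete, here is a simplified Python sketch under our own assumptions (a single candidate alternate haplotype and exhaustive gapless placement); it is a toy version, not the GATK implementation:

```python
import math

def read_log10_lik(read, quals, hap, offset):
    """Gapless log10 likelihood of a read against a haplotype at a fixed offset."""
    total = 0.0
    for k, base in enumerate(read):
        eps = 10 ** (-quals[k] / 10.0)  # declared per-base error rate
        total += math.log10(1.0 - eps if hap[offset + k] == base else eps)
    return total

def best_placement_lik(read, quals, hap):
    """Best gapless placement of the read anywhere on the haplotype."""
    return max(read_log10_lik(read, quals, hap, o)
               for o in range(len(hap) - len(read) + 1))

def accept_realignment(reads, ref_hap, alt_hap, threshold=5.0):
    """Accept the two-haplotype model if it beats reference-only by >= threshold log10 units.

    reads: list of (sequence, list-of-phred-qualities) pairs pooled across all individuals.
    """
    ref_only = sum(best_placement_lik(r, q, ref_hap) for r, q in reads)
    two_hap = sum(max(best_placement_lik(r, q, ref_hap),
                      best_placement_lik(r, q, alt_hap)) for r, q in reads)
    return (two_hap - ref_only) >= threshold
```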

Base quality score recalibration.

We developed a base quality recalibration algorithm that provides empirically accurate base quality scores for each base in every read while also correcting for error covariates like machine cycle and dinucleotide context, as well as supporting platform-specific error covariates like color-space mismatches for SOLiD and flow-cycles for 454 (refs. 13,14,15,37,38). For each lane, the algorithm first tabulates empirical mismatches to the reference at all loci not known to vary in the population (dbSNP build 129), categorizing the bases by their reported quality score (R), their machine cycle in the read (C) and their dinucleotide context (D). For each category we estimate the empirical quality score:

Qempirical(R, C, D) = −10 log10 [ (no. of mismatching bases in the category) / (no. of bases in the category) ]

These covariates are then broken into linearly separable error estimates and the recalibrated quality score Qrecal is calculated as:

Qrecal(br,c,d) = Qr + ΔQr + ΔΔQr,c + ΔΔQr,d,  with  ΔQr = Qempirical(r) − Qr,  ΔΔQr,c = Qempirical(r, c) − (Qr + ΔQr)  and  ΔΔQr,d = Qempirical(r, d) − (Qr + ΔQr)

where each ΔQ and ΔΔQ are the residual differences between empirical mismatch rates and those implied by the reported quality score for all observations, conditioning only on Qr or on both the covariate and Qr; Qr is the base's reported quality score and ɛr is its expected error rate; br,c,d is a base with specific covariate values, and r, c, d and R, C, D are the sets of all values of reported quality scores, machine cycles and dinucleotide contexts, respectively. The quality score and covariate distributions for four datasets before and after quality score recalibration are shown in Figure 3.
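The tabulate-then-subtract structure of the recalibration is straightforward to sketch in Python; the +1/+2 smoothing in empirical_q below is our assumption, not taken from the paper:

```python
import math
from collections import defaultdict

def empirical_q(mismatches, observations):
    """Empirical Phred quality from mismatch counts (+1/+2 smoothing assumed)."""
    return -10.0 * math.log10((mismatches + 1) / (observations + 2))

def build_recal_table(bases):
    """bases: iterable of (reported_q, cycle, dinuc, is_mismatch) tuples,
    tabulated at sites not known to vary (e.g. excluding dbSNP)."""
    by_q = defaultdict(lambda: [0, 0])    # reported quality -> [mismatches, observations]
    by_qc = defaultdict(lambda: [0, 0])   # (quality, machine cycle)
    by_qd = defaultdict(lambda: [0, 0])   # (quality, dinucleotide context)
    for q, c, d, mm in bases:
        for key, table in ((q, by_q), ((q, c), by_qc), ((q, d), by_qd)):
            table[key][0] += mm
            table[key][1] += 1
    dq = {q: empirical_q(m, n) - q for q, (m, n) in by_q.items()}
    ddq_c = {k: empirical_q(m, n) - k[0] - dq[k[0]] for k, (m, n) in by_qc.items()}
    ddq_d = {k: empirical_q(m, n) - k[0] - dq[k[0]] for k, (m, n) in by_qd.items()}
    return dq, ddq_c, ddq_d

def recalibrate(q, c, d, dq, ddq_c, ddq_d):
    """Q_recal = Q_r + dQ_r + ddQ_{r,c} + ddQ_{r,d}."""
    return q + dq.get(q, 0.0) + ddq_c.get((q, c), 0.0) + ddq_d.get((q, d), 0.0)
```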

Multi-sample SNP calling.

We apply a Bayesian algorithm for variant discovery and genotyping that simultaneously estimates the probability that two alleles A, the reference allele, and B, the alternative allele, are segregating in a sample of N individuals and the likelihoods of the AA, AB and BB genotypes for each individual. Given Di aligned bases at a specific genomic position for individual i, we estimate the genotype likelihoods GTi of observing the Di bases for each of the AA, AB and BB genotypes according to the following equation:

Pr{Di | GTi = H1H2} = Πj [ ½ Pr{Di,j | H1} + ½ Pr{Di,j | H2} ],  with  Pr{Di,j | A} = 1 − ɛi,j if Di,j = A and ɛi,j Pr{A is true | Di,j is miscalled} otherwise (and symmetrically for B)

where Pr{Di,j | GTi} is the probability of observing base Di,j under the hypothesized genotype GTi; Pr{Di,j | B} and Pr{Di,j | A} are the probabilities of observing base Di,j given that the true base is B or A, respectively; ɛi,j is the probability of a base miscall given the quality score of base Di,j; and Pr{B is true | Di,j is miscalled} is the probability of B being the true chromosomal base given that Di,j is a miscall (Supplementary Table 7). As these are raw likelihoods, no prior probabilities are applied.
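A minimal Python version of the per-individual likelihood; the uniform-miscall assumption below is ours (the paper instead uses the empirical miscall probabilities of Supplementary Table 7):

```python
import math

def genotype_log10_likelihoods(pileup, ref, alt):
    """Log10 likelihoods of the AA (ref/ref), AB and BB genotypes at one site.

    pileup: list of (called_base, phred_quality) pairs for one individual.
    """
    genotypes = {"AA": (ref, ref), "AB": (ref, alt), "BB": (alt, alt)}

    def p_base(base, allele, eps):
        # Uniform-miscall assumption: an error is any of the other three bases
        return 1.0 - eps if base == allele else eps / 3.0

    log10 = dict.fromkeys(genotypes, 0.0)
    for base, q in pileup:
        eps = 10 ** (-q / 10.0)  # Phred quality -> miscall probability
        for gt, (h1, h2) in genotypes.items():
            log10[gt] += math.log10(0.5 * p_base(base, h1, eps)
                                    + 0.5 * p_base(base, h2, eps))
    return log10

# e.g. six Q30 reads: four reference 'A' bases and two alternate 'G' bases
print(genotype_log10_likelihoods([("A", 30)] * 4 + [("G", 30)] * 2, "A", "G"))
```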

Let us define qi ∈ {0, 1, 2} as the number of alternate B alleles carried by individual i, so that q = Σi qi is the number of chromosomes carrying the B allele among all individuals. We estimate the probability that q = X as:

Pr{q = X | D} ∝ Pr{q = X} Σ{GT1,…,GTN}∈Γ Πi=1…N Pr{Di | GTi}

where Γ is the set of all genotype assignments for the N individuals that contain exactly q = X B alleles, Pr{q = X} is the infinite-sites neutral expectation of observing X alternative alleles in 2N chromosomes with heterozygosity of θ, and GTi and Di are the ith individual's genotype and NGS reads, respectively. The sum over Γ involves potentially evaluating 3^N combinations but can be approximated by a heuristic algorithm like expectation-maximization through the introduction of a Hardy-Weinberg equilibrium assumption, using a greedy combinatorial search algorithm (Supplementary Note) or using an exact summation (H. Li, unpublished data). This algorithm emits the probability of a variant segregating at the site at some frequency:

Pr{q > 0 | D} = 1 − Pr{q = 0 | D}

represented conventionally by the Phred-scaled confidence QUAL = −10 log10 Pr{q = 0 | D}, as well as the genotype assignments at the value of q that maximizes Pr{q | D}. Only sites with QUAL > Q50 for deep coverage or QUAL > Q10 for shallow coverage are considered here as potentially variable sites.

Variant quality score recalibration.

Given a set of putative variants along with SNP error covariate annotations, variant quality score recalibration employs a variational Bayes Gaussian mixture model (GMM)39 to estimate the probability that each variant is a true polymorphism in the samples rather than a sequencer, alignment or data processing artifact. The set of variants {vi} is treated as an n-dimensional point cloud, with each variant vi positioned by its covariate annotation vector ai. A mixture of Gaussians is fit to the set of likely true variants, here approximated by the variants already present in HapMap3 (Fig. 4a). Following training, this mixture model is used to estimate the probability of each variant call being true (Fig. 4b), capturing the intuition that variants with characteristics similar to those of previously known variants are likely to be real, whereas those with unusual characteristics are more likely to be machine or data processing artifacts.

Mathematically, we write the probability of a variant's vector of covariate values as the linear superposition of Gaussians:

Pr{ai} = Σk=1…K πk N(ai | μk, Σk)

Pr{π} = Dirichlet(π | α0)

Pr{μk, Σk} = N(μk | m0, β0−1Σk) Inverse-Wishart(Σk | W0, ν0)

where K is the number of Gaussians in the mixture (GMM) and the last two equations are the standard conjugate prior distributions over the parameters π, μ and Σ.

We then use an analog of the expectation-maximization algorithm39 to learn the optimal parameters for the clusters using only variant calls at sites present in HapMap3. By restricting training to known polymorphic sites, the resulting GMM captures the distribution of covariate parameters for true SNPs. Consequently, we estimate the likelihood of each putative variant vi being true under the learned GMM as:

Pr{vi is true | ai} ∝ Pr{ai | π, μ, Σ} Pr{vi}

where Pr{vi} is the prior expectation that the putative variant vi is true, ai is the vector of covariate values for vi, FPsingleton is the false positive rate for singletons (50% here), and AC is the number of chromosomes estimated to carry the variant among all called samples. The prior probability Pr{vi} depends on whether the variant is present in HapMap3 and on its frequency (AC) in the samples being called, given the estimate of the false positive rate for singletons. This model can be easily extended to include more training data, more prior information and/or more error covariates.
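A variational Bayes GMM of exactly this train-on-known-sites, score-everything form is available off the shelf; the sketch below uses scikit-learn (our choice of library, and n_components=8 is an illustrative setting, not the paper's):

```python
from sklearn.mixture import BayesianGaussianMixture

def vqsr_log_likelihoods(annotations, is_training_site, n_components=8):
    """Score variants by their similarity to known true sites under a VB-GMM.

    annotations: (n_variants, n_annotations) numpy array of error covariates
    (e.g. strand bias, quality by depth); is_training_site: boolean mask of
    variants overlapping HapMap3.
    """
    gmm = BayesianGaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(annotations[is_training_site])   # train on likely-true variants only
    return gmm.score_samples(annotations)    # log Pr{a_i | GMM} for every variant
```

Combining these log-likelihoods with the prior Pr{vi} then gives the posterior used to rank calls.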

For convenience of presentation and analysis, we partition the raw SNP calls into tranches based on the Ti/Tv ratio of their new variants. For each desired new false discovery rate target (FDRi), tranchei is defined as:

tranchei = { vi : Pr{vi is true} ≥ ti },  where ti is the lowest threshold such that the Ti/Tv-implied FDR among the new calls in the tranche remains ≤ FDRi
The first tranche is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. More specificity in the learned GMM translates into better-separated tranches, where all true variants have high likelihoods and appear in the lowest FDR tranches and all false ones have low likelihoods and are excluded. Downstream applications can select, in a principled way, more specific or more sensitive call sets, or can incorporate the recalibrated quality scores directly, weighting individual variant calls by their probability of being real rather than analyzing only a fixed subset of calls.
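A sketch of the tranche construction, combining the per-variant scores with the Ti/Tv-implied FDR estimate from 'Evaluating the quality of SNP calls' (the target FDRs mirror the text; everything else is our illustration):

```python
import numpy as np

def titv_fdr(ti, tv, titv_expected=2.07, min_fp=0.001):
    """FDR implied by the Ti/Tv ratio of the novel calls (see Online Methods)."""
    titv = ti / tv if tv else titv_expected
    return min(1.0, max(min_fp, (titv_expected - titv) / (titv_expected - 0.5)))

def tranche_cutoffs(scores, is_transition, fdr_targets=(0.001, 0.01, 0.1)):
    """For each FDR target, the score cutoff of the largest call set meeting it.

    scores: per-variant recalibrated quality scores for the novel calls;
    is_transition: boolean numpy array marking transition substitutions.
    """
    order = np.argsort(scores)[::-1]                 # best-scoring calls first
    ti = np.cumsum(is_transition[order])             # cumulative transitions
    tv = np.cumsum(~is_transition[order])            # cumulative transversions
    fdrs = np.array([titv_fdr(t, v) for t, v in zip(ti, tv)])
    cutoffs = {}
    for target in fdr_targets:
        ok = np.nonzero(fdrs <= target)[0]
        cutoffs[target] = scores[order[ok[-1]]] if ok.size else None
    return cutoffs
```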