To the editor:
Applications of rapidly advancing sequencing technology intensify the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon-capture techniques will direct sequencing efforts to the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow.
Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/, Supplementary Software), for predicting damaging effects of missense mutations. PolyPhen-2 differs from the earlier tool PolyPhen (ref. 1) in the set of predictive features, the alignment pipeline and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1), which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele. The alignment pipeline selects a set of homologous sequences using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The most informative predictive features characterize how likely the two human alleles are to occupy the site given the pattern of amino-acid replacements in the multiple-sequence alignment; how distant the protein harboring the first deviation from the human wild-type allele is from the human protein; and whether the mutant allele originated at a hypermutable site (ref. 2). The functional importance of an allele replacement is predicted from its individual features (Supplementary Figs. 2, 3, 4) by a naive Bayes classifier (Supplementary Methods).
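The naive Bayes step described above can be illustrated with a minimal sketch. It is not the PolyPhen-2 implementation: the discrete feature values, the Laplace smoothing parameter `alpha` and the function names `train_naive_bayes` and `posterior_damaging` are all assumptions made for illustration, and the actual features and their treatment are described in the Supplementary Methods.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count feature values per class.

    rows   : list of tuples of discrete feature values (hypothetical features)
    labels : 1 for damaging, 0 for nondamaging; both classes must be present
    alpha  : Laplace smoothing constant (an assumption, not from the paper)
    """
    n_features = len(rows[0])
    class_counts = {0: 0, 1: 0}
    # value_counts[c][j][v] = number of class-c examples with feature j == v
    value_counts = {c: [defaultdict(int) for _ in range(n_features)] for c in (0, 1)}
    values_seen = [set() for _ in range(n_features)]
    for row, y in zip(rows, labels):
        class_counts[y] += 1
        for j, v in enumerate(row):
            value_counts[y][j][v] += 1
            values_seen[j].add(v)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "values_seen": values_seen,
        "alpha": alpha,
    }


def posterior_damaging(model, row):
    """Posterior probability of the damaging class, features assumed independent."""
    alpha = model["alpha"]
    total = sum(model["class_counts"].values())
    log_score = {}
    for c in (0, 1):
        n_c = model["class_counts"][c]
        s = math.log(n_c / total)  # log prior
        for j, v in enumerate(row):
            k = len(model["values_seen"][j])
            count = model["value_counts"][c][j].get(v, 0)
            s += math.log((count + alpha) / (n_c + alpha * k))  # smoothed likelihood
        log_score[c] = s
    # Normalize in log space to avoid underflow with many features
    m = max(log_score.values())
    e0 = math.exp(log_score[0] - m)
    e1 = math.exp(log_score[1] - m)
    return e1 / (e0 + e1)


# Toy data: two made-up features per substitution
toy_rows = [("conserved", "far"), ("conserved", "far"),
            ("variable", "near"), ("variable", "near")]
toy_labels = [1, 1, 0, 0]
toy_model = train_naive_bayes(toy_rows, toy_labels)
p_damaging = posterior_damaging(toy_model, ("conserved", "far"))
```

On this toy set the posterior for a substitution at a conserved site comes out high, and the mirror-image input comes out low, which is the behavior expected of the classifier in outline.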
We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles annotated in the UniProt database as causing human Mendelian diseases and affecting protein stability or function, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be nondamaging (Supplementary Methods). The second pair, HumVar (ref. 3), consists of all 13,032 human disease-causing mutations from UniProt and 8,946 human nonsynonymous single-nucleotide polymorphisms (nsSNPs) without annotated involvement in disease, which we treated as nondamaging.
We found that PolyPhen-2 performance, as measured by receiver operating characteristic (ROC) curves, was consistently superior to that of PolyPhen (Fig. 1b) and also compared favorably with that of three other popular prediction tools (refs. 4–6; Fig. 1c). At a false positive rate of 20%, PolyPhen-2 achieved true positive prediction rates of 92% and 73% on the HumDiv and HumVar datasets, respectively (Supplementary Table 2).
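To make the operating point quoted above concrete, the following sketch shows one way to read a true positive rate off a ROC curve at a fixed false positive rate. The function name `tpr_at_fpr` and the toy scores are hypothetical; this is not the evaluation code used for Fig. 1 or Supplementary Table 2.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.20):
    """Return (threshold, tpr, fpr) for the most sensitive threshold
    whose false positive rate does not exceed max_fpr.

    scores : classifier scores, higher means more likely damaging
    labels : 1 for damaging, 0 for nondamaging
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best = (float("inf"), 0.0, 0.0)  # threshold above every score: nothing called damaging
    # Lowering the threshold can only increase TPR and FPR, so stop at the
    # first threshold that breaks the FPR limit.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (t, tp / n_pos, fpr)
    return best


# Made-up scores for five damaging and five nondamaging variants
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_truth = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
threshold, tpr, fpr = tpr_at_fpr(toy_scores, toy_truth, max_fpr=0.20)
```

With these toy values, one false positive is tolerated out of five nondamaging variants, and four of the five damaging variants are recovered at that threshold.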
One reason for the lower accuracy of predictions on HumVar is that nsSNPs assumed to be nondamaging in the HumVar dataset included a sizable fraction of mildly deleterious alleles. In contrast, most amino-acid replacements assumed nondamaging in the HumDiv dataset must be close to selective neutrality. Because alleles that are mildly but unconditionally deleterious may not be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which were assigned to opposite categories in HumVar data. Another reason is that the HumDiv dataset uses extra criteria (Supplementary Methods) to avoid possible erroneous annotations of damaging mutations.
PolyPhen-2 calculates the naive Bayes posterior probability that a given mutation is damaging and reports estimates of false positive (the chance that the mutation is classified as damaging when it is in fact nondamaging) and true positive (the chance that the mutation is classified as damaging when it is indeed damaging) rates. A mutation is also appraised qualitatively, as benign, possibly damaging or probably damaging (Supplementary Methods).
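As a rough illustration of this reporting step, the sketch below estimates the false and true positive rates associated with a score from labeled reference scores and maps the score to one of the three qualitative labels. The cutoffs `possibly_cutoff` and `probably_cutoff` are placeholders, not the published values, and whether the actual tool thresholds the posterior or the estimated rates is specified only in the Supplementary Methods; the function names are likewise hypothetical.

```python
def rates_at_score(score, damaging_scores, nondamaging_scores):
    """Estimate (fpr, tpr) if every variant scoring >= score is called damaging.

    damaging_scores / nondamaging_scores : scores of labeled reference variants
    """
    tpr = sum(s >= score for s in damaging_scores) / len(damaging_scores)
    fpr = sum(s >= score for s in nondamaging_scores) / len(nondamaging_scores)
    return fpr, tpr


def appraise(score, possibly_cutoff=0.5, probably_cutoff=0.9):
    """Map a posterior probability to a qualitative label.

    The default cutoffs are placeholders chosen for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability")
    if score >= probably_cutoff:
        return "probably damaging"
    if score >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


# Made-up reference scores
ref_damaging = [0.95, 0.9, 0.7, 0.85]
ref_nondamaging = [0.1, 0.2, 0.85, 0.3, 0.4]
est_fpr, est_tpr = rates_at_score(0.8, ref_damaging, ref_nondamaging)
label = appraise(0.8)
```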
The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases require distinguishing mutations with drastic effects from other human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used to evaluate rare alleles at loci potentially involved in complex phenotypes, for dense mapping of regions identified by genome-wide association studies and for analysis of natural selection from sequence data, in which even mildly deleterious alleles must be treated as damaging.
Note: Supplementary information is available on the Nature Methods website.
References
1. Ramensky, V., Bork, P. & Sunyaev, S. Nucleic Acids Res. 30, 3894–3900 (2002).
2. Schmidt, S. et al. PLoS Genet. 4, e1000281 (2008).
3. Capriotti, E., Calabrese, R. & Casadio, R. Bioinformatics 22, 2729–2734 (2006).
4. Ng, P.C. & Henikoff, S. Nucleic Acids Res. 31, 3812–3814 (2003).
5. Bromberg, Y., Yachdav, G. & Rost, B. Bioinformatics 24, 2397–2398 (2008).
6. Yue, P., Melamud, E. & Moult, J. BMC Bioinformatics 7, 166 (2006).
Acknowledgements
We thank Y. Bromberg for help with the SNAP analysis. V.E.R. acknowledges support by the Russian Academy of Sciences Program in Molecular and Cellular Biology. This work was supported by the US National Institutes of Health (R01 GM078598 and in part by R01 MH084676).
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–4, Supplementary Tables 1–2, Supplementary Methods (PDF 646 kb)
Supplementary Software
PolyPhen-2 standalone software for Linux/Mac OS X (ZIP 414 kb)
Cite this article
Adzhubei, I., Schmidt, S., Peshkin, L. et al. A method and server for predicting damaging missense mutations. Nat Methods 7, 248–249 (2010). https://doi.org/10.1038/nmeth0410-248