To the editor:

Applications of rapidly advancing sequencing technology exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon-capture techniques will direct sequencing efforts to the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow.

Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/, Supplementary Software), for predicting damaging effects of missense mutations. PolyPhen-2 differs from its predecessor, PolyPhen (ref. 1), in the set of predictive features, the alignment pipeline and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1), which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele. The alignment pipeline selects a set of homologous sequences using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The most informative predictive features characterize how likely the two human alleles are to occupy the site given the pattern of amino-acid replacements in the multiple-sequence alignment; how distant the protein harboring the first deviation from the human wild-type allele is from the human protein; and whether the mutant allele originated at a hypermutable site (ref. 2). The functional importance of an allele replacement is predicted from its individual features (Supplementary Figs. 2, 3, 4) by a naive Bayes classifier (Supplementary Methods).
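To make the classification step concrete, the following is a minimal sketch of a naive Bayes classifier over discrete features. It is not the PolyPhen-2 implementation, which is described in the Supplementary Methods. The two toy features (a conservation bin and a hypermutable-site flag), the training rows, the Laplace smoothing constant and all function names are assumptions chosen for illustration only.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Count feature values per class. rows: tuples of discrete values; labels: 0 or 1."""
    n_features = len(rows[0])
    class_counts = {0: 0, 1: 0}
    # value_counts[c][j][v] = number of class-c rows whose feature j equals v
    value_counts = {c: [defaultdict(int) for _ in range(n_features)] for c in (0, 1)}
    seen_values = [set() for _ in range(n_features)]
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            value_counts[label][j][value] += 1
            seen_values[j].add(value)
    return class_counts, value_counts, seen_values


def posterior_damaging(model, row, alpha=1.0):
    """Posterior probability of class 1 (damaging), treating features as independent."""
    class_counts, value_counts, seen_values = model
    total = class_counts[0] + class_counts[1]
    log_joint = {}
    for c in (0, 1):
        log_p = math.log(class_counts[c] / total)  # class prior
        for j, value in enumerate(row):
            count = value_counts[c][j].get(value, 0)
            k = len(seen_values[j])
            # Laplace-smoothed likelihood of this feature value given the class
            log_p += math.log((count + alpha) / (class_counts[c] + alpha * k))
        log_joint[c] = log_p
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(v - top) for c, v in log_joint.items()}
    return weights[1] / (weights[0] + weights[1])


# Hypothetical toy data: (conservation bin, hypermutable site?) -> 1 = damaging
toy_rows = [("high", "no"), ("high", "no"), ("high", "yes"),
            ("low", "no"), ("low", "yes"), ("low", "yes")]
toy_labels = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(toy_rows, toy_labels)
print(round(posterior_damaging(model, ("high", "no")), 3))
```

Working in log space keeps the product of many small likelihoods from underflowing, which matters more as the number of features grows.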

Figure 1: PolyPhen-2 pipeline and prediction accuracy.

(a) Overview of the algorithm. MSA, multiple sequence alignment. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using fivefold cross-validation on HumDiv and HumVar (ref. 3) data, using UniRef100 and Swiss-Prot databases for the homology search. Also shown are ROC curves for PolyPhen on HumDiv and HumVar calculated from the difference between position-specific independent counts (PSIC) scores (ref. 1) of the wild-type and the mutant amino acids. (c) ROC curves for PolyPhen-2 trained on HumDiv and tested on a subset of HumVar data nonoverlapping with HumDiv data. UniRef100 and Swiss-Prot databases were used for the homology search. Also shown are ROC curves obtained using the programs sorting intolerant from tolerant (SIFT; ref. 4), screening for nonacceptable polymorphisms (SNAP; ref. 5) and SNPs3D (ref. 6) on HumVar data. Methods other than PolyPhen-2 and PolyPhen could not easily be applied to HumDiv data because using the same sequences for obtaining both multiple alignments and nondamaging replacements must be avoided. SIFT was used with the Swiss-Prot database; SNAP and SNPs3D were used with their corresponding default databases. We chose Swiss-Prot for the SIFT homology search because it does not contain sequences of splice forms, sequences of human allelic variants or incomplete sequences, which makes it possible to guarantee that allelic variants used in testing datasets do not appear in the multiple-sequence alignments.

We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles annotated in the UniProt database as causing human Mendelian diseases and affecting protein stability or function, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be nondamaging (Supplementary Methods). The second pair, HumVar (ref. 3), consists of all 13,032 human disease-causing mutations from UniProt and 8,946 human nonsynonymous single-nucleotide polymorphisms (nsSNPs) without annotated involvement in disease, which we treated as nondamaging.

We found that the performance of PolyPhen-2, as shown by its receiver operating characteristic curves, was consistently superior to that of PolyPhen (Fig. 1b) and compared favorably with that of three other popular prediction tools (refs. 4-6; Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieved true positive prediction rates of 92% and 73% on HumDiv and HumVar datasets, respectively (Supplementary Table 2).
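A figure such as "true positive rate at a 20% false positive rate" corresponds to reading one operating point off a ROC curve. The sketch below shows one way to compute it from scored, labeled examples. The scores and labels are invented toy values, not HumDiv or HumVar data, and the function name is hypothetical.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.20):
    """Return (threshold, tpr, fpr) with the highest TPR whose FPR does not exceed max_fpr.

    A mutation is called damaging when its score is >= threshold.
    labels: 1 = truly damaging, 0 = truly nondamaging.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (float("inf"), 0.0, 0.0)  # threshold that calls nothing damaging
    # Lowering the threshold can only add calls, so TPR and FPR never decrease.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (threshold, tp / n_pos, fpr)
    return best


# Hypothetical classifier scores and true labels
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
toy_truth = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
threshold, tpr, fpr = tpr_at_fpr(toy_scores, toy_truth, max_fpr=0.20)
print(threshold, tpr, fpr)
```

Sweeping `max_fpr` from 0 to 1 and collecting the resulting (fpr, tpr) pairs traces the full ROC curve for the same scores.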

One reason for the lower accuracy of predictions on HumVar is that nsSNPs assumed to be nondamaging in the HumVar dataset included a sizable fraction of mildly deleterious alleles. In contrast, most amino-acid replacements assumed nondamaging in the HumDiv dataset must be close to selective neutrality. Because alleles that are mildly but unconditionally deleterious may not be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which were assigned to opposite categories in HumVar data. Another reason is that the HumDiv dataset uses extra criteria (Supplementary Methods) to avoid possible erroneous annotations of damaging mutations.

PolyPhen-2 calculates the naive Bayes posterior probability that a given mutation is damaging and reports estimates of false positive (the chance that the mutation is classified as damaging when it is in fact nondamaging) and true positive (the chance that the mutation is classified as damaging when it is indeed damaging) rates. A mutation is also appraised qualitatively, as benign, possibly damaging or probably damaging (Supplementary Methods).
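The reporting step described above can be sketched as follows: given a posterior probability, estimate the false and true positive rates that would result from using that probability as the decision threshold, then map it to a qualitative label. The cutoffs of 0.5 and 0.9 are placeholders chosen for illustration; the criteria PolyPhen-2 actually uses are given in the Supplementary Methods and are not reproduced here. The reference scores and function names are likewise hypothetical.

```python
def rates_at_score(score, reference):
    """Estimate (fpr, tpr) if every mutation scoring >= score were called damaging.

    reference: list of (score, label) pairs with known labels,
    1 = damaging, 0 = nondamaging.
    """
    pos = [s for s, y in reference if y == 1]
    neg = [s for s, y in reference if y == 0]
    tpr = sum(1 for s in pos if s >= score) / len(pos)
    fpr = sum(1 for s in neg if s >= score) / len(neg)
    return fpr, tpr


def appraise(posterior, possibly_cutoff=0.5, probably_cutoff=0.9):
    """Map a posterior probability to a qualitative label (cutoffs are assumptions)."""
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    if posterior >= probably_cutoff:
        return "probably damaging"
    if posterior >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


# Hypothetical reference set of scored mutations with known status
toy_reference = [(0.9, 1), (0.8, 1), (0.3, 1),
                 (0.7, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
posterior = 0.6
fpr, tpr = rates_at_score(posterior, toy_reference)
print(appraise(posterior), fpr, tpr)
```

Reporting the rates alongside the label lets a user judge how much to trust a borderline call instead of relying on the three-way category alone.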

The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases require distinguishing mutations with drastic effects from other human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used to evaluate rare alleles at loci potentially involved in complex phenotypes, for dense mapping of regions identified by genome-wide association studies and for analysis of natural selection from sequence data, in which even mildly deleterious alleles must be treated as damaging.

Note: Supplementary information is available on the Nature Methods website.