In principle, given the amino acid sequence of a protein, it is possible to compute the corresponding three-dimensional structure. Methods for modelling structure based on this premise have been under development for more than 40 years. For the past decade, a series of community wide experiments (termed Critical Assessment of Structure Prediction (CASP)) have assessed the state of the art, providing a detailed picture of what has been achieved in the field, where we are making progress, and what major problems remain. The rigorous evaluation procedures of CASP have been accompanied by substantial progress. Lessons from this area of computational biology suggest a set of principles for increasing rigor in the field as a whole.
In the 1950s, work by Anfinsen & colleagues conclusively showed that the information determining the three-dimensional structure of a protein molecule is contained in the amino acid sequence. Recognition of this relationship rapidly led to the development of methods for computing structure from sequence. There were many early encouraging reports of partial success, starting in the 1960s and continuing through the 1970s and 1980s. And yet, during this long period, there were very few reports of computed structures in any way competing with those obtained experimentally. The mismatch between apparent success and the lack of useful applications suggested that the traditional peer reviewed publication system is not sufficient to ensure rigor in this area of computational biology. The Critical Assessment of Structure Prediction (CASP) experiments were devised as a means of addressing the specific needs of methods evaluation in structure modelling. CASP is one of a number of ways in which this problem may be addressed. As discussed later, the fundamental differences between computational and experimental biology dictate that new procedures be adopted in the field as a whole.
CASP is a community wide experiment with the goal of assessing the effectiveness of methods for modelling protein structure. The aims are to provide detailed information about the strengths and weaknesses of current structure modelling methods, to identify where progress has been made, to show where there are serious bottlenecks to further progress, and to indicate how these may eventually be removed. Key features are:
The use of bona fide blind predictions, rather than the previous practice of reproducing already known structures.
Participants provide models for the same set of proteins, greatly facilitating comparison of performance.
Predictions are made on a reasonably large set of proteins, reducing the impact of case specific artefacts.
There are multiple independent approaches to evaluation, reducing bias.
All models and analysis results are freely available to all, allowing maximum use to be made of the data.
The experiment has been conducted every 2 years since 1994 (CASP1), with the most recent one taking place in 2004 (CASP6). Information about soon-to-be experimentally determined protein structures is collected, and passed on to registered predictors. More than 200 prediction teams from 24 countries participated in CASP6, providing over 30 000 predictions on 90 protein domains. Predictions are evaluated using a battery of numerical criteria (Zemla et al. 2001) and, more importantly, are carefully examined by independent assessors. A conference is held to discuss the results, and a special issue of the journal Proteins is published, with articles by the assessors and by some of the more successful prediction teams. Details for the fifth experiment can be found in the most recent journal issue (Moult et al. 2003). In particular, articles by the three assessment groups (Aloy et al. 2003; Kinch et al. 2003; Tramontano & Morea 2003) provide a detailed overview of the state of the art at that time, and another article puts the results in the context of previous CASPs (Venclovas et al. 2003). The Proteins issue for the sixth experiment will appear in early 2006. All participant registration, target management, prediction collection and numerical analysis are handled by the Protein Structure Prediction Center (Zemla et al. 2001). The Center web site (predictioncenter.org) provides access to details of the experiment and all results. A second web site (www.forcasp.org) provides a discussion forum for the CASP community.
3. Classes of structure prediction difficulty
Early work in the structure modelling field focused on understanding the nature of the natural protein folding process, and on the development of physics based force fields to determine the relative free energy of any conformation of a polypeptide chain. These methods were much in evidence at the first CASP, but have largely been supplanted by more successful ‘knowledge based’ approaches, which use the large and growing set of experimentally determined structures and sequences, in a variety of ways. As a consequence, accuracy of models depends on similarity to already known structures, and the number of related sequences that are available. Based on this consideration, CASP considers three classes of modelling difficulty, discussed in the following sections.
4. Comparative modelling based on a clear sequence relationship
For cases where there is an easily detectable sequence relationship between a target protein and one or more proteins of known structure (a highly statistically significant score from a BLAST search; Altschul et al. 1990), an accurate core model (typically 2–3 Å RMS error on Cα atoms) can be obtained by copying from the structural template or templates (Tramontano & Morea 2003). Copying is often non-trivial, requiring a correct alignment of the target and template sequences. Improvements over the CASPs have resulted in largely correct alignments in this modelling zone. A single template structure rarely provides a complete model. Alternative templates may provide some additional structural features, and short regions of chain (‘loops’) are sometimes modelled in an approximately correct manner. Generally, reliably building regions of the structure not present in a template remains a challenge. Side chain conformations are very tightly correlated with backbone conformation (Chung & Subbiah 1995), so not surprisingly, side chain accuracy in these approximate models is poor.
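The RMS error on Cα atoms quoted above is the root-mean-square deviation between equivalent Cα positions in the model and in the experimental structure. The following is a minimal sketch in Python, assuming the two coordinate sets have already been superposed and are listed in the same residue order. The function name `ca_rmsd` and the toy coordinates are illustrative only and are not taken from any CASP software.

```python
import math

def ca_rmsd(model, target):
    """RMSD (in Å) between two equal-length lists of (x, y, z) Cα coordinates.

    Assumption: the structures are already superposed and residue-matched;
    no optimal superposition is performed here.
    """
    if len(model) != len(target) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, target):
        sq_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(sq_sum / len(model))

# Toy data: every model atom is displaced by 2 Å along y relative to the target.
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 2.0, 0.0), (3.8, 2.0, 0.0), (7.6, 2.0, 0.0)]
rmsd = ca_rmsd(model, target)  # 2.0 Å for this toy case
```

In practice the two structures would first be superposed to minimize this quantity, so the value computed here is an upper bound on the superposed RMSD.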
A typical CASP6 comparative model is shown in figure 1, for Target 266, an Aeropyrum pernix homologue of the Haemophilus influenzae proline tRNA editing enzyme (An & Musier-Forsyth 2004). For large regions of the structure the template provides an accurate guide, resulting in good overall quality. Two non-template loop regions (A and B) are successfully modelled. The largest differences between the template and the target are in two helices (H1 and H2) flanking the active site, suggesting different substrate specificities. The best models leave the helices in the template orientation, so it is not possible to analyse possible specificity differences. In general, although the structure around active sites is usually well conserved between proteins with the same specificity, it is often the least conserved when the specificities differ.
While large parts of this class of model are approximately correct, they require refinement to be competitive with experiment, and to reproduce key functional features. Refinement remains the principal bottleneck to progress, and is now receiving a large amount of attention. In spite of limitations, these models are very useful for a variety of purposes, often identifying which members of a protein family have the same detailed function, and which are different (DeWeese-Scott & Moult 2004).
5. Modelling based on more distant evolutionary relationships
A second class of model quality is provided by those cases where an evolutionary relationship can be detected with more sophisticated methods than just BLAST. The core of these methods is alignment of a set of sequences, so that the characteristics of protein families may be used to detect relationships (Altschul et al. 1997; Karplus et al. 1998; Karplus & Hu 2001; Marti-Renom et al. 2004; Kahsay et al. 2005). Structural information is also used in a number of ways to enhance the detection of homologues (Sippl 1993; Bates et al. 2001; Karplus et al. 2003; McGuffin & Jones 2003; Venclovas 2003; von Grotthuss et al. 2003; Przybylski & Rost 2004; Wrabl & Grishin 2004).
Models based on the detection of these more distant relationships are limited in accuracy by four factors: identifying suitable structural templates, accuracy of alignment of sequence onto a template, conformational differences between the core template and target structures, and the difficulty of modelling regions of the target not available from a template. Nevertheless, methodological improvements together with the increased size of the pool of known structures and sequences have resulted in a steady improvement in model quality over the course of the CASP experiments. Further progress will depend on two main factors. The first is effective application of template free modelling methods to those regions not found in a template; as outlined below, improvements in that area make this possible. The second factor is accurate alignment. This will likely require refinement at an all-atom level, since the information needed to distinguish between alternative alignments is contained in the detailed atomic interactions.
Although these models are not highly accurate, they nevertheless are useful for providing an overall idea of what a structure is like, helping choose residues for mutagenesis experiments, for example. They also often establish evolutionary relationships to more studied proteins, and so provide valuable approximate information about molecular function. Figure 2 shows an example from CASP6.
6. Modelling of new folds
For proteins with folds that have not previously been found, and those where no relationship to a protein of known structure can be detected, a different set of methods is needed. Traditionally, this was the area where physics based approaches were used. These methods are still used by a few CASP participants, but have been largely displaced. Newer methods primarily utilize the fact that although we are far from observing all folds used in biology (Coulson & Moult 2002), we probably have seen nearly all substructures (Du et al. 2003). Methods make use of these partial structure relationships on a range of scales (Bystroff et al. 2004), from a few residues (Rohl et al. 2004), through secondary structure units, to super-secondary units (Jones & McGuffin 2003). Structure fragments are chosen on the basis of compatibility of the substructure with the local target sequence and compatibility of secondary structure propensity. Since the sequence/structure relationship is rarely strong enough to completely determine the structure of fragments (Bystroff et al. 1996), a range of possible conformations for each fragment is usually selected, and many possible combinations of sub-structures are considered. Initial structures are assembled from fragments, and approximate potentials are used to guide a conformational search process, together with other information, such as prediction of residue contacts (Aloy et al. 2003). A large number of possible complete structures (1000–100 000) are usually generated. The most successful package using this strategy is Rosetta (Rohl et al. 2004). For proteins of fewer than about 100 residues, these procedures may produce one or a few approximately correct structures (4–6 Å RMSD on Cα atoms).
Selecting the most accurate structures from the large set of candidates is currently not a fully solved problem, and most methods rely on clustering procedures, selecting representative structures at the centre of the largest clusters of generated candidates (Skolnick et al. 2001). Reliable identification of accurate models will require the use of refined all-atom models. Thus, in this class of modelling too, the development of atomic level refinement methods is likely crucial to major progress.
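The clustering-based selection described above can be sketched as follows. This is a simplified neighbour-counting variant, assuming a precomputed symmetric matrix of pairwise distances between candidate structures (for example Cα RMSD) and an arbitrarily chosen cutoff. It is not the specific procedure of Skolnick et al. (2001), and `pick_cluster_centre` is a hypothetical name.

```python
def pick_cluster_centre(dist, cutoff):
    """Return (index, members) for the candidate with the most neighbours.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    cutoff -- two candidates are neighbours if their distance is <= cutoff
    Ties are resolved in favour of the lowest index.
    """
    best_idx, best_members = None, []
    for i, row in enumerate(dist):
        members = [j for j, d in enumerate(row) if d <= cutoff]
        if len(members) > len(best_members):
            best_idx, best_members = i, members
    return best_idx, best_members

# Toy distances (Å) between five candidates: 0, 1 and 2 are similar,
# with candidate 1 in the middle; 3 and 4 are outliers.
dist = [
    [0.0, 2.0, 3.5, 9.0, 8.0],
    [2.0, 0.0, 1.5, 9.5, 8.5],
    [3.5, 1.5, 0.0, 9.0, 8.0],
    [9.0, 9.5, 9.0, 0.0, 7.0],
    [8.0, 8.5, 8.0, 7.0, 0.0],
]
centre, members = pick_cluster_centre(dist, cutoff=3.0)
# centre is 1, with members [0, 1, 2]
```

The choice of cutoff strongly affects which cluster appears largest, which is one reason this kind of selection is only a partial solution.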
In CASP1, all new fold models were close to random. There has been steady improvement over the CASPs, and by CASP6 most non-homology targets of fewer than 100 residues have models that visual inspection shows to resemble experiment. An example is shown in figure 3. Models for larger proteins or domains are still rarely usefully accurate. Thus, while there is very impressive progress for small proteins, there is still a long way to go before all proteins can be modelled at that level. Also, although topologically pleasing, these models often have significant alignment and other errors. Nevertheless, progress over the decade of CASP has been substantial.
7. Major current challenges
Overcoming four of the current major bottlenecks (producing close evolutionary relationship models approaching experimental accuracy, improved alignments, refinement of remote evolutionary relationship models, and reliable discrimination between possible template free models) depends on the development of effective all-atom structure refinement procedures. The ‘refinement’ problem has received increasing attention in recent years (http://www.nigms.nih.gov/psi/reports/comparative_modeling.html). At CASP6, for the first time, there was a report of an initial model refined from a backbone RMSD of about 2.2 Å to about 1.6 Å, with many of the core side chains correctly oriented (Schueler-Furman et al. 2005). The same technology has been effective in protein design (Schueler-Furman et al. 2005), and in protein–protein docking (Schueler-Furman et al. 2005).
8. Lessons for computational biology
The practical and philosophical principles of experimental science evolved over hundreds of years, and have resulted in a system that ensures rigor and reproducibility. Experience in computational studies of protein structure suggests that these principles are not sufficient for computational modelling in biology. The fundamental difference is that modelling does not deal directly with the real world, instead creating some form of artificial reality. Additional steps are necessary to firmly establish the relationship between the artificial and real worlds. These steps are of two types. First, proper and appropriate statistical procedures must be used. In this respect, the computational biology field has become increasingly technically sophisticated in recent years. Second, care must be taken that the model does indeed represent the real world in all relevant respects. This latter issue has received less attention. The procedures outlined below, if widely adopted, will put computer modelling in biology on a par with the experimental work.
Bona fide predictions of experimental observations. Wherever possible, this mechanism should be used, rather than reproduction of known facts. Implementation requires that new experimental data be available on an appropriate time scale. CASP makes use of the high rate of release of new experimental structures, particularly those generated in structural genomics (http://www.nigms.nih.gov/psi/). CAPRI, a community protein–protein docking experiment (Janin 2005), makes use of new structures of complexes.
Bona fide prediction on test sets derived through human analysis. In areas where new experimental data cannot be used, it is sometimes possible to generate special test sets for bona fide prediction. This mechanism has been applied to genome sequence analysis (Reese et al. 2000). Human annotators examine a large set of data (genome sequence in this example), providing material that computational methods are then tested against.
Large test sets. Where reproduction of known information is the basis for testing, a large body of data produces more robust evaluation. Large test sets were rare in the early history of structure modelling. When they were used, for example in some cases of secondary structure prediction (Rost & Sander 1993), the results were reliable. The LiveBench system (Rychlewski & Fischer 2005) for evaluating protein structure modelling successfully incorporates this principle, encouraging participants to produce models of all newly released experimental structures, and so accumulating large amounts of data.
Community agreed test sets. These can be developed in almost all areas of biological modelling. In CASP, participants agree to produce models of the same proteins, making methods comparison much easier. A more general example in the structure modelling field is the collection of decoy sets for testing protein structure discrimination methods, developed by a number of groups (http://dd.compbio.washington.edu/).
Independence of training and test sets. Parameterizing a method on the same data used for its evaluation will often lead to overestimates of accuracy, particularly where machine learning is employed. The principle of separate training and test sets is well established in statistics. It is appreciated in computational biology, but so far not always adhered to in practice.
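In sequence-based work, one way to keep the two sets independent is to split the data by protein family rather than by individual protein, so that close homologues never appear on both sides of the split. The following is a minimal sketch, assuming each protein already carries a family label. The identifiers, family labels and the function `split_by_family` are hypothetical.

```python
import random

def split_by_family(family_of, test_fraction=0.25, seed=0):
    """Split protein IDs into (train, test) so that no family spans both sets.

    family_of     -- dict mapping protein ID -> family label
    test_fraction -- approximate fraction of families (not proteins) held out
    """
    families = sorted(set(family_of.values()))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = sorted(p for p, f in family_of.items() if f not in test_families)
    test = sorted(p for p, f in family_of.items() if f in test_families)
    return train, test

# Toy data: six proteins in four families.
family_of = {"p1": "kinase", "p2": "kinase", "p3": "globin",
             "p4": "globin", "p5": "protease", "p6": "lectin"}
train, test = split_by_family(family_of)
```

Holding out whole families means the test set measures performance on proteins unlike any seen during parameterization, which is closer to the blind prediction setting.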
Error estimates. In experimental science, provision of uncertainties in any measured quantity is considered mandatory. In computational work, including structure modelling, this is so far rare. There are striking exceptions, such as the establishment of reliability estimates for interpreting DNA gels (Ewing & Green 1998). In this case, a reliability estimate played a critical role in developing high throughput sequence methods.
Independent tests of accuracy. All accuracy evaluation procedures have biases, so independent validation should be performed whenever possible. For example, when two unrelated methods have been developed, it is possible to validate by comparison. The specificity of the two methods predicts the fraction of cases where the two methods should agree, and the sensitivity predicts the expected fraction of all cases where at least one method should be correct.
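A simple worked version of this comparison is sketched below. It rests on strong assumptions that are not stated in the text: each case has a binary outcome, the errors of the two methods are statistically independent, and each method is summarized by a single probability of being correct rather than by separate sensitivity and specificity. The function names are illustrative.

```python
def expected_agreement(acc_a, acc_b):
    """Expected fraction of cases where two independent binary predictors agree.

    They agree when both are correct or both are wrong.
    """
    return acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

def expected_at_least_one_correct(acc_a, acc_b):
    """Expected fraction of cases where at least one predictor is correct."""
    return 1 - (1 - acc_a) * (1 - acc_b)

# Assumed accuracies of 0.9 and 0.8, chosen only for illustration.
agree = expected_agreement(0.9, 0.8)                   # about 0.74
either = expected_at_least_one_correct(0.9, 0.8)       # about 0.98
```

If the observed agreement between two methods differs markedly from the value expected from their claimed accuracies, either the accuracy estimates are off or the methods share a bias and are not truly independent.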
Open results. All data associated with a method should be released, including full evaluation details and results, rather than just summaries. Ease of distributing information electronically has made this a practical procedure.
Open software. In experimental science, the principle of providing sufficient information to reproduce results has long been accepted and broadly adhered to. The equivalent in computational science includes release of software. There is considerable resistance to this, and it has not so far been possible in CASP. The primary reasons for non-release are protection of intellectual property and trade secrets, the resource commitment required to make software robust enough for distribution, and the dangers of abuse (unacknowledged use, or incorrect use leading to substandard results). These may be legitimate concerns, but without software in some form, it is impossible to rigorously check the performance of a method, and there is massive duplication of effort.
CASP is made possible by the participation of the prediction community, the generosity of the experimental community in making new structure information available, and the work of the assessment teams and the organizers. Details of the large number of people involved are available on the CASP web site (predictioncenter.org). CASP has been supported by grants from the National Library of Medicine (LM07085 to K. Fidelis), NIH R13GM/DK61967 (to J.M.) and R13GM072354 (to B. Rost).
One contribution of 15 to a Discussion Meeting Issue ‘Bioinformatics: from molecules to systems’.
© 2006 The Royal Society