In eukaryotic cells, protein phosphorylation is an important and widespread mechanism used to regulate protein function. Yet, of the thousands of phosphosites identified to date, only a few hundred at best have a characterized function. It was recently shown that these functional sites are significantly more conserved than phosphosites of unknown function, stressing the importance of considering evolutionary conservation in assessing the global functional landscape of phosphosites. This leads us to review studies that examined the impact of phosphorylation on evolutionary conservation. While all these studies have shown that conservation is greater among phosphorylated sites compared with non-phosphorylated ones, the magnitude of this difference varies greatly. Further, not all studies have considered key factors that may influence the rate of phosphosite evolution. Such key factors are their localization in ordered or disordered regions, their stoichiometry or the abundance of their corresponding protein. Here we take into account all of these factors simultaneously, which reveals remarkable evolutionary patterns. First, while it is well established that protein conservation increases with abundance, we show that phosphosites partly follow an opposite trend. More precisely, Saccharomyces cerevisiae phosphosites present among abundant proteins are 1.5 times more likely to diverge in the closely related species Saccharomyces bayanus when compared with phosphosites present in the 5 per cent least abundant proteins. Second, we show that conservation is coupled to stoichiometry, whereby sites frequently phosphorylated are more conserved than those rarely phosphorylated. Finally, we provide a model of functional and noisy or ‘accidental’ phosphorylation that explains these observations.
(a) Phosphorylation as a widespread and important mechanism to regulate protein function
Proteins are the workhorses of the cell and their activity is regulated at multiple levels . Among the best-described modes of protein regulation are the post-translational modifications of specific residues that affect their stability, localization, activity and ability to interact. The most studied and arguably one of the most abundant types of post-translational modifications is protein phosphorylation, as reflected by the large number of genes that encode kinases and phosphatases. For example, in humans, about 700 proteins play a direct role in phosphorylating and dephosphorylating other proteins [2,3].
Because of the importance of protein phosphorylation in regulating protein function, significant efforts have recently been dedicated to probe, map and catalogue phosphoproteomes [4–24]. These studies have revealed several important characteristics about phosphoproteomes, including that (i) protein phosphorylation is common: at least 30 per cent of proteins have been found to be phosphorylated on at least one residue in organisms such as the budding yeast (2000 out of 6000 proteins ); (ii) some proteins are heavily phosphorylated, with more than 30 sites for some proteins ; (iii) phosphorylation is dynamic across growth conditions [4,10,17–20]. This deepening knowledge of the importance of protein phosphorylation has sparked a profound interest regarding the role of this type of regulation in evolution and whether changes in protein phosphorylation could play key roles during evolution of cellular and organismal phenotypes .
(b) A consensus view of the evolution of phosphorylation sites
Large-scale mass-spectrometry studies of protein phosphorylation in several model organisms have also enabled the study of the rate at which this post-translational modification evolves. Despite the fact that many studies have used different species as the basis of their comparisons and have used different metrics to estimate the rate at which phosphorylated residues evolve, a consensus emerges from these investigations (table 1). First, most studies (e.g. [15,30,32] including Landry et al. ), show that phosphorylation sites (pS/pT) are on average more conserved than non-phosphorylated equivalent residues (S/T). Second, not all pS/pT show signs of purifying selection and the data are consistent with a significant proportion of pS/pT evolving at a rate that is indistinguishable from that of equivalent non-phosphorylated S/T [25,32,34,35]. Finally, sites that are known to be functional (associated with a phenotype when mutated) are significantly more conserved than sites with no characterized function [25,28,31,33]. Landry et al.  and others [28,36] have concluded from these results that a significant fraction of sites identified in large-scale studies have little to no functional role and could result from the random and neutral birth and death of sites that kinases can phosphorylate by accident (off-target activity).
From a practical perspective, among the thousands of phosphosites commonly observed in phosphoproteomic studies, biologists need to make educated predictions about which sites are most likely to be functionally important. As described already, phosphosite conservation is a good indicator of functional importance. However, we know that factors such as protein disorder and protein abundance are tightly coupled to each other  as well as to varying degrees of protein conservation [37,38]. Therefore, we anticipate that the successful identification of phosphosites most likely to be functional requires taking into account a more global cellular context for each site. This leads us to discuss and investigate the existence of phosphosites with little functional relevance and their relationship to global cellular parameters, namely protein disorder, protein abundance and phosphosite stoichiometry.
(c) Organism fitness and non-functional phosphorylation are compatible with each other
From an evolutionary standpoint, it may be argued that off-target phosphorylation should be eliminated. However, such off-target activity would be eliminated by natural selection only to the extent that it is deleterious for the cell and that other important conditions that affect the efficiency of natural selection be met . In fact, off-target or ‘promiscuous’ phosphorylation might be functionally neutral but in the long run might contribute to the evolvability of signalling networks . These non-adaptive phosphorylations may represent a burden for investigators in terms of identifying key functional phosphosites in the cell but may nevertheless contribute to the evolution of biological network architecture and therefore deserve to be considered if we are to fully understand the evolution of signalling and protein interaction networks [39,41–44].
(d) Linking cellular crowding and random encounters to non-functional phosphorylation
When considering that random encounters may result in non-functional phosphorylation of S/T, one key factor to consider is protein abundance. In cells, protein abundances span several orders of magnitude. In yeast for example, while some proteins are present in only a few dozen copies, others are synthesized in over a million copies . Therefore, competition is severe between sites phosphorylated in the context of a specific function and other sites for which phosphorylation has little or no functional consequence [39,46–49]. For example, let us consider protein X, whose average abundance is approximately 1000 copies per cell, and whose function requires a phosphorylation at a particular position. Knowing that a yeast cell contains an estimated total of 50 000 000 protein molecules , how likely is it for a kinase to be able to specifically phosphorylate all the copies of ‘protein X’ without any off-target activity among the 49 990 000 other proteins present in the cell?
Answering this question requires considering two opposing forces: specificity mechanisms and the randomness of cellular encounters. On the one hand, protein X might be phosphorylated without any off-target activity thanks to mechanisms that increase the specificity of kinase–protein interactions and phosphorylation . This is for instance achieved by subcellular organization of kinase and phosphatase complexes with their substrates via scaffold and adaptor proteins [52,53]. On the other hand, the kinase(s) responsible for the phosphorylation of X may also randomly encounter other proteins within its subcellular compartment and, further, many kinases are active in more than one compartment . In the latter scenario, the probability of random encounters is expected to be proportional to the abundance of proteins present in the same cellular compartment. Considering chance alone, it is indeed significantly more likely to encounter a protein present in a million copies than the rare protein X. In addition, the random nature of these encounters and low probability that they result in phosphorylation imply that only a small fraction of a protein population should be phosphorylated by chance. These concepts are summarized in figure 1.
In sum, if protein phosphorylation is influenced by the randomness of cellular encounters, we expect two factors to deserve attention: abundance and stoichiometry, where abundance refers to the total copy number per cell of a given protein, and stoichiometry to the fraction of the copies that are phosphorylated. Considering these two factors, we formulate the following predictions regarding non-functional phosphorylation sites:
(1) If protein abundance increases the probability of random encounters with protein kinases and therefore of off-target phosphorylation (figure 1), we predict that highly abundant proteins are enriched in phosphorylation sites.
(2) Since off-target phosphorylations exhibit no specific function, conservation of phosphosites on highly abundant proteins should be equivalent to that of non-phosphorylated sites. We thus predict that the ratio C(pS|pT)/C(S|T) decreases as protein abundance increases (where C denotes conservation, pS|pT phosphorylated serines and threonines and S|T non-phosphorylated equivalent residues).
(3) Off-target phosphorylation should only affect a small fraction of a protein population, i.e. the stoichiometry at these sites should be low owing to the random nature of the encounters and to the low probability that phosphorylation occurs. We thus predict that highly abundant proteins are enriched in phosphorylation sites of low stoichiometry.
(4) Following prediction 3, we predict that the ratio C(pS|pT)/C(S|T) increases with stoichiometry.
We investigate these predictions in the upcoming sections.
2. Results and discussion
(a) Highly abundant proteins are enriched in phosphorylation sites
We first ask whether abundant proteins are more frequently subject to phosphorylation than those with lower abundances (figure 1). For this, we used the dataset of 6092 phosphosites assembled in Landry et al.  (see §4) and divide proteins into 10 classes of increasing abundance (0–10%, 10–20%, etc. upto 90–100%; i.e. dataset percentiles). In the least abundant class, 22 per cent of proteins are detected as being phosphorylated on at least one position in a predicted disordered region—we denote this Plow+diso = 0.22. This proportion increases among higher abundance proteins, although it does not exceed 40 per cent: Phigh+diso ∼ 0.35–0.4). Therefore, the ratio Phigh+diso/Plow+diso increases with abundance but remains below two. When considering ordered regions, this ratio increases even more, as less than 10 per cent of low-abundance proteins contain at least one known phosphorylation site in an ordered region (Plow+ord < 0.1), while almost 50 per cent of proteins in the most abundant class do so (Phigh+ord ∼ 0.5)—therefore the ratio Phigh+ord/Plow+ord is close to five. The weaker contrast observed in disordered regions is due to a combination of the following opposing trends: (i) most phosphosites are in disordered regions, but (ii) abundant proteins contain few disordered regions. We thus control for these effects in figure 2b as well as in figure 3, as described next.
First, we move from a protein perspective to an amino acid perspective and calculate the percentage of all S/T that are phosphorylated among proteins sampled in a given abundance class. We find that the amino acid perspective significantly increases the contrast between abundance classes. While approximately 1 per cent of S/T residues in disordered regions are phosphorylated among low-abundance proteins, this fraction increases to over approximately 7 per cent among highly abundant proteins. Similarly, in ordered regions, approximately 0.2 per cent of S/T residues are known to be phosphorylated among low-abundance proteins versus 2.25 per cent among highly abundant proteins. Second, we create nine categories of proteins according to their abundance and proportion of disordered residues (figure 3), and calculate the ratio of observed to expected phosphorylation sites in each category. This reveals that highly abundant disordered proteins contain more than three times the number of phosphosites expected by chance. In contrast, proteins of low abundance and with little disorder contain less than three times the number of sites expected by chance. A combination of disorder and high expression therefore results in an increased probability of being phosphorylated.
Together, these results show that serines and threonines do not all have the same probability of being phosphorylated. It is known that ordered and disordered regions are associated with different probabilities of S/T sites being phosphorylated [25,55,56]. Here we see that the impact of protein abundance on these probabilities is also strong, and in fact it acts in synergy with disorder (figure 3).
Although this observation is consistent with our first prediction, the trends observed can also be explained by detection methods (i.e. frequent peptides might more likely be detected than rare peptides, despite the fact that phosphopeptide enrichment strategies are employed in these studies). Until methods that can circumvent this limitation are developed, it will be difficult to confirm this observation. This prompts us to probe an additional aspect of phosphorylation sites: their evolutionary conservation.
(b) Conservation of phosphosites on highly abundant proteins is similar to that of non-phosphorylated equivalent sites
To perform the evolutionary analyses described later, we compared aligned protein sequences from Saccharomyces cerevisiae and Saccharomyces bayanus (see §4). These two species have diverged for about 20 Myr and provide a good comparison because they have the same gene content, their orthology calls are unambiguous and yet their level of nucleotide divergence is comparable to that of human and mouse (62 versus 66% nucleotide identity in aligned positions ). This limited divergence also reduces alignment problems, which can be problematic in species that have diverged for longer periods of time, especially in disordered regions of proteins where the majority of phosphosites are found.
It was shown over a decade ago  and confirmed in several studies since then [38,59] that highly abundant proteins evolve slower than those of lower abundance. The comparison of S. cerevisiae and S. bayanus phosphorylatable residues (S/T) provides the same result (figure 4). In the least abundant half of the proteome, and considering disordered regions first, less than 70 per cent of S and T are conserved on average in S. bayanus (with an S or a T). This fraction increases to 90 per cent when considering the 5 per cent most abundant proteins. However, we observe an opposite trend for pS and pT sites: the conservation in the [0–5%] abundance class is equal to 86 per cent, and decreases to 77 per cent in the [60–80%] abundance class. Therefore, the ratio C(pS|pT)/C(S|T) is highest among low-abundance proteins, decreasing progressively until C(pS|pT)/C(S|T) approaches approximately one among highly expressed proteins (figure 4b). In ordered regions, though it is much weaker, the global trend is identical. pS/pT sites are thus significantly more conserved than equivalent non-phosphorylated S/T in low-abundance proteins, but not in highly expressed proteins.
These observations are compatible with prediction 2, where a significant proportion of phosphorylation events detected among highly abundant proteins results from off-target kinase activity, while events detected among low-abundance proteins are enriched for important functions. The presence of non-functional phosphorylation sites in abundant proteins is further supported by the comparison with the proportion of conserved sites among a set of curated and arguably functional phosphosites obtained from  (green circle in figure 4a: ndiso = 159; nordered = 33). Conservation for these sites is indeed much higher than what is seen in large-scale phosphoproteomic studies. It has also been hypothesized  and shown  that an overall weak conservation of phosphosites can be explained by compensatory mechanisms. Cases are indeed known where the position of a phosphosite changes [60–62] although the functional mechanisms associated with it remain conserved. Importantly, we do not expect the earlier-mentioned observation to be affected by this evolutionary scenario, because such compensation (and associated lower conservation) is expected to increase among low-abundance proteins (because these proteins evolve faster and are richer in disordered regions, when compared with highly abundant proteins), and we observe the opposite trend.
(c) Highly abundant proteins are enriched in phosphorylation sites of low stoichiometry
In the scenario where highly abundant proteins are off-targets for phosphorylation, we do not expect the corresponding phosphosites to exhibit high stoichiometries. In other words, off-target phosphorylation should affect only a small fraction of an entire protein population. Estimates of phosphosite stoichiometry were recently published  and allow us to investigate this hypothesis.
As previously described, we split proteins into classes of increasing abundance, but now measure the mean stoichiometry of their corresponding phosphosites (figure 5a). This analysis reveals a clear relationship between these two variables, with phosphosites on low-abundance proteins having an average stoichiometry close to or more than 50 per cent (for both ordered and disordered regions), while for those on highly abundant proteins the average stoichiometry is down to 15 per cent (disordered regions) and 8 per cent (ordered regions). If we now look at the opposite picture and measure abundance as a function of stoichiometry (figure 5b), we observe a similar trend, where sites of low stoichiometry are more frequently observed on abundant proteins than sites with high stoichiometry. The figure also highlights that this effect is concentrated on sites of very low stoichiometry (]0–10%]), and remains comparatively moderate otherwise (green area). This confirms prediction 3, i.e. that highly abundant proteins are enriched in phosphorylation sites of low stoichiometry, as would be expected from off-target phosphorylation.
(i) Phosphosites with high stoichiometry are more conserved than those with low stoichiometry
We have previously observed that phosphosites with high stoichiometry are more conserved than other phosphosites with lower stoichiometry detected on the same peptide . Here, the data from Wu et al.  provide relative stoichiometry information at the proteome scale, which allows us to compare stoichiometry values across different peptides. We thus integrate stoichiometry and evolutionary information to ask whether higher stoichiometry of phosphorylation is associated with higher site conservation. Importantly however, answering this question is difficult, as it involves two confounding trends:
— low stoichiometry is strongly associated with high abundance,
— protein conservation is also strongly coupled to abundance, with high abundance equating with high global conservation.
Therefore, in order to investigate the relationship between stoichiometry and conservation, we focus on the range highlighted in green in figure 5, where the coupling between stoichiometry and abundance remains moderate.
We plot the average phosphosite conservation for three stoichiometry classes: ]10–30%], ]30–80%] and ]80–100%] (figure 6). This is shown with orange diamonds, and grey diamonds indicate the conservation of equivalent but non-phosphorylated sites. The black line, shows the ratio C(pS|pT)/C(S|T) as in figure 4b. For both ordered and disordered regions, the ratio increases from the low to high-stoichiometry classes, which is consistent with prediction 4, i.e. that low-stoichiometry sites are enriched in non-functional protein phosphorylation.
In eukaryotic cells, a large fraction of proteins are post-translationally modified, and each of these modifications has the potential to affect their function. The ever-increasing throughput, sensitivity and accuracy of mass spectrometers together with the development of new proteomics strategies [6–8] means that these post-translational modifications can be identified on a large scale in an unbiased fashion with respect to the function (or absence thereof) of these modifications. Further, protein and phosphopeptide enrichment strategies are ever-improving the overall sensitivity of mass-spectrometry-based phosphoproteomics. For instance, a recent study on the phosphoproteome of purified yeast centromeres revealed many phosphosites previously not observed in whole-cell studies . Such subcellular analysis of compartments and protein complexes should reveal even more sites, among both low- and highly abundant proteins having residues that are only phosphorylated within specific complexes or under specific circumstances. Probing cellular functions and responses on a large-scale may ultimately lead to our understanding of cells as a system. However, this task has proved difficult. Crucial to our success is the need to consider the contribution of non-adaptive forces (mutation and genetic drift) and biophysical constraints in shaping cellular networks . Such forces are indeed likely to give birth to non-functional elements [39,41]. Because genomic elements of prime importance are expected to be conserved over longer evolutionary times than non-functional elements, we can use evolutionary comparisons to focus on the most conserved and thus most likely functional phosphosites. In the context of protein phosphorylation, it has indeed been shown that phosphorylation sites of known function are significantly more conserved than those of unknown function.
Above and beyond the identification of functional phosphosites, it is of paramount importance to understand what might influence the occurrence of non-functional phosphorylation events in the cell. Here we proposed a simple model that explains global patterns of non-functional phosphorylation (figure 1). Our model proposes that an important factor to consider is protein abundance. Kinases are indeed more likely to randomly encounter abundant proteins rather than rare proteins, as reflected in the fact that abundant proteins are enriched in phosphorylation sites (hypothesis 1, figures 2, 3 and 7a). Accordingly, evolutionary data showed that abundant proteins are enriched in non-functional phosphosites, and that low-abundance proteins are enriched in functional ones (hypothesis 2, figure 4). Consistently also, abundant proteins are enriched in low-stoichiometry sites (hypothesis 3, figure 5), and when abundance is taken into account, phosphosites with higher stoichiometry are more conserved (and arguably more functional) than those with lower stoichiometry (hypothesis 4, figures 6 and 7b). Importantly, it is possible that current technical limitations contribute to the signals observed with respect to hypotheses 1 and 3, i.e. it might be more likely to detect peptides from highly abundant proteins than peptides from low-abundance proteins, which would result in an increased probability of highly abundant proteins to appear phosphorylated. This could also result in an apparent increased probability for highly abundant proteins to contain low-stoichiometry sites. Yet, we do not expect our results to be a simple consequence of such technical limitations because hypotheses 2 and 4 are based on evolutionary data, which is independent from such a potential bias, and which nevertheless corroborate hypotheses 1 and 3. Future technological developments will be needed to unambiguously address these questions.
Many mechanisms that contribute to the specificity of protein kinases have been described , and our results do not question the importance of these mechanisms for cellular functions. In fact, without these mechanisms, phosphorylation would certainly be much less efficient in regulating specific protein functions. However, it is likely that these mechanisms could not be optimized to a point where they can compensate for the three to four orders of magnitude differences we see in protein abundances and other effects not investigated here such as cellular crowding in specific compartments. One could argue that many more complex mechanisms could evolve to optimize these specificities further but they may actually represent a higher cost than the cost imposed by non-functional phosphorylation. There may also be a tradeoff between the pace at which protein kinases can phosphorylate their substrates (reactivity) and their specificity. Optimization of one parameter would inadvertently lead to reduction in the other and thus to a cost that counterbalances the benefits. Ultimately, measuring the cost of non-functional phosphorylation may be key to our full understanding of why it may be so frequent in the cell.
(a) Proteins: alignments, conservation, disorder and abundance
Protein sequences and alignments were identical to those used in Landry et al. , and were originally obtained from Wapinski et al. . However, we extracted the couple S. cerevisiae–S. bayanus from these alignments. A serine or threonine was considered conserved in S. cerevisiae (value = 1) if the corresponding amino acid in S. bayanus was also a serine or a threonine, and considered non-conserved (value = 0) if it was a non-phosphorylatable amino acid, a tyrosine or an indel. Disorder was obtained with DISOPRED , as described in Landry et al. . Protein abundances were obtained from the Pax-db database . This database provides abundances in arbitrary units. In order to relate these to protein copy numbers per cell, we use the copy numbers estimated in Ghaemmaghami et al. . Linear regression between the data from Pax-db and those from Ghaemmaghami et al.  results in the following relationship: log10(Pax) = −1.848 × log10(Ghaemmaghami). We thus multiply all values from Pax-db by 101.848 = 70.46. This is not intended to provide an accurate estimate and we only use it to obtain an approximation. Importantly, our analyses are based on relative abundances and thus do not depend on these absolute copy numbers. For figures 2, 4 and 5a, we indeed bin proteins into abundance classes, which represent percentiles of the entire dataset of proteins present in Pax-db.
(b) Phosphorylation sites
Phosphorylation sites were identical to those used in Landry et al.  and were originally compiled from the following studies: Ficarro et al. , Gruhler et al. , Reinders et al. , Chi et al. , Li et al. , Smolka et al.  and Albuquerque et al. . Note that we did not consider tyrosine phosphorylation in this study. The aggregation of these datasets yields 6092 phosphosites for which the conservation can be calculated. This dataset is used for figures 2–4.
(c) Phosphorylation stoichiometry
Phosphorylation site stoichiometry was obtained from Wu et al. , providing 4754 sites for which we have both stoichiometry information and conservation information. Note that we did not consider sites for which the stoichiometry information was ambiguous—that is, we considered sites annotated only by Wu et al. as being on ‘singly phosphorylated’ peptides. This yields a dataset of 3803 sites for which we have both stoichiometry information as well as conservation information. This dataset is used for figures 5 and 6.
(d) Comparing phosphorylated sites with equivalent non-phosphorylated sites
In order to compare the evolutionary conservation between phosphorylated and non-phosphorylated sites, it is important to consider ‘equivalent sites’, which we define as:
— present on the same proteins as those considered for phosphorylated ones,
— found within 10 amino acids of a known phosphorylation site, and
— found in the same structural region (ordered and disordered regions were never compared with each other).
(e) Confidence intervals
Evolutionary conservation is given in the form of zeros and ones. Given a set of sites, we calculate the mean conservation by the mean of the values. A 95% CI for the mean is approximated by the following formula derived from the central limit theorem: CI = 1.96 × ((p1 × (1 – p1))/(N/p1))0.5, where p1 is the estimated mean and N the number of sites used for its calculation.
E.D.L. acknowledges the Human Frontier Science Program for support through a long-term postdoctoral fellowship. C.R.L. is a CIHR new investigator and this work is supported by a CIHR grant (GMX-191597). This work was supported in part by a CIHR grant (GMX-192838) to S.W.M. We thank Alan Moses, Pedro Beltrao and Tony Hunter for reviews of the manuscript, and Steven Gygi for helpful advice regarding the data on phosphosite stoichiometry.
One contribution of 13 to a Theme Issue ‘The evolution of protein phosphorylation’.
- This journal is © 2012 The Royal Society