Royal Society Publishing


The frequencies of alternative synonymous codons vary both among species and among genes from the same genome. These patterns have been inferred to reflect the action of natural selection. Here we evaluate this in bacteria. While intragenomic variation in many species is consistent with selection favouring translationally optimal codons, much of the variation among species appears to be due to biased patterns of mutation. The strength of selection on codon usage can be estimated by two different approaches. First, the extent of bias in favour of translationally optimal codons in highly expressed genes, compared to that in genes where selection is weak, reveals the long-term effectiveness of selection. Here we show that the strength of selected codon usage bias is highly correlated with bacterial growth rate, suggesting that selection has favoured translational efficiency. Second, the pattern of bias towards optimal codons at polymorphic sites reveals the ongoing action of selection. Using this approach we obtained results that were completely consistent with the first method; importantly, the frequency spectra of optimal codons at polymorphic sites were similar to those predicted under an equilibrium model. Highly expressed genes in Escherichia coli appear to be under continuing strong selection, whereas selection is very weak in genes expressed at low levels.

1. Introduction

When the genetic code was decrypted in the 1960s, it became apparent that most amino acids are encoded by multiple (two to six) codons, which typically differ only at the third nucleotide of the codon. With the introduction of DNA sequencing in the late 1970s, it emerged that these alternative synonymous codons are not used with equal frequencies. Two phenomena were soon apparent: patterns of codon usage vary among species (Grantham et al. 1980), and in the model bacterium Escherichia coli (for which most data were available), codon usage is more biased in genes expressed at higher levels (Post & Nomura 1980; Gouy & Gautier 1982; see table 1). Both phenomena were interpreted as reflecting the action of natural selection.

Table 1.

Codon usage in E. coli. Codon usage is compared between a set of 40 highly expressed genes (high; see Sharp et al. 2005) and the genome as a whole (all); the data are relative synonymous codon usage values (the ratio of the observed number to that expected if all codons for an amino acid were used equally). Nineteen codons occurring at significantly higher frequencies (see Henry & Sharp 2007) in the high dataset are shown in bold. The data are for E. coli strain K-12 MG1655 (accession number U00096).

The selective differences among synonymous codons reflect two aspects of the transfer RNA (tRNA) population present in the cell (Ikemura 1985). First, for some amino acids there are multiple species of tRNAs with different anticodons, and it is those codons translated by the most abundant tRNA species which are preferred in highly expressed genes. For example, there are five different Leu tRNAs in E. coli, but that with anticodon CAG is much more abundant than the others. This anticodon is complementary to the codon CUG, which is used nearly 20 times more often than any of the other five Leu codons in highly expressed genes (table 1). Second, many tRNAs can translate more than one codon, but with variable ability; the codon best recognized by the anticodon is preferred in highly expressed genes. For example, there is a single Phe tRNA in E. coli, with anticodon GAA, which translates both UUU and UUC; however, UUC is perfectly complementary to the anticodon, and is used about three times more often than UUU in highly expressed genes (table 1). Thus, from knowledge of the tRNA population it is possible to predict which codons are translationally optimal; namely, those that are best recognized by the most abundant tRNA species.
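The relative synonymous codon usage (RSCU) values reported in table 1 are defined in its caption as the observed count of a codon divided by the count expected if all synonyms for that amino acid were used equally. A minimal Python sketch of that calculation follows. The function name, the two-family codon table and the counts are illustrative assumptions; the counts are not taken from table 1.

```python
from collections import Counter

# Two synonymous codon families from the standard genetic code (RNA alphabet).
# Only these two are listed to keep the example short.
SYNONYMS = {
    "Phe": ["UUU", "UUC"],
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
}


def rscu(codons, synonyms=SYNONYMS):
    """Return {codon: RSCU} for every codon in the listed families."""
    counts = Counter(codons)
    values = {}
    for family in synonyms.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent, so RSCU is undefined
        expected = total / len(family)  # equal use of all synonyms
        for c in family:
            values[c] = counts[c] / expected
    return values


# Hypothetical codon counts, chosen only to make the arithmetic easy to follow.
example = ["UUC"] * 3 + ["UUU"] * 1 + ["CUG"] * 10 + ["CUU"] * 2
result = rscu(example)
```

By construction the RSCU values within a family sum to the number of synonyms, so a value above 1 marks a codon used more often than expected under equal usage.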

There has been much debate about exactly why translationally optimal codons are selected. The traditional view is that use of optimal codons increases the efficiency of translation (Ehrenberg & Kurland 1984; Andersson & Kurland 1990). Ribosomes constitute about two-thirds of the protein content of an E. coli cell when growing rapidly (Pedersen et al. 1978), and the abundance of ribosomes may be the main factor limiting growth rate. Optimal codons may be translated faster than non-optimal codons (Sørensen & Pedersen 1991), such that ribosomes move faster along an mRNA containing more optimal codons, and the ribosomes are more quickly released to be available to translate other mRNAs. Thus, use of optimal codons, especially in genes expressed at high levels encoding mRNAs that must be translated more often, allows more efficient use of ribosomes and leads to faster growth rate (Kudla et al. 2009), conferring an obvious selective advantage, at least in bacteria occupying certain niches.

An alternative view is that the use of optimal codons increases the accuracy of translation. Sites where the identity of the amino acid is more critical for protein function are expected to be more conserved across species, and also expected to be the sites where accuracy of translation is more important. The fruitfly, Drosophila melanogaster, exhibits stronger codon usage bias in more highly expressed genes, analogous to the situation in E. coli (Shields et al. 1988; Duret & Mouchiroud 1999), and it was found that codons for conserved amino acids have stronger codon bias in D. melanogaster (Akashi 1994). This accuracy hypothesis has the potential to explain the observation, otherwise surprising, that rates of non-synonymous and synonymous nucleotide substitution are correlated across genes in comparisons between E. coli and its close relative Salmonella enterica (Sharp 1991). Based on a variety of observations, some authors have concluded that translational accuracy is the primary object of codon selection in E. coli (Stoletzki & Eyre-Walker 2006), and indeed the dominant constraint on gene sequence evolution across both bacteria and eukaryotes (Drummond & Wilke 2008).

In this article we will focus on bacteria, examining the extent to which natural selection is responsible for the variations in codon usage seen among species and within genomes. In particular, we will contrast the results of two different approaches to estimating the strength of selection on codon usage bias. Finally, we will discuss the implications of the results, including their relevance to the efficiency versus accuracy debate introduced above.

2. Variation in codon usage bias among bacteria

Analyses of bacteria other than E. coli have revealed that codon usage patterns vary among species in a number of ways. Most of the differences appear to be due, ultimately, to variations in mutation biases. First, it has been known for half a century that base composition, summarized by G + C content in double-stranded DNA, varies greatly among bacteria (Belozersky & Spirin 1958). Among published bacterial genome sequences, values of G + C content range from 17 per cent in Carsonella ruddii (Nakabachi et al. 2006) to 73 per cent in Frankia alni (Normand et al. 2007). This variation has long been viewed as the primary influence on codon usage differences between species of bacteria (Bibb et al. 1984; Muto & Osawa 1987). This has been confirmed by multivariate analyses comparing total genomic codon usage among bacteria, which showed that the single most important source of variation is G + C content (Lynn et al. 2002; Chen et al. 2004). It has often been speculated that this variation reflects the action of selection. In particular, it has been suggested that there would be pressure on thermophilic bacteria to have more G + C-rich genomes, because such genomes would be more thermostable (Bernardi & Bernardi 1986; Musto et al. 2004). However, most analyses have failed to find any correlation between growth temperature and genomic G + C content (e.g. Galtier & Lobry 1997; Lynn et al. 2002). Overall, the variation in G + C content is most simply explained by subtle but persistent mutation biases (Sueoka 1962).

Second, genome sequencing has revealed that in many bacteria base composition varies systematically between the leading and lagging strands of replication, with the leading strand being more G + T-rich (Lobry 1996; McLean et al. 1998). This strand-specific bias impacts codon usage, but the strength of the effect varies considerably among species. In the spirochaetes Borrelia burgdorferi (the cause of Lyme disease) and Treponema pallidum (the cause of syphilis), strand-specific bias dominates codon usage variation among genes (Lafay et al. 1999); in other species the effect is much weaker, or undetectable (Kloster & Tang 2008). The source of this strand-specific bias has been debated, but the predominant ideas concern mutation biases. The leading and lagging strands are replicated by different mechanisms with different mutation rates (Fijalkowska et al. 1998), which could lead to the observed differences in base composition. Alternatively, since there is an excess of genes located on the leading strand in many bacteria (Brewer 1988; Tillier & Collins 2000), biases in transcription-coupled repair could lead to a skew between the strands in nucleotide composition (Francino et al. 1996).

Third, for some amino acids, the identity of the translationally optimal codon varies among species. For example, in Clostridium perfringens, the codons heavily used in highly expressed genes (Musto et al. 2003) differ from those in E. coli (table 1) for six amino acids. These differences are correlated with changes in tRNA populations. While tRNA abundances have been measured for very few species, it is known that tRNA abundance is correlated with tRNA gene copy number (Kanaya et al. 1999), and so the latter may be used to predict the most abundant tRNAs. In the E. coli genome, where there are eight genes encoding five different Leu tRNAs, four genes encode the tRNA with the CAG anticodon (mentioned above as being the most abundant Leu tRNA species in E. coli). The C. perfringens genome also contains eight Leu tRNA genes (for four different tRNAs), but four encode the tRNA with anticodon UAA; the heavily used Leu codon is UUA, perfectly complementary to this predicted most abundant tRNA. Thus, there is co-adaptation between the codon usage of highly expressed genes and the tRNA population in both species, but the identity of the co-adapted state differs. Exactly how this divergence can occur is unclear, but it has been hypothesized that it could be driven by pressure from biased mutation patterns (Shields 1990).

Fourth, not all bacterial species exhibit the same clear trend in codon usage patterns associated with gene expression level. For example, in Helicobacter pylori (a bacterium that causes stomach ulcers), there is at most a very minor difference in codon usage between highly expressed and other genes (Lafay et al. 2000; Kloster & Tang 2008), while in B. burgdorferi most of the highly expressed genes are located on the leading strand of replication, and have G + T-rich codon usage that does not differ from other genes on that strand (Lafay et al. 1999). This difference among species most likely reflects variation in the extent to which natural selection is effective in shaping codon usage; this is the subject of the next two sections.

3. Variation in the strength of selected codon usage bias among bacteria

We have previously examined the strength of selected codon usage bias in 80 distinct bacterial species with genome sequences available (Sharp et al. 2005). To quantify the strength of selected codon usage bias, we modified a population genetic model (Bulmer 1991). The strength of past selection on codon usage can be estimated from the frequency of optimal codons in a gene, if the expected frequency of those codons in the absence of selection is known. Since, for some amino acids, the identity of the optimal codon varies among species, we focused on four amino acids where it is expected that the same codon would always be favoured by selection. For example, the only Phe tRNA genes known across bacteria have GAA at the anticodon site, and so UUC is always expected to be favoured over UUU, when selection is effective. Similarly, for Tyr, Asn and Ile, G at the critical position of the anticodon should always lead to preference for the C-ending rather than the U-ending codon. To determine the frequency of optimal codons in genes potentially under strong selection, we examined a standard set of 40 highly expressed genes (encoding translation elongation factors and ribosomal proteins) found in all bacterial species; these genes encode proteins with around 10⁴–10⁵ copies in the E. coli cell (Ishihama et al. 2008). Defining an analogous set of genes that are present, and expressed at low levels, in all bacteria is more difficult, so we used the codon usage of the genome as a whole as an estimate of the pattern of codon usage when selection is weak; this can be justified because only a minority of genes within a genome are highly expressed. Comparison of codon usage between these two datasets yields an estimate of the compound parameter S = 2Nₑs, where Nₑ is the effective population size and s is the selective difference between optimal and non-optimal codons. Thus, S might vary among species because there have been differences in either their population sizes or the strength of selection.
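The logic of comparing the two datasets can be sketched for a single two-codon amino acid. The sketch assumes a two-allele mutation-selection-drift equilibrium in which the odds of the optimal codon equal the mutational odds multiplied by e^S, and it assumes that the genome-wide codon counts estimate the mutational odds. Under those assumptions S is the log of the ratio of the two odds. The function name and the counts are hypothetical, and this is not necessarily the exact procedure of Sharp et al. (2005), which combined information from four amino acids.

```python
import math


def estimate_S(opt_high, nonopt_high, opt_all, nonopt_all):
    """Log odds ratio of optimal-codon use: highly expressed genes vs genome.

    Assumes odds_high = k * exp(S), with the mutational odds k estimated
    by the genome-wide odds (selection taken to be weak in most genes).
    """
    odds_high = opt_high / nonopt_high
    odds_all = opt_all / nonopt_all
    return math.log(odds_high / odds_all)


# Hypothetical counts of UUC (optimal) and UUU (non-optimal) for Phe.
S_example = estimate_S(opt_high=300, nonopt_high=100,
                       opt_all=5000, nonopt_all=5000)
# Identical odds in both datasets give S = 0, i.e. no evidence of selection.
S_null = estimate_S(opt_high=50, nonopt_high=100,
                    opt_all=2000, nonopt_all=4000)
```

Because the genome-wide odds absorb the mutation bias, a G + C-rich and an A + T-rich species can be compared on the same scale under these assumptions.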

Application of this approach to 80 bacterial genomes revealed considerable variation among species (Sharp et al. 2005). The S value for E. coli was 1.49. In 24 species (30% of the total), including H. pylori, the S value was not significantly greater than zero, providing no evidence for selected codon usage bias. Thirty species (37.5%) had S values greater than 1, with the highest value (2.65) seen in Clostridium perfringens, a widespread bacterium that causes a variety of diseases but is most famous as a ‘flesh-eating bug’. The 80 species examined included variable numbers of representatives from 14 different major lineages (phyla) of bacteria. There was clear phylogenetic clustering of species with high or low S values, but species with strongly selected codon usage bias occurred in several different phyla. Of the 20 species from the gamma proteobacteria (which includes E. coli), nine had S values greater than 1. These nine species form a clade together with a lineage composed of four species with low S values (figure 1). Thus, it appears that strongly selected codon usage bias evolved on the branch leading to this clade (which includes the orders Enterobacteriales, Pasteurellales, Vibrionales and Alteromonadales) and was subsequently lost on the lineage including Buchnera species and Wigglesworthia. Buchnera species and Wigglesworthia are endosymbionts of insects, which have undergone genome reduction and apparently a general relaxation of genomic selection pressures owing to reduced effective population sizes (Moran & Wernegreen 2000; Wernegreen & Funk 2004); one symptom of this is their long branch lengths in the evolutionary tree (figure 1), reflecting an increased rate of molecular evolution.

Figure 1.

Variation in the strength of selected codon usage bias (S) in gamma proteobacteria. Species are denoted by their genus names, except where there are multiple species from a genus; the abbreviated genus names are Vibrio, Pseudomonas and Xanthomonas. The three Buchnera strains are species infecting different aphid hosts. Shaded ovals next to species names indicate the magnitude of S: white (S < 0.2), grey (0.2 < S < 1.0), black (S > 1.0). Phylogenetic relationships and S values were taken from Sharp et al. (2005). Note that the clustering of Wigglesworthia with Buchnera species may be a phylogenetic artefact (Herbeck et al. 2005); if so, reduced S values evolved independently in the two lineages.

Across the 80 species, values of S were found to be strongly positively correlated with both the number of rRNA operons and the number of tRNA genes in the genome, even after correction for the underlying phylogenetic relationships among species. Many of the species with low S values had only one rRNA operon and a minimal complement of (around 30–40) tRNA genes. In contrast, the E. coli genome has seven rRNA operons and 86 tRNA genes. These results were interpreted as reflecting selection for a co-adapted suite of genomic characteristics required for rapid growth (Sharp et al. 2005). For example, C. perfringens has 10 rRNA operons and 96 tRNA genes and can replicate in only 7 min under ideal conditions (Labbe & Huang 1995).

To test this association with growth rate, we have used minimum generation time data for 76 of these 80 species, drawn from the compilations made by E.P.C. Rocha (Rocha 2004; Couturier & Rocha 2006). rRNA operon number, tRNA gene number and S values are all strongly negatively correlated with generation time (figure 2). Using independent contrasts to overcome the fact that the data points are linked by an underlying phylogeny (Felsenstein 1985), the correlation coefficients for rRNA, tRNA and S are 0.35, 0.27 and 0.49, respectively, and all are highly significant (p < 0.01). Thus, selection for rapid growth appears to have selected for an increase in the number of rRNA operons and tRNA genes, and for codon usage more strongly biased towards translationally optimal codons.

Figure 2.

Correlations of (a) rRNA operon copy number, (b) tRNA gene copy number and (c) the strength of selected codon usage bias (S), with generation time in bacteria. The minimum generation time (in hours) is plotted on a logarithmic scale.

The observation that closely related species tend to have similar S values (as in figure 1) may reflect similarity of lifestyles, such that closely related bacteria are subject to similar strengths of selection for rapid growth. However, it is also likely that codon usage patterns change relatively slowly. Some of the outlier species in figure 2 could be on lineages that have recently entered a new niche. If a species changed from a lifestyle where rapid growth was advantageous, to one where it was not, it would take some time for strongly selected codon usage bias to decay. That is, the values of S reflect selection on codon usage over a long evolutionary period, but not necessarily the current strength of selection.

4. Variation in the strength of selection on codon usage bias among bacteria

An alternative approach, which aims to estimate the strength of current selection on codon usage bias, is to examine the frequency spectrum of optimal codons across polymorphic sites. In an equilibrium population, assuming an infinite sites model and free recombination among sites, the effect of selection on the frequency spectrum can be predicted (McVean & Charlesworth 1999). In the absence of selection, the distribution is expected to be U-shaped with a mean of 0.5, but as the strength of selection is increased, the distribution becomes skewed towards higher frequencies of optimal codons. Importantly, this distribution is not expected to be influenced by mutation biases (McVean & Charlesworth 1999). The observed distribution of allele frequencies can be compared to those predicted for different values of 2Nₑs, to obtain the maximum likelihood estimate of this compound parameter, termed gamma (Cutter & Charlesworth 2006). Note that both gamma and S (from the previous section) are estimates of 2Nₑs, but gamma differs from S in reflecting current, ongoing selection. Cutter & Charlesworth (2006) applied this approach to gene sequences from a eukaryote (Caenorhabditis remanei), and found a strong correlation between estimates of gamma and the strength of codon usage bias reflecting long-term evolution (as summarized here by S). Here, we use a similar approach to analyse bacterial codon usage.
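The shape of the predicted spectrum, and the idea of fitting gamma by maximum likelihood, can be sketched in a few lines of Python. The sketch assumes that the population frequency q of the optimal codon at polymorphic sites has density proportional to exp(gamma q) / (q (1 - q)), which is the weak-mutation limit of Wright's distribution, and that the sample is binomial. Whether this matches the likelihood used in the cited papers in every detail is an assumption. The function names, the midpoint-rule integration and the grid search are illustrative choices.

```python
import math


def expected_spectrum(n, gamma, steps=1000):
    """P(i optimal copies in a sample of n | site polymorphic), i = 1..n-1.

    Assumes density of q proportional to exp(gamma*q) / (q*(1-q)).
    """
    h = 1.0 / steps
    weights = []
    for i in range(1, n):
        # Midpoint rule for the integral of q^(i-1) (1-q)^(n-i-1) e^(gamma q).
        total = 0.0
        for k in range(steps):
            q = (k + 0.5) * h
            total += q ** (i - 1) * (1 - q) ** (n - i - 1) * math.exp(gamma * q)
        weights.append(math.comb(n, i) * total * h)
    norm = sum(weights)
    return [w / norm for w in weights]


def mean_optimal_frequency(spectrum):
    """Average frequency of the optimal codon across polymorphic sites."""
    n = len(spectrum) + 1
    return sum((i / n) * p for i, p in enumerate(spectrum, start=1))


def ml_gamma(counts, n, grid=None):
    """Grid-search ML estimate; counts[i-1] = sites with i optimal copies."""
    if grid is None:
        grid = [g / 10 for g in range(-10, 51)]  # -1.0 to 5.0 in steps of 0.1

    def loglik(g):
        spec = expected_spectrum(n, g)
        return sum(c * math.log(p) for c, p in zip(counts, spec) if c)

    return max(grid, key=loglik)


neutral = expected_spectrum(25, 0.0)   # U-shaped, symmetric about 0.5
selected = expected_spectrum(25, 1.7)  # skewed towards optimal codons
```

Mutation rates do not appear anywhere in the assumed density, which is the sense in which the spectrum is insensitive to mutation bias. With gamma = 0 the expected proportions reduce to values proportional to n / (i (n - i)), giving the U shape described above.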

We applied the method to 25 genome sequences of E. coli (including strains of Shigella ‘species’, which lie within the radiation of E. coli). First, we analysed polymorphic codon sites in the same 40 highly expressed genes used to estimate S above. All sites with non-synonymous variation, or more than two alleles, were excluded from the analysis, as were sites where the two alleles were both optimal or both non-optimal codons; the latter included sites encoding Cys and Lys, where no optimal codon was designated (table 1). Among nearly 6000 potentially synonymously variable codons, 194 were segregating for one optimal and one non-optimal codon. The average frequency of optimal codons across the polymorphic sites (qopt) was well in excess of 0.5 and the gamma value was estimated as 1.70 (table 2); this value is not substantially (and certainly not significantly) different from the S value of 1.49 estimated by comparing the codon usage of these 40 genes to that in the genome as a whole.

Table 2.

Estimates of the strength of selection for optimal codons from polymorphism data from E. coli and H. pylori.

We then examined 10 genes with low codon usage bias and expressed at low levels. From the 20 chromosomal genes encoding proteins with the lowest recorded copy numbers (between 50 and 100 per cell) in Ishihama et al. (2008), we selected those having codon adaptation index (CAI) values between 0.3 and 0.4. The CAI (Sharp & Li 1987) is a widely used species-specific measure of selected codon usage bias, which would take a maximum value of 1 for a gene using only optimal codons. In E. coli K-12, the range (99th percentile) of CAI values is from 0.15 to 0.74, with a median of 0.31. Thus, the 10 genes selected here do not have the lowest CAI values, but a substantial fraction of genes with lower values are either hypothetical or of likely foreign origin (i.e. owing to horizontal gene transfer). These 10 ‘low’ genes exhibited a much higher level of polymorphism for optimal versus non-optimal codons, consistent with much lower levels of constraint on codon usage in these genes; genes with lower codon usage bias also exhibit higher levels of interspecific divergence at synonymous sites (Sharp et al. 1989; but see also Berg & Martelius 1995; Eyre-Walker & Bulmer 1995). The average frequency of optimal codons among polymorphic sites across the 25 strains was very close to the value of 0.5 expected in the absence of selection and consequently the estimated gamma value was very close to zero (table 2).
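For readers unfamiliar with the CAI, the calculation can be sketched as the geometric mean of per-codon relative adaptiveness values w, where w is 1 for the most used synonym in a reference set of highly expressed genes and smaller for the others. The weights below are hypothetical, not E. coli values, and the function name is an illustrative assumption.

```python
import math


def cai(codons, weights):
    """Geometric mean of relative adaptiveness w over the codons of a gene.

    Codons without a weight (for example Met, Trp or stop codons)
    are skipped.
    """
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical relative adaptiveness values (w = 1 for the optimal codon).
W = {"UUC": 1.0, "UUU": 0.3, "CUG": 1.0, "CUU": 0.05}

all_optimal = cai(["UUC", "CUG", "AUG"], W)  # AUG has no weight and is skipped
mixed = cai(["UUC", "UUU"], W)
```

A gene using only optimal codons scores 1, the maximum noted above, and every non-optimal codon pulls the geometric mean down in proportion to how rarely that codon is used in the reference set.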

In contrast, we analysed seven genome sequences of H. pylori, where codon selection has previously been estimated to be very weak (S = 0.02; Sharp et al. 2005). We focused on the same 40 highly expressed genes as above, where selection (if present) should be strongest. In this species there is a difficulty identifying which codons are optimal, because there is little difference between the codon usage of highly expressed and other genes (Lafay et al. 2000). Therefore, we focused on the four amino acids used to derive S values, where the C-ending codons are expected to be optimal for biochemical reasons, even if they are not preferred because selection is ineffective. In contrast to the analysis of 40 highly expressed genes in E. coli, the average frequency of optimal codons across polymorphic sites was only just greater than 0.5, and the estimate of gamma was not significantly greater than zero (table 2).

Finally, we examined the sequences of five genes determined for 247 strains of C. perfringens (Rooney et al. 2006). Sixteen optimal codons for C. perfringens were defined by the same approach as applied to E. coli in table 1. The five genes vary in their strength of codon usage bias (measured by Fop in table 3), apparently reflecting differing levels of expression. These data are limited in terms of the number of polymorphic sites. Nevertheless, values of Fop and of the average frequency of optimal codons across polymorphic sites showed the same rank order across the five genes (table 3). Similar to E. coli, polymorphic sites in genes with low codon usage bias had average frequencies of optimal codons close to 0.5, yielding gamma values close to zero. The most highly expressed gene in the dataset, rplL, is one of the 40 genes in the highly expressed datasets used above, and seems representative of that dataset because its Fop value is very close to that obtained from the 40 genes as a whole (Fop = 0.647). The estimated gamma value for rplL was 3.28; the value has very wide confidence intervals reflecting the small number of polymorphic sites, but again it is quite close to the S value of 2.65 estimated for this species.

Table 3.

Estimates of the strength of selection (gamma) for optimal codons from polymorphism among 247 strains of C. perfringens.

These analyses of the frequency spectrum of optimal codons across polymorphic sites should be taken with caution, since the approach assumes that the sequences are drawn randomly from an interbreeding population at mutation-selection-drift equilibrium (McVean & Charlesworth 1999). It has been shown that a recent change in population size can have an erratic impact on the expected frequencies (Zeng & Charlesworth 2009), but the apparent consistency between the values of 2Nₑs estimated by gamma and by S suggests that such issues have not been important in the examples considered here. Furthermore, the frequency spectra for the two E. coli datasets analysed here, with 25 sequences and estimated gamma values of 0 and 1.7 (figure 3), appear (qualitatively) remarkably similar to the expected distributions for samples of 20 sequences from a diploid species, with gamma values of 0 and 4, shown by McVean & Charlesworth (1999).

Figure 3.

Distribution of the number of optimal codons at polymorphic sites in 25 strains of E. coli. Data are presented for two sets of genes: 10 genes expressed at low levels (black) and 40 genes expressed at high levels (grey; see also table 2).

The effect of population subdivision on the frequency spectrum, which may be particularly relevant to bacterial species, has not been investigated in detail. In a goodness-of-fit test, the site frequency spectrum for the E. coli low-expression genes differed significantly from that expected (χ² = 50.1, d.f. = 22, p < 10⁻³), largely because of the excess of sites with an optimal allele frequency of 7 (figure 3). Such an excess of sites with optimal codons segregating at intermediate frequencies might be expected in samples drawn from a subdivided population. However, the effect is quite small, suggesting again that, for these data, the extent to which the real populations violate the assumptions of the model has had little impact on the results.
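A goodness-of-fit statistic of this kind can be computed as in the following Python sketch, which compares observed site counts per frequency class with the proportions expected under a fitted model. The function name and the toy counts are hypothetical. The degrees-of-freedom convention (classes minus one, minus the number of fitted parameters) gives 22 for the 24 frequency classes of a 25-sequence sample with one fitted parameter, which agrees with the value reported above, although whether it was counted this way is an assumption.

```python
def chi_square_gof(observed, expected_probs, n_fitted_params=0):
    """Pearson chi-square statistic and degrees of freedom.

    observed: site counts per frequency class.
    expected_probs: model proportions for the same classes (summing to 1).
    """
    total = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        expected = total * p
        stat += (obs - expected) ** 2 / expected
    df = len(observed) - 1 - n_fitted_params
    return stat, df


# Toy example with three classes; the counts are invented for illustration.
stat, df = chi_square_gof([30, 40, 30], [0.25, 0.5, 0.25])

# Degrees of freedom for 24 classes and one fitted parameter.
_, df_spectrum = chi_square_gof([1] * 24, [1 / 24] * 24, n_fitted_params=1)
```

A single class with a large excess, such as the one at an optimal allele count of 7, can dominate the statistic even when the rest of the spectrum fits well.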

The main discrepancy between the observed and expected distributions concerns an excess of sites at extreme optimal codon frequencies in the highly expressed genes; i.e. the leftmost and rightmost grey columns in figure 3 are taller than would be predicted. A possible explanation for the excess of sites with low optimal codon frequencies is that there are certain sites where a codon that is normally optimal is not advantageous. This may be related to the context of the codon: while overall, the frequency of GAA rather than GAG for Glu is only increased a little in highly expressed genes in E. coli (table 1), it has been found that the preference for GAA is strong when the following codon begins with G, but weak in other contexts (Maynard Smith & Smith 1996; Berg & Silva 1997). Also, near the start of highly expressed genes in E. coli, the use of optimal codons is reduced and the frequency of A-ending codons is unusually high, seemingly reflecting conflicting selection pressures (Eyre-Walker & Bulmer 1993). Although we saw no obvious peculiarities to the sites where non-optimal codons were segregating at high frequencies, this merits further investigation.

5. Discussion

The extensive variation in codon usage patterns seen among bacteria is most likely primarily owing to differences in mutation biases. However, in many—but not all—species there is additional variation among genes that is consistent with the action of natural selection. The observation that genes expressed at high levels have increased frequencies of those codons that are expected to be translationally optimal is strongly suggestive that these codons are selectively favoured. The fact that, for some amino acids, the identity of the optimal codons differs among species, coordinated with changes in the population of tRNA genes, reinforces the view that this bias in codon usage is adaptive. However, numerous aspects of how and why natural selection has shaped patterns of codon usage remain unresolved.

To learn more about this we have applied two different approaches to estimate the strength of selection on codon usage in bacteria. Both methods provide estimates of 2Nₑs, which compounds the effective population size with the selective difference between optimal and non-optimal codons. However, the values (S) from one method reflect very long-term evolution, whereas those (gamma) from the other reflect ongoing selection at polymorphic sites. Gamma values might be expected to be particularly sensitive to the assumption that the sequences analysed came from an idealized population, but the gamma values estimated here were remarkably consistent with estimates of S from the same species. This should not always be the case. Across the phylogeny of bacteria, there have probably been many instances when selection pressures have changed. Then, for example, in a species where codon selection has recently stopped, the gamma value may be close to zero but the S value may still be high because biased codon usage may take a long time to decay (Lawrence & Ochman 1997). In addition, recent demographic changes may impact on the frequency of optimal codons at polymorphic sites and hence gamma (Zeng & Charlesworth 2009), without significant impact on S. At the extreme, Yersinia pestis (the cause of plague) appears to have gone through a very recent severe bottleneck (Achtman et al. 2004) and has such little nucleotide diversity that it would be very difficult to estimate gamma; however, its S value is 1.15 (Sharp et al. 2005), predominantly reflecting selection that occurred in the past in the species from which it was derived, Y. pseudotuberculosis.

Highly expressed genes in many bacteria have S values around 1. The magnitude of the selective difference between optimal and non-optimal codons would then be estimated as on the order of the reciprocal of the effective population size. The effective population sizes for bacterial species are probably not known with any accuracy, but typical values might be on the order of 10⁸ (Lynch 2007), implying that the fitness difference between an optimal and a non-optimal codon may be around 10⁻⁸. This is a minuscule value, perhaps reflecting the most subtle form of natural selection known, and only estimable because the selection is repeated over many sites.
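The arithmetic behind this order-of-magnitude estimate can be written out, taking S of about 1 and an effective population size of about 10⁸ from the paragraph above as rough round numbers rather than measured values:

```latex
S = 2 N_e s \approx 1, \qquad N_e \approx 10^{8}
\quad\Longrightarrow\quad
s \approx \frac{S}{2 N_e} = \frac{1}{2 \times 10^{8}} = 5 \times 10^{-9}
```

That is, s is on the order of 10⁻⁸, and it scales inversely with whatever value of the effective population size is assumed.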

This tiny selection coefficient raises a number of issues. One is whether the same form of codon selection can be operating in multicellular eukaryotes. The same approach to estimate S has been adapted for application in eukaryotes (dos Reis & Wernisch 2009). For D. melanogaster and Caenorhabditis elegans, S values of 1.08 and 1.96 were obtained. The same approach to estimate gamma values has been applied to Caenorhabditis remanei yielding an average value of 0.44 across genes, but with values greater than 1.0 in some genes (Cutter 2008). A number of other analyses have used alternative methods to estimate Nₑs for codon usage from polymorphism data of Drosophila species. These methods usually require an assignment of the ancestral state at a polymorphic site, which may be difficult in some cases, and especially error prone with bacteria; hence we did not use them here. These analyses have also yielded estimates of the same order of magnitude; for example, Maside et al. (2004) estimated Nₑs to be around 0.65 in D. americana. Thus, estimates of Nₑs for Drosophila and Caenorhabditis are of the same order of magnitude as those for bacteria. However, estimates of Nₑ are typically two orders of magnitude lower than the value given above for bacteria, implying that the fitness difference associated with optimal codons must be two orders of magnitude larger. This has led Lynch (2007) to question whether codon bias in these eukaryotes is caused by some other force, such as biased gene conversion, rather than selection.

A second issue concerns the many sites in the genome where selection on codon usage has occurred. Linkage between sites impairs the efficacy of selection on any one of them, analogous to reducing the effective population size (Hill & Robertson 1966). Bacteria typically have one relatively small chromosome, in which all of the highly expressed genes are linked, and so the strength of selected codon usage bias is expected to be reduced (Li 1987; McVean & Charlesworth 2000). Nevertheless, bacteria have various means of recombination, which vary in frequency among species. This variation in recombination rates could influence the strength of selected codon usage bias, although it has apparently not impacted on H. pylori, which has Nes close to zero, despite perhaps the highest rate of recombination known among bacteria (Suerbaum et al. 1998). If the various sites under codon selection in Drosophila and Caenorhabditis are much less tightly linked than those in bacteria, this could contribute to easing the paradox of similar estimates of Nes in eukaryotes and bacteria (Kaiser & Charlesworth 2009).

The reason why translationally optimal codons are advantageous is also unresolved: it is assumed that they can enhance translational efficiency and/or translational accuracy, but which is more important? The observation that variation among bacteria in the strength of selected codon usage bias is strongly correlated with growth rate (figure 2) may bear on at least one aspect of this debate. In arguing for the accuracy hypothesis, Drummond & Wilke (2008) suggested that non-optimal codons decrease fitness because mistranslated proteins can be toxic. Under this hypothesis, selection against non-optimal codons is stronger in more highly expressed genes because they have more opportunity to be mistranslated. However, it is not clear that the toxic effect of mistranslated proteins would be dependent on rapid growth rate. In contrast, it is obvious that the observed correlation of Nes with growth rate is consistent with the efficiency hypothesis. However, this does not rule out the possibility that inaccuracy of translation is selected against because of its negative impact on the efficiency of translation (Bulmer 1991).

Finally, given the observation that (in many species) codon usage in highly expressed genes is strongly selected and matches tRNA abundance, and yet the identity of the optimal codons can vary among species, there remains an intriguing question: how can this state of co-adaptation between the tRNA gene complement and the codon usage bias in highly expressed genes diverge over time? The observation that the strength of selection varies greatly among contemporary species suggests that there could have been times when ancestral species were subject to relaxed selection, due perhaps to a change of lifestyle or greatly reduced effective population size; selected codon usage bias would then drift and decay. After the re-imposition of selection pressure, the genome could then move to a co-adapted state different from that in the original ancestor. Alternatively, Shields (1990) has suggested that a prolonged influence of mutation bias could provide the impetus for a shift, without the need for a period of drift. Detailed analyses of switches in the identity of optimal codons across the phylogeny of bacteria may provide insights into which, if either, of these processes has played a major role in shaping the patterns of selected codon usage seen in bacteria.


We are indebted to Brian Charlesworth for discussion of various aspects of this topic. This work was supported by a studentship from the UK Biotechnology and Biological Sciences Research Council to L.R.E. and a Royal Society of Edinburgh/Caledonian Research Foundation Biomedical Personal Research Fellowship to K.Z.

