Lateral genetic transfer (LGT) is an important adaptive force in evolution, contributing to metabolic, physiological and ecological innovation in most prokaryotes and some eukaryotes. Genomic sequences and other data have begun to illuminate the processes, mechanisms, quantitative extent and impact of LGT in diverse organisms, populations, taxa and environments; deep questions are being posed, and the provisional answers sometimes challenge existing paradigms. At the same time, there is an enhanced appreciation of the imperfections, biases and blind spots in the data and in analytical approaches. Here we identify and consider significant open questions concerning the role of LGT in genome evolution.
Descent with modification has long been accepted as the framework within which the transmission of genetic determinants is best explained, at least in morphologically complex eukaryotes. To be sure, some aspects of eukaryotic biology lie on the edge of this paradigm (e.g. hybridization, epigenetics) or fall outside it altogether (origins of mitochondria and plastids), but these have been viewed as special cases that do not imply any fundamental problem with the standard framework. At first, it was assumed that genetic transmission in prokaryotes is fundamentally vertical too and, indeed, this explains the persistence of phenotypes over observable time, the topological congruence of gene and protein trees and many other observations. Prokaryotes, however, display a number of specialized processes unknown, or of lesser consequence, in plants or animals including transformation, transduction and conjugation, and the transfer of phenotype (e.g. antibiotic resistance or the ability to degrade xenobiotics) among strains is well known. Moreover, some gene trees are topologically discordant with each other and/or vis-à-vis accepted organismal relationships, and in many cases, this discordance is both statistically well supported and apparently recalcitrant to refinement of phylogenetic method. More recently, genomic sequencing has revealed patterns of gene presence and absence that cannot parsimoniously be reconciled with a purely vertical pattern of genetic transmission and gene loss. Most notably, it is now clear that in many prokaryotes, DNA can be acquired from sources other than the parent(s) and become incorporated, more or less stably, into the new genomic context. Here we offer a critical review focusing not so much on what is known about lateral genetic transfer (LGT), but instead on what, more than a decade into the multi-genome era, remain as open issues.
2. Process and mechanism
Genomic DNA of external origin has survived release from its original host, passage through a vector and/or the environment and uptake into the new host cell; it has evaded host defences (e.g. restriction enzymes), recombined into the genome and become prevalent in a population of interest. Each of these steps constitutes a selective filter that can be of greater or lesser consequence in different recipients and under different conditions. The traditional mechanisms of conjugation, transduction and transformation (Thomas & Nielsen 2005) build on a diversity of vectors including plasmids, integrons, phage, prophage-like ‘gene transfer agents’, transposons, retrotransposons, cassette-like chromosomal elements and others that together constitute the so-called mobilome (Frost et al. 2005). These elements have probably been co-opted secondarily into LGT (Redfield 2001). Once inside the new host cell, the novel DNA may or may not replace homologous genetic material. Fluorescence microscopy can reveal LGT (Christensen et al. 1996, 1998) and offers broad potential for characterization of process and mechanism: fluorescent tagging indicates, for example, that most DNA transferred via conjugation into Escherichia coli is recombined into the host chromosome (Babic et al. 2008).
(a) What determines exchange partners?
Successful conjugation or transduction requires vector compatibility between donor and recipient, which often depends on recognition of and interaction with recipient surface proteins (Thomas & Nielsen 2005). Nonetheless, conjugation between distantly related organisms such as bacteria and eukaryotes (plants and fungi) has been demonstrated experimentally (Heinemann & Sprague 1989); Agrobacterium tumefaciens, for example, can be induced to transfer its Ti plasmid to non-plant hosts including fungi and human cells in culture (Lacroix et al. 2006). Plasmids of the self-transmissible IncP-1 group have been isolated from both clinical and environmental (particularly wastewater) settings and harbour a wide range of resistance genes (Schlüter et al. 2007). These plasmids utilize mechanisms that allow their transfer, replication and maintenance in diverse Gram-negative hosts and can mobilize the transfer of other plasmids into an even wider range of target organisms. Under certain selective conditions, plasmids can expand their host range, often via a relatively small number of genetic changes (De Gelder et al. 2008). Broad-host-range phages are known as well: promiscuous immunoglobulin-like domains are found in three families of phages and can attach to a wide range of bacteria including Bacillus, Escherichia, Klebsiella, Lactobacillus and Staphylococcus (Fraser et al. 2006). Bioinformatic predictions of vector specificity are thus far lacking; predicting the interactions between pilus complexes and targeted surface proteins (many of which are uncharacterized) will require sophisticated structural and evolutionary models.
(b) To what extent do different mechanisms contribute to adaptive lateral genetic transfer?
Each step and mode of transfer exposes DNA to different types of modification. Viral DNA is subject to a much higher frequency of mutation (Drake 1991) and homologous and illegitimate recombination (Canchaya et al. 2003) relative to bacterial genomes. Genetic material, including physiologically important genes, carried in viral lineages for many generations can undergo radical changes (Filée et al. 2003; Lindell et al. 2004). The relative contribution of each LGT process is not well understood, but in many cases the processes leave tell-tale clues (Zaneveld et al. 2008). ORFans in bacteria are often enriched in A + T compared with the rest of the genome, suggesting residence in bacteriophage (Daubin & Ochman 2004). Genes on plasmids often show divergent nucleotide composition (Nakamura et al. 2004) and phylogenetic relationships (Gerdes et al. 2000) indicative of xenologous origin. Approximately 10 per cent of known bacterial genomes harbour one or more integrons (Boucher et al. 2007), which appear to be more mobilizable and of broader phyletic distribution than recognized until recently. The presence of transposon or integrase sequences suggests past mobility (Vernikos & Parkhill 2008). A particularly long integrated region (e.g. tens of thousands of base pairs) may constitute evidence for transduction rather than conjugation or transformation, whereas short regions do not distinguish among potential mechanisms because not all introgressed DNA may have been recombined into the host genome, or preserved thereafter. New insights and approaches are required to deal with mechanisms that leave no trace (other than the integrated genetic material), with unknown mechanisms, and with mutational decay of characteristic sequences.
(c) How does environment constrain lateral genetic transfer?
Natural environments (e.g. soil) can potentially be conducive to LGT. Concentrations of free DNA can exceed 1 µg/g (Niemeyer & Gessler 2002; Vlassov et al. 2007), and under some conditions, DNA can persist for millennia (Austin et al. 1997) although its bioactivity may be modulated, e.g. by nucleases or humic acids. Some 90 species of prokaryotes, including some that live in soil or water, are known to be naturally competent (Nielsen et al. 1998; de Vries & Wackernagel 2005). Potential hot spots in the environment for recombination and LGT include biofilms, the rhizosphere, decomposing material, guts of soil animals and within bacterivorous protozoa. The nature, frequency and determinants of LGT in these complex environments are beginning to be approached via metagenomic sequencing.
It has been assumed that a key limitation on the transfer of genetic material is the need for donor and recipient to be present either in the same habitat, or in two habitats that are linked by a biotic or an abiotic ‘bridge’. Generalist organisms such as Pseudomonas aeruginosa have relatively large genomes and can survive in a wide range of habitats, readily expanding and contracting their genetic content to facilitate new adaptations (Schmidt et al. 1996). A striking example of genomic plasticity is the recent identification in a clinical strain of P. aeruginosa of genes for degradation of secondary metabolites of trees (Mathee et al. 2008). Environmental ranges may have been underestimated: Chlamydiae have recently been discovered inside amoebae and in wastewater (Horn & Wagner 2001) and Rickettsia are viable in amoebae (Ogata et al. 2006), both likely LGT hot spots. Viruses are capable of moving between habitats (Breitbart & Rohwer 2005). Can DNA be exchanged, perhaps inefficiently, across the entire biosphere?
Intracellular bacteria have been thought to be largely unaffected by LGT, due to their isolation from potential donors (Renesto et al. 2005). Prophages, plasmids and transposons (Bordenstein & Reznikoff 2005) are, however, found in many intracellular bacteria including Buchnera, which displays no evidence of LGT. Conjugative plasmids occur in the intracellular parasite Rickettsia felis (Ogata et al. 2005) and may be important in the evolution of virulence (Gillespie et al. 2007). There is evidence for extensive between-strain recombination in the intracellular parasite Chlamydia trachomatis, although many recombined sequences are too long to have been acquired in single transformation events, and no potential vector has been identified (DeMars & Weinfurter 2008). Given that exchange is possible, are the constraints to LGT mainly environmental (e.g. lack of suitable genes, selective pressure for small genome size) or internal (configuration of the reduced cellular network) (Jain et al. 2003)?
(d) What is the physical unit of transfer?
Lateral origins have been proposed for DNA regions ranging from seven nucleotides (Denamur et al. 2000) up to an entire chromosome greater than 3 Mb (Lin et al. 2008) in length and constituting stretches of non-coding DNA, portions of genes, intact genes, multi-gene clusters, operons, plasmids, transposable elements and pathogenicity islands. For this reason, we refer to lateral (or horizontal) genetic, not gene, transfer. As it is doubtful that a Mb-sized region could be successfully transferred except by cell fusion, or integrated all at once into the host regulatory network, these largest regions are presumably built up via a succession of transfers, either primary LGT events targeting a receptive area or selection-driven reshuffling of originally dispersed elements (van der Does & Rep 2007).
Would whole-gene LGT be selectively most advantageous? In the immediate term, especially where the introgressed gene does not require efficient regulation and the protein is functional without having to be part of a complex, whole- and multi-gene LGT offer immediate functionality (e.g. antibiotic resistance). On the other hand, significant up-front costs to competitive fitness might be incurred before amelioration and recruitment of regulators eventually allow efficient stoichiometric participation in cellular networks; these costs, however, might be lessened by the silencing or transcriptional down-regulation of incoming genes, whether actively by the H-NS system in Gram-negative bacteria (Dorman 2007) or passively by regulator and codon-profile mismatch.
Homologous replacement of part of a gene (within-gene LGT) might avoid some of these problems but raises others, including possible disruption of the folding, catalysis or interactions of the chimaeric gene product. Nonetheless, some examples have been identified in prokaryotes, including within the rrnB operon (Yap et al. 1999) and the mutU and mutS genes (Denamur et al. 2000); many others are implicit from genome-wide scans of recombination breakpoints (Mau et al. 2006). Of two instances of LGT within genes encoding EF-1α (Inagaki et al. 2006), one preserves existing domains but the other interrupts domains. Are domains (sometimes) the unit of LGT? Chan et al. (2009) found that 25.0 per cent of 1462 sets of orthologous genes exhibit at least one within-gene recombination breakpoint. These breakpoints are not uniformly distributed among gene sets, but preferentially occur in sets with annotated SCOP domains and often interrupt these domains. Thus, DNA regions that encode full-length protein domains (which we might term domons) are not privileged units of LGT. Many genes have mixed heritage, and greater precision is needed in describing them as vertical or lateral, concordant or discordant.
There is a broad consensus that LGT has contributed to genome evolution in prokaryotes (and to some extent in eukaryotes: see below). There is much less agreement on its quantitative extent: estimates range from close to zero, to more than one LGT event per gene per genome. At what point do lateral signatures obscure the vertical, moving LGT from an important but secondary player to the centre stage of genome evolution?
(a) How should we think about and express the extent of lateral genetic transfer?
In the phylogenetic approach, each instance of topological discordance between a gene tree and a trusted reference tree is taken as a prima facie instance of LGT. Discordance can be found throughout the entire range of nodal depths within these trees, from recent (genera, species) to older, presumably reflecting a commerce in genetic material that has been ongoing since pre-genomic times (Woese 2000). Viewed in this way, every genome has LGT in its ancestry. This very situation, however, implies that whatever meaning there may be in the concept genome phylogeny does not extend simply to each constituent gene or region. If we instead ask what proportion of genes (or other analysable units) in a genome exhibit, back to the deepest defined node, a path that is discordant with a reference hypothesis (e.g. a reference topology, or monophyly of a particular taxon), the number can be 25–50% or more (Nelson et al. 1999; Kunin & Ouzounis 2003; Creevey et al. 2004; Lerat et al. 2005; Zhaxybayeva et al. 2006; Dagan & Martin 2007; Shi & Falkowski 2008). But this approach gives multiple votes to basal edges, which lie on many paths and thus may be traversed multiple times, without necessarily taking into account the concordance of subtrees and, in most implementations, the support for individual edges. In an extreme case, all genes descending from a single ancient transfer event might be counted as lateral, even if every subtended node and edge agrees perfectly with the reference tree.
Thus, a third approach is to focus on the proportions of recognizable genetic transmission steps (edges) that are vertical or horizontal, i.e. on the proportion of well-supported bipartitions that are discordant with a reference hypothesis. Beiko et al. (2005) identified 13.4 per cent of nearly 100 000 bipartitions, over 144 prokaryotic genomes, as discordant at posterior probability (PP) ≥ 0.95 with a reference supertree; analysing a 165-genome dataset, Kunin et al. (2005) found 4.7–5.2% of events to be LGT, 11.1–11.6% gene losses and 83.4–83.6% vertical transfers. At an even finer level of resolution, each edge represents many millions of organismal generations but, even so, many internal edges remain consistent with the reference tree: thus instantaneously, vertical events (which, however, cannot be distinguished from LGT among very close relatives) are vastly more likely to persist than are lateral transfers.
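The edge-based view can be illustrated with a minimal sketch. It assumes, purely for illustration, that every tree covers the same set of genomes, that trees are written as nested tuples of leaf names, and that a bipartition counts as discordant simply when it is absent from the reference tree. The function names are hypothetical and support thresholds (such as PP ≥ 0.95) are omitted; this is not the procedure used in the studies cited above.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (internal edges) of a tree.

    Each bipartition is stored as the side that excludes the
    alphabetically first leaf, so the position of the root is irrelevant.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def discordant_fraction(gene_trees, reference):
    """Fraction of gene-tree bipartitions that are absent from the reference."""
    ref = bipartitions(reference)
    total = discordant = 0
    for tree in gene_trees:
        for split in bipartitions(tree):
            total += 1
            if split not in ref:
                discordant += 1
    return discordant / total if total else 0.0


# Toy data (hypothetical): one concordant gene tree, one with A and C grouped.
reference = ((("A", "B"), "C"), ("D", "E"))
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
print(discordant_fraction(gene_trees, reference))  # 0.25
```

Counting edges in this way gives each inferred transmission step one vote, in contrast to the path-based count, in which a basal edge is traversed by many paths.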
Each of these approaches addresses a distinct question and yields a distinct perspective on the contribution of LGT to genome evolution. More generally, we might ask whether it is better to count (inferred) transfers or to model transfer. A quantitative model would attempt to account for events that are not empirically detectable by comparative approaches (e.g. due to their transitory existence, near-identity of sequence or obliteration by subsequent events) or fall short of a defined statistical threshold. Promising steps have been taken in Bayesian (Suchard 2005) and likelihood (Galtier 2007) frameworks; as further genomes are sequenced, it will be possible to estimate parameter values for models of increasing complexity.
(b) Can phylogenetic and non-phylogenetic approaches provide complementary evidence?
Diverse problems can afflict tree-based approaches: non-robust or inconsistent methods, model violation, lack of convergence in Markov chain Monte Carlo sampling, paralogy, ancient lineage-sorting, inadequate taxon sampling and others. Over 22 432 gene trees, Beiko et al. (2005) showed that the underlying data offer the same statistical support to instances of topological incongruence (i.e. LGT) as to congruence (vertical descent), but deeper systematic problems may remain hidden. Non-phylogenetic (surrogate) methods may complement the phylogenetic approach: a strongly G+C-biased gene, for example, might violate phylogenetic models but be readily detected by its anomalous composition. Surrogate methods differentially detect LGT of different relative ages (Ragan et al. 2006) and could be chosen specifically to complement weakly supported nodes. The applicability of surrogate data may be case specific: gene (DNA tract) order is much better conserved in Yersinia (Darling et al. 2008), for example, than in Escherichia. But in general, does the union of all prediction sets represent the best estimate of total LGT (Lawrence & Ochman 2002; Ragan 2002)?
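As a rough illustration of a compositional surrogate method, the sketch below flags genes whose G+C content lies far from the genome-wide mean. The function names, the z-score criterion and the cut-off of 2.0 are assumptions made for this example; published methods typically use richer signals (e.g. codon usage or oligonucleotide frequencies) and, as noted above, such approaches may miss ancient, well-ameliorated transfers.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Return names of genes whose G+C content deviates from the mean
    of all genes by more than z_cutoff standard deviations.

    genes: dict mapping gene name -> nucleotide sequence.
    """
    if not genes:
        return []
    values = {name: gc_content(seq) for name, seq in genes.items()}
    gc_values = list(values.values())
    mu = mean(gc_values)
    sigma = pstdev(gc_values)
    if sigma == 0:
        return []  # every gene has identical composition
    return sorted(
        name for name, gc in values.items() if abs(gc - mu) / sigma > z_cutoff
    )


# Toy genome (hypothetical): nine genes at 50% G+C and one at 100%.
genes = {f"gene{i}": "ATGC" * 5 for i in range(9)}
genes["alien"] = "GGCC" * 5
print(atypical_genes(genes))  # ['alien']
```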
Flanking phage integrase genes or integration motifs are often taken as ‘smoking gun’ evidence of LGT in prokaryote genomes. Cho & Palmer (1999) made an analogous argument for co-conversion tracts flanking a group I intron in angiosperm mitochondria. Repeating the analysis with more-extensive taxon sampling, Cusimano et al. (2008) concluded that intron distribution is best explained by vertical transmission plus multiple losses and, hence in this case at least, co-conversion tract footprints are unreliable markers for LGT.
(c) How much lateral genetic transfer is undetectable?
Methodological approaches are differentially blind to LGT. Methods based on atypical composition or hidden Markov models may not detect ancient, well-ameliorated transfers (Ragan et al. 2006); gene distributions brought about by LGT may be mistakenly attributed to gene loss (and vice versa); transfers between sister leaves are phylogenetically silent. Poptsova & Gogarten (2007) compared three tests used in conjunction with phylogenetic approaches: the ‘approximately unbiased’ (AU) test, the Robinson–Foulds symmetric distance and bipartition spectra. For in silico ‘transfers’ superimposed on the terminal branches of a γ-proteobacterial phylogeny, bipartition spectra failed to detect only 3 per cent and 6 per cent of transfers at 70 per cent and 90 per cent cut-offs, while generating fewer than 4 per cent and 2 per cent false positives, respectively. The AU test was less sensitive (and requires knowledge of the true tree), whereas at 97.5 per cent cut-off the Robinson–Foulds distance detected only 58 per cent of reciprocal exchanges and 60 per cent of orthologous replacements. The inability of most approaches to detect transfers among sister leaves is particularly problematic, as diverging strains are exactly the ones most likely to live in similar environments, possess similar gene contents and genomic tracts amenable to homologous recombination, and harbour the same vectors; on the other hand, most LGT between sisters may be transitory ‘churn’ (futile turnover) that is best ignored in most contexts.
(d) Can aggregate methods provide a reliable reference tree?
By integrating over diverse genome regions and cellular functions, aggregate approaches may avoid biases inherent in simple hypotheses or individual approaches, particularly if data giving rise to discordant signals are first removed. Aggregation might be accomplished via a supermatrix of concatenated sequence data or a supertree computed from well-supported features of individual gene trees, or be captured in an overall distance measure based on gene content, order or pairwise similarity. But is the resulting aggregate signal that of a vertical phylogenomic core? Two objections have been raised: that this is philosophically the wrong view of genome evolution and that vertical signal may be obscured by LGT. The former is considered in two other papers in this issue (Doolittle 2009; Fournier et al. 2009); the latter objection can be approached via computational simulation. Ge et al. (2005) found that even at several LGT events per gene family, the vertical backbone can be identified clearly. Based on more-extensive simulations, Beiko et al. (2008) show that vertical signal is preserved even at high LGT frequencies except in the presence of biases (e.g. habitat-directed bias) that favour exchange among more distantly related genomes. When such biases are present, aggregate histories reflect neither the true vertical history nor the lateral alternative. Supertree methods may be less susceptible to some LGT biases (Gribaldo & Brochier-Armanet 2006).
(e) Can the biological sources of transfers be identified?
LGT can be modelled as a break-and-reanneal operation on a graph. A gene (together with the subtree of its descendants) is broken off the reference (genomic or organismal) tree and reannealed at a topologically different position on its gene tree; the initial breakage point thus represents its ultimate biological source. This same point, however, also represents all the extinct, undiscovered and un-sequenced lineages that would, if known, descend from that edge (Gogarten & Townsend 2005), so the reference tree itself may tell us little about the biology of the primary donor. Where multiple break-and-reanneal steps are required, a shortest path through intermediate host lineages can often be inferred (Beiko & Hamilton 2006), but there is no guarantee that LGT has taken the shortest computational path: the intermediates may likewise be extinct, unknown or un-sequenced, or the path may be ecologically improbable. Kunin et al. (2005) propose that ‘hubs’ in LGT networks serve as ‘gene banks’ that acquire and redistribute genes within microbial communities; as above, these hubs too may represent extinct, unknown and un-sequenced organisms.
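A single break-and-reanneal step can be sketched as a subtree prune-and-regraft operation. The example below assumes rooted binary trees written as nested tuples of leaf names, identifies clades by their leaf sets, and uses hypothetical function names; it illustrates the operation itself and is not the path-inference method of Beiko & Hamilton (2006).

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def find_clade(tree, wanted):
    """Return the subtree whose leaf set equals `wanted`, or None."""
    if leaves(tree) == wanted:
        return tree
    if isinstance(tree, str):
        return None
    for child in tree:
        hit = find_clade(child, wanted)
        if hit is not None:
            return hit
    return None


def prune(tree, moved):
    """Break: remove the clade `moved` and collapse the node left with one child."""
    if isinstance(tree, str):
        return tree
    kept = [prune(child, moved) for child in tree if leaves(child) != moved]
    return kept[0] if len(kept) == 1 else tuple(kept)


def regraft(tree, target, subtree):
    """Reanneal: attach `subtree` as sister to the clade `target`."""
    if leaves(tree) == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    return tuple(regraft(child, target, subtree) for child in tree)


def prune_and_regraft(tree, moved, target):
    """Move the clade with leaf set `moved` next to the clade `target`."""
    moved, target = frozenset(moved), frozenset(target)
    subtree = find_clade(tree, moved)
    if subtree is None or moved == leaves(tree):
        raise ValueError("moved clade not found, or is the whole tree")
    if moved & target:
        raise ValueError("target must not overlap the moved clade")
    remainder = prune(tree, moved)
    if find_clade(remainder, target) is None:
        raise ValueError("target clade not found after pruning")
    return regraft(remainder, target, subtree)


# Toy example (hypothetical): the gene in lineage C is reannealed beside D.
reference = ((("A", "B"), "C"), ("D", "E"))
print(prune_and_regraft(reference, {"C"}, {"D"}))
# (('A', 'B'), (('D', 'C'), 'E'))
```

In this representation an edge is named by the clade below it, which mirrors the caveat above: the edge stands for every lineage, sampled or not, that descends from it.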
Composition-based methods offer even lower resolution. As described above, DNA of lateral origin has passed filters that can skew its composition. In most γ-proteobacteria, ORFans (genes without significant matches in current databases) are A+T-rich, suggesting a phage ancestry (Daubin et al. 2003; Daubin & Ochman 2004). Successive LGT events can superimpose DNA onto previously incorporated regions, resulting in a genomic pastiche that may defy analysis; Chan et al. (2009) found that 5.5 per cent of 1462 orthologue families exhibit complex recombination patterns consistent with successive layering of LGT. Gene content can be useful: a gene showing sparse, anomalous presence in one lineage might be inferred to have been transferred from a group in which it is ubiquitous.
If LGT has been frequent, then physiology (hence niche) has presumably changed repeatedly over time. A probable example is the plant pathogen Erwinia carotovora subsp. atroseptica, which shares the common enterobacterial backbone but has gained many specializations for its plant-pathogenic lifestyle via LGT from other bacteria living ‘in and around plants’ (Toth et al. 2006).
(f) Is lateral genetic transfer less frequent among eukaryotes?
DNA appears to be readily transferred from intracellular prokaryotes to the eukaryotic host nucleus: 18 per cent of Arabidopsis nuclear genes may be of cyanobacterial, presumably proto-plastid, origin (Martin et al. 2002) and much of Wolbachia is found in its arthropod and nematode host nuclei. In Drosophila, 2 per cent of the transferred Wolbachia genes are expressed (Dunning Hotopp et al. 2007). Plant-associated nematodes appear to have acquired cellulases and pectinases from plant-associated bacteria (Mitreva et al. 2005), while genes of probable prokaryotic origin are responsible for cellulose synthesis (Nakashima et al. 2004), starch degradation (Da Lage et al. 2007), shikimate biosynthesis (Richards et al. 2006) and detoxification (Burroughs et al. 2006) in various animals. Gene flow in the opposite direction appears to be less common, although possible examples have been put forward (Ponting et al. 1999; Jenkins et al. 2002; Chen et al. 2007).
LGT is well documented among fungi (Oliver & Solomon 2008). For example, an 11-kb DNA region that includes a gene (ToxA) for toxin production and a transposase has moved from the nuclear genome of one wheat pathogen, Pyrenophora tritici-repentis, to that of another, Stagonospora nodorum (Friesen et al. 2006). Plant mitochondria, although not plastids, exchange genes across species (Richardson & Palmer 2007) and might serve as portals into plant nuclear genomes. On the other hand, LGT is infrequent in yeasts and animals (Andersson 2005), although Burghoff et al. (2008) provide evidence for transfer of the enhanced green fluorescent protein (EGFP) marker from human endothelial cells to rat cardiomyocytes via apoptotic bodies after intracoronary transplantation; they found no evidence for nanotubular connections between cells, nor for a viral vector. It is not known whether the low frequency of LGT among animals is primarily physical (protection of the germ line from intracellular bacteria) or regulatory (difficulty in integrating into a complex regulatory network, e.g. involving non-coding RNAs).
Most eukaryotic diversity lies among protists. Relatively few protist nuclear genomes have been sequenced, but limited data suggest that phagotrophic protists may have LGT frequencies comparable to those of bacteria (Richards et al. 2003; Andersson 2005).
It is well known that genes encoding uptake, transport and metabolic functions (which can convey selective advantage in certain environments) have often been transferred laterally: examples include bacterial transporters (Gelfand & Rodionov 2008), nitrogen fixation (Raymond et al. 2004), type III secretory systems (Tobe et al. 2006) and many others. But no category of gene is immune to LGT: even informational genes involved in DNA mismatch repair (Lin et al. 2007) and translation elongation (Inagaki et al. 2006) show strong evidence for transfer. Like gene duplication, LGT can provide fodder for neofunctionalisation: for instance, the type IV secretion system in Bartonella appears to have been acquired via LGT, following which its function changed from conjugation to erythrocyte adherence (Nystedt et al. 2008). With the help of mobile genetic elements, novel pathways can be assembled by LGT (Springael & Top 2004).
(a) How long do laterally transferred sequences persist in a genome?
Individual genome sequences are snapshots of evolution: some genes are well integrated and will presumably remain in the lineage for a long term, while others are transient and unlikely to persist. The highest rates of gene gain are found at the leaves of the tree, implying that many genes gained earlier were subsequently lost (Berg & Kurland 2002; Hao & Golding 2006). Not all lateral genes, however, turn over rapidly; in the γ-proteobacteria, for example, some persist vertically for great lengths of time and even become characteristic of the group (Lerat et al. 2003). At least in E. coli/Shigella, genes acquired from distantly related bacterial groups are less likely to persist than ORFans (von Passel et al. 2008). Eukaryotic groups too may be defined by lateral genes that were fixed in a common ancestor: red algae and green plants, for instance, uniquely among eukaryotes share TOP6B, a gene of archaeal origin (Huang & Gogarten 2006). Lima et al. (2008) classified genomic islands of Xanthomonas into two categories based on atypicality of nucleotide compositions; deviant islands, which potentially have been acquired more recently, are rich in mobile genetic elements, whereas islands of typical (ameliorated?) composition harbour many more genes involved in metabolism and cellular processes.
(b) How are lateral genes connected to cellular networks?
The workings of the cell can be abstracted as a network in which molecules (genes or proteins) are represented as nodes (vertices) and physical interactions as edges. Given the state of knowledge, these representations contain false-positive and false-negative edges and ignore spatial and temporal details. Nonetheless, important generalizations emerge: (i) cellular networks contain both highly and weakly connected nodes; (ii) a species-specific core subset of nodes is present in all strains of a species, while other nodes may be found in most, some or only a few strains; (iii) core nodes are chromosomal, whereas peripheral nodes may be encoded on the chromosome, islands or plasmids; (iv) core nodes typically describe functional units, e.g. operons or macromolecular complexes; (v) core nodes tend to be more highly connected and more highly expressed than peripheral nodes; (vi) core-node genes accumulate point mutations more slowly than do peripheral-node genes; (vii) peripheral nodes are more often implicated in functions that are directly affected by the environment and, at least in human, the periphery of the protein-interaction network maps approximately to the cellular periphery; and (viii) networks evolve by the addition of peripheral nodes (Pál et al. 2005; Kim et al. 2007; Lercher & Pál 2008; Wellner et al. 2007; Davids & Zhang 2008). These features of biological networks contribute to their evolvability (Oikonomou & Cluzel 2006).
Because they are highly connected, require tighter stoichiometric and expression control, are less-exposed to immediate selective pressure or are evolutionarily conservative, core-node genes are less susceptible to homologous replacement via LGT than their peripheral counterparts (Jain et al. 1999; Papp et al. 2003; Aris-Brosou 2005). At least, this appears true for lateral genes identified by compositional bias, i.e. those more recently arrived in the genome, although not for those identified by tree comparison (Wellner et al. 2007). Under a generous definition, the Streptococcus core genome is about 18 per cent recombinant while the Streptococcus pyogenes core may be 35 per cent recombinant. Thus, lateral genes that survive the initial ‘churn’ can slowly become better integrated into host-cell networks, e.g. by recruiting transcriptional regulators (Navarre et al. 2007; Wellner et al. 2007; Lercher & Pál 2008), recognizing transcription, translation, folding and assembly signals (Lercher & Pál 2008), and otherwise accommodating kinetically and thermodynamically. Most global regulators have evolved vertically, whereas many local regulatory nodes have been acquired by LGT, often simultaneously with the gene(s) they regulate (Price et al. 2008). Interestingly, in both E. coli and yeast, lateral genes preferentially interact with core nodes (Eisenberg & Levanon 2003; Davids & Zhang 2008).
Network remodelling can occur on a large scale, e.g. during invasion of a new niche: lactobacilli making the transition to life in a nutritionally rich medium lose 600–1200 genes and gain almost 100 others, perhaps by LGT (Makarova et al. 2006). Except in the case of extreme genomic reduction, cellular networks remain evolvable, yet robust to the vicissitudes of short-term LGT.
(c) Do all prokaryotes have a stable core genome?
The concept of a core genome (as in the preceding section) is interrelated with the delineation of groups of organisms. Groups can be delineated, perhaps hierarchically, based on sequence hybridization or identity (Stackebrandt & Goebel 1994), population-level processes (Cohan 2002) or a hybrid of genomics and ecology (Staley 2006). A set of universally (or nearly universally) conserved genes may define each group. Among eight Streptococcus agalactiae genomes, the core was estimated to be approximately 90 per cent as large as the smallest individual genome (Tettelin et al. 2005), whereas in E. coli, the core genome of the first three sequenced strains was about 70 per cent the size of E. coli K12, demonstrating much higher variation in gene content (Welch et al. 2002). In a subsequent analysis of 20 sequenced E. coli strains, the use of a slightly relaxed core criterion drastically increased the estimated size of the core (Charlebois & Doolittle 2004; Konstantinidis et al. 2006). The relationship between strict and relaxed cores should be explored further, to understand why genes sufficiently important to remain in most organisms can be lost in a small subset.
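The difference between a strict and a relaxed core criterion can be made concrete with a small sketch. The function name core_genome, the min_fraction threshold and the toy strains and gene-family labels below are all illustrative assumptions, not data or methods from the studies cited above; real analyses also depend on how gene families are delineated.

```python
from typing import Dict, Set


def core_genome(strains: Dict[str, Set[str]], min_fraction: float = 1.0) -> Set[str]:
    """Return gene families present in at least min_fraction of the strains.

    min_fraction = 1.0 gives the strict core; lower values give a relaxed core.
    """
    if not strains:
        return set()
    n = len(strains)
    counts: Dict[str, int] = {}
    for genes in strains.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c / n >= min_fraction}


# Hypothetical toy data: each strain lacks a different gene family.
toy = {
    "strain1": {"A", "B", "C", "D"},
    "strain2": {"A", "B", "C", "E"},
    "strain3": {"A", "B", "D", "E"},
    "strain4": {"A", "C", "D", "E"},
}

strict = core_genome(toy, min_fraction=1.0)    # families found in every strain
relaxed = core_genome(toy, min_fraction=0.75)  # families found in >= 3 of 4 strains
print(sorted(strict), sorted(relaxed))
```

In this toy case only family A is in the strict core, whereas all five families pass the relaxed threshold, because each of the others is missing from just one strain.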
One (perhaps the dominant) genomic component of a species is its ‘clonal frame’, ancestrally inherited genetic material with conserved linkage. This clonal frame may, however, be progressively disrupted as genes are introduced via LGT from outside the clonal population and through other events such as gene loss (Milkman & Bridges 1990). Different species and populations show different levels of clonality: Salmonella enterica, for example, shows high levels of clonality and linkage disequilibrium, whereas Neisseria gonorrhoeae exhibits a much higher rate of mixing and almost no clonal frame (Sriramulu 2008). Linkage equilibrium has been observed in natural populations such as those of Halorubrum (Papke et al. 2004). The core can also include genes that were introduced more recently than the common species ancestor and have spread through the population via LGT; these components should exhibit different evolutionary histories relative to the clonal frame and may not show conserved linkage relative to the clonal frame in different strains or isolates. Most analyses of clonal frame have sampled a few housekeeping loci from many isolates of a given species rather than surveying entire genomes. Such analyses have revealed, for instance, contrasts in the population structures of environmental versus pathogenic members of a species (e.g. Bisharat et al. 2007). Empirical studies of the stability of clonal frames within or outside the taxonomic group of interest, and of the rate and mode of clonal frame decay at different phylogenetic depths, are, however, thus far lacking. The existence of a clonal frame might provide a strong hypothesis for organismal relationships among members of a species, while acquired core genes may point to functions that are sufficiently valuable to have become fixed.
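As a minimal illustration of the contrast between linkage disequilibrium and linkage equilibrium, the sketch below computes the standard pairwise statistics D and r² for two biallelic loci. The function name and the toy haplotype samples are hypothetical; the multilocus studies cited above use their own measures and real isolate data.

```python
from typing import List, Tuple


def linkage_disequilibrium(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r2) for two biallelic loci, alleles coded 0 or 1.

    D  = p_AB - p_A * p_B
    r2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either locus is monomorphic; report 0.0 in that case.
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


# Hypothetical samples: alleles always co-inherited vs. freely recombined.
clonal = [(0, 0)] * 5 + [(1, 1)] * 5
mixed = [(0, 0), (0, 1), (1, 0), (1, 1)] * 3

d_clonal, r2_clonal = linkage_disequilibrium(clonal)
d_mixed, r2_mixed = linkage_disequilibrium(mixed)
print(r2_clonal, r2_mixed)
```

A strongly clonal sample gives r² near 1, whereas a sample in which all allele combinations are equally frequent gives r² near 0.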
(d) What factors determine the size and composition of pan-genomes?
Sonea & Panisset (1983) envisioned all prokaryotes as forming a global gene pool, from within which different subsets of genes come together in unlimited, if temporary, combinations. This concept has been refined to the idea of the pan-genome, the entire repertoire of genetic information in all genomes of a given species (Tettelin et al. 2005). Estimates of pan-genome size are affected by non-random strain sampling and extrapolation effects, but in some cases reach an order of magnitude greater than the size of any individual genome (Medini et al. 2005). In other cases (e.g. Bacillus anthracis), the pan-genome appears closed, and a few genomes yield a nearly complete inventory. Habitat range appears to be important in determining pan-genome size, with the broad-range S. agalactiae having a much larger pan-genome than S. pyogenes, primarily found in nasal mucosa (Lefébure & Stanhope 2007). Susceptibility to different mechanisms of LGT may also play a role in determining pan-genome size.
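Whether a pan-genome looks open or closed is often judged from how the cumulative gene count grows as genomes are added. The following is a minimal sketch of such an accumulation curve, averaged over random strain orderings; the function names and the toy gene sets are assumptions for illustration, and the sketch does not reproduce the extrapolation models used in the cited studies.

```python
import random
from typing import Dict, List, Set


def pan_genome_curve(strains: Dict[str, Set[str]], order: List[str]) -> List[int]:
    """Cumulative number of distinct gene families as strains are added in order."""
    seen: Set[str] = set()
    sizes: List[int] = []
    for name in order:
        seen |= strains[name]
        sizes.append(len(seen))
    return sizes


def mean_curve(strains: Dict[str, Set[str]], n_perm: int = 100, seed: int = 0) -> List[float]:
    """Average the accumulation curve over n_perm random strain orderings."""
    rng = random.Random(seed)
    names = sorted(strains)
    totals = [0] * len(names)
    for _ in range(n_perm):
        rng.shuffle(names)
        for i, size in enumerate(pan_genome_curve(strains, names)):
            totals[i] += size
    return [t / n_perm for t in totals]


# Hypothetical toy data: identical gene sets vs. one unique gene per strain.
closed = {s: {"g1", "g2", "g3"} for s in ("a", "b", "c")}
open_ = {"a": {"g1", "g2", "x1"}, "b": {"g1", "g2", "x2"}, "c": {"g1", "g2", "x3"}}

closed_curve = mean_curve(closed, n_perm=10)
open_curve = mean_curve(open_, n_perm=10)
print(closed_curve, open_curve)
```

A curve that plateaus after a few genomes corresponds to a closed pan-genome, while one that keeps rising with each added genome suggests an open one. Averaging over orderings reduces, but does not remove, the effect of non-random strain sampling.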
Genes that have recently joined the accessory genome may reflect current and recent environmental stresses experienced by different isolates within a species (Thomason & Read 2006). This hypothesis is borne out by many analyses of core versus accessory genomes. A comparative analysis of 12 strains of Prochlorococcus marinus revealed conserved core genes, sets of genes that are ubiquitous in and exclusive to high-light- and low-light-adapted subsets of the group, plus accessory genes specific to one or a few strains (Kettler et al. 2007); strain-restricted functions include viral defence and transport of toxins. Association of adaptive genes by ecotype or lifestyle is compelling, but candidate virulence genes in Neisseria (Schoen et al. 2008) and Streptococcus (McMillan et al. 2006) do not preferentially associate with virulent strains, suggesting that sequence variation can be important. Eppinger et al. (2004) identified a ‘flexible gene pool’ for four Campylobacter species comprising plasmids, phage elements, mobile DNA, genomic islets and DNA-restriction/modification systems. Elucidating pan-genome function and possible modularity of different components of the pan-genome will be difficult, particularly since many non-core genes are conserved hypotheticals and ORFans. Functional predictions will need to be complemented with mutagenesis and gene deletion experiments, to assess the contribution of different pan-genomic elements to organism and community function.
(e) Can we reconstruct ancestral physiologies?
Dagan & Martin (2007) used a parsimony approach to infer ancestral genome sizes under different frequencies of LGT, observing that an average of 1.1 LGT events per gene family yielded ancestral genomes of a similar size distribution to modern ones. However, if larger genomes with generalist lifestyles are more likely to produce descendants that persist in the long term, then ancestral genomes may have been larger, on average, than modern ones and LGT frequencies of 0.5–1.0 per gene family may not be unrealistic. This in turn might be counterbalanced to some extent if ancient enzymes were, on average, more broadly versatile than their modern descendants (Cairns-Smith & Walker 1974; Yčas 1974).
Beyond the question of ancestral genome sizes, ancestral reconstructions of gene content and metabolic potential are useful in understanding the evolutionary pathways followed by organisms and inferring the features of ancient habitats. Parsimony analyses implicitly generate ancestral genomes at the internal nodes of a tree or network (Ouzounis 2005). Using reference genome trees and phylogenetic profiles, Shi & Falkowski (2008) extrapolated that the last common ancestor of cyanobacteria was incapable of nitrogen fixation and was likely thermophilic. Core informational and photosynthetic functions were probably present in this ancestor, while some peripheral components of the photosynthetic apparatus may have been absent. Mapping core genes into an inferred common ancestor is parsimonious, but like present-day prokaryotes this ancestor presumably contained variable-shell genes that have not persisted. In general, the accuracy with which we can reconstruct an ancestral genome will be affected by the distribution of duplication, loss and LGT events as well as by the completeness of descendant taxon sampling. Biased gene loss or gene uptake (e.g. niche specialization or colonization of new habitats) is likely to confound reconstruction: it is unlikely that the genome of the last free-living ancestor of endosymbionts such as Buchnera and Wigglesworthia can be inferred from present-day symbiont genomes. Reconstructed ancestral genomes must also be coherent biochemically and energetically: a patchwork of fragmentary pathways is unlikely to be accurate. Finally, the underlying notion of an ancestral genome loses meaning if LGT has been sufficiently frequent or the inferred ancestor sufficiently ancient: in such cases, the modern genome will represent an amalgam of multiple ancestral lineages, perhaps without preference for the genome of the cellular ancestor.
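To illustrate how parsimony assigns gene presence or absence to internal nodes, the sketch below applies the bottom-up pass of Fitch parsimony to a single gene family on a fixed binary tree. This is a generic textbook algorithm offered as an example, not the specific method of any study cited above; the tree, taxon names and presence/absence patterns are hypothetical, and weighting gains and losses equally is itself an assumption.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, leaf_state: Dict[str, int]) -> Tuple[Set[int], int]:
    """Bottom-up Fitch pass for one gene family (1 = present, 0 = absent).

    Returns the set of equally parsimonious states at the root of `tree`
    and the minimum number of state changes (gains or losses) required.
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, leaf_state)
    right_states, right_cost = fitch(right, leaf_state)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical five-taxon tree and two presence/absence patterns.
tree: Tree = ((("t1", "t2"), "t3"), ("t4", "t5"))
gene_x = {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 0}  # missing in one taxon
gene_y = {"t1": 0, "t2": 1, "t3": 0, "t4": 1, "t5": 0}  # patchy distribution

root_x = fitch(tree, gene_x)
root_y = fitch(tree, gene_y)
print(root_x, root_y)
```

Here gene_x is inferred as present in the ancestor with a single loss, while the patchy gene_y is inferred as absent with two independent gains, which could correspond to LGT. Parsimony alone cannot distinguish that scenario from ancestral presence followed by several losses if losses are cheaper than gains, which is one reason such reconstructions depend on the assumed LGT frequency.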
Great progress has been made in understanding LGT as a lifestyle for populations and species, but key issues—including those identified above—remain open. We offer this brief overview in hopes of stimulating the formulation of hypotheses that can be informed by new data, are susceptible to experiment or can be sharpened by mathematical modelling. A deeper understanding of the dynamics of genome evolution, including LGT, will inform scientific and public debate on a number of critical issues including biodiversity, global climate change and the engineering of biological systems.
We acknowledge the support of the Australian Research Council (grant CE0348221) to M.A.R. and of the Canada Research Chairs program, Genome Atlantic and the Natural Sciences and Engineering Research Council to R.G.B.
One contribution of 11 to a Theme Issue ‘The network of life: genome beginnings and evolution’.
© 2009 The Royal Society