Royal Society Publishing

Beyond linear sequence comparisons: the use of genome-level characters for phylogenetic reconstruction

Jeffrey L Boore, Susan I Fuerstenberg

Abstract

The first whole genomes to be compared for phylogenetic inference were those of mitochondria, which provided the first sets of genome-level characters for phylogenetic reconstruction. Most powerful among these characters has been the comparison of the relative arrangements of genes, which has convincingly resolved numerous branch points, including those that had remained recalcitrant even to very large molecular sequence comparisons. Now the world faces a tsunami of complete nuclear genome sequences. In addition to the tremendous amount of DNA sequence that is becoming available for comparison, there is also the potential for many more genome-level characters to be developed, including the relative positions of introns, the domain structures of proteins, gene family membership, the presence of particular biochemical pathways, aspects of DNA replication or transcription, and many others. These characters can be especially convincing owing to their low likelihood of reverting to a primitive condition or occurring independently in separate lineages, thereby reducing the occurrence of homoplasy. The comparisons of organelle genomes pioneered the use of such features for phylogenetic reconstruction and, as ever more genomic sequence becomes available, genome-level characters will almost certainly play a major role in outlining the relationships among major animal groups.

1. Why do we need anything other than molecular sequence comparisons?

Over the past few decades, the comparison of nucleotide and amino acid sequences has revolutionized our understanding of the evolutionary relationships for many groups of organisms. The broader field of systematics has been reinvigorated and a generation of evolutionary biologists has come to accept that molecular sequence comparisons are an essential component for inferring the phylogeny of any group. These studies have led to extensive revision of animal systematics and to the overturning of the previous reliance on features of the coelom and segmentation (Adoutte et al. 1999).

In the 1980s, when comparing molecular sequences for phylogenetic inference was first becoming common, some asserted with great confidence that all evolutionary relationships would soon be convincingly resolved solely with this type of data, leading to much consternation. However, some of the relationships that were equivocal in early molecular studies have remained highly recalcitrant even with much more DNA sequence data in hand. There are several potential explanations, including: (i) multiple nucleotide or amino acid substitutions may have occurred at a single site, obscuring any accumulated signal; (ii) convergent or parallel substitutions may have occurred among different lineages because there are only 4 (for nucleotides) or 20 (for amino acids) possible character states, exacerbated by convergent biases in base composition (Naylor & Brown 1998), which may even cause ever-increasing confidence measures for incorrect associations with ever larger datasets (Phillips et al. 2004); (iii) the analysis may show artefactual association of the more rapidly changing lineages (Felsenstein 1978), including the attraction of long branches to the base of the in-group in association with the out-group (which is almost always a long branch; Philippe & Laurent 1998); (iv) in some cases, non-orthologous gene copies may be inadvertently compared among various lineages due to ancestral gene duplications followed by differential losses, or due to incomplete sampling; (v) differing views of scientists on alignments, exclusion sets and weighting schemes frequently cannot be arbitrated based on objective criteria and can lead to radically different phylogenetic reconstructions and (vi) the most difficult problems arise when the time of shared ancestry is short relative to the subsequent time of divergence, so that there has been little opportunity to accumulate signal and ample time for it to have been erased.

Molecular sequence comparison is now a mature field that has influenced the culture of systematics. Many have come to expect that the future of systematics will be dominated by creating ever more sophisticated methods for teasing a weak signal from noisy data. This causes concern that differing preferences for various methods will ensure that no consensus on many evolutionary relationships will ever be reached.

However, an alternative is possible: there may be other, less explored types of characters that could be powerful for resolving these contentious relationships. There is no doubt that comparisons of some characters have identified certain robust synapomorphies (shared and derived character states) that have supported long-standing, little contested evolutionary relationships, such as the monophyly of mammals, tetrapods and echinoderms. These synapomorphies are subjectively judged to be characters so unlikely to revert to an earlier condition or to occur multiple times in parallel that they could only have arisen once in the common ancestor of the group. Can new sets of characters be found that would meet these criteria to provide confident resolution of some problematic evolutionary relationships? Although there is a broad range of character types to explore, we focus here specifically on the comparison of features of genomes.

2. Comparisons of mitochondrial genomes have laid the foundation

The sequences from mitochondrial genes and genomes have been used extensively for phylogenetic inference, with complete mtDNA sequences being publicly available for more than 1000 animal species. (For a summary of the characteristics of animal mtDNAs, see Boore (1999).) It has long been argued (e.g. Boore & Brown 1998) that the relative arrangement of the (normally) 37 genes in animal mitochondrial genomes is an especially powerful type of character for phylogenetic inference, and it constitutes the first set of genome-level features to be used extensively for animal phylogeny. Briefly summarized, these genes are present in nearly all animal groups, are unambiguously homologous and can potentially be rearranged into an enormous number of states such that convergent rearrangements are very unlikely (and have been demonstrated to be uncommon). In the cases where it has been studied, all genes on each strand are transcribed together (Clayton 1992), so selection on gene arrangements is expected to be minimal. A summary of the evolutionary relationships convincingly demonstrated by this type of data (and in many cases left unresolved by all other studies) can be found in Boore (2006), but here are a few of the more significant conclusions of deep-branch phylogenetic relationships: (i) the superphylum Eutrochozoa includes cestode platyhelminths (von Nickisch-Rosenegk et al. 2001) and the phylum Phoronida (Helfenbein & Boore 2004); (ii) Sipuncula is closely related to Annelida rather than to Mollusca (Boore & Staton 2002); (iii) Annelida is more closely related to Mollusca than to Arthropoda (Boore & Brown 2000); (iv) Arthropoda is monophyletic and, within this phylum, Crustacea is united with Hexapoda to the exclusion of Myriapoda and Onychophora (Boore et al. 1995, 1998) and (v) Pentastomida is not a phylum, but rather a type of crustacean, and joins with Cephalocarida and Maxillopoda to the exclusion of other major crustacean groups (Lavrov et al. 2004).
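As a rough illustration of why convergent gene arrangements are expected to be rare, the number of possible arrangements can be counted under simplifying assumptions that are ours and not taken from the studies cited: a single circular molecule, every gene able to occupy any position on either strand, and two descriptions treated as identical if they differ only in the starting point or in which strand is read. The function name below is hypothetical, and the result is an upper bound on the state space, not a statement about which arrangements actually occur.

```python
import math


def count_signed_circular_arrangements(n_genes: int) -> int:
    """Count arrangements of n_genes distinct genes on one circular,
    double-stranded molecule (illustrative assumption: any gene may sit
    at any position on either strand)."""
    if n_genes < 1:
        raise ValueError("n_genes must be at least 1")
    # n! gene orders times 2**n strand assignments, divided by n (choice of
    # starting point on the circle) and by 2 (reading the opposite strand
    # describes the same molecule): n! * 2**n / (2 * n).
    return math.factorial(n_genes - 1) * 2 ** (n_genes - 1)


n = 37  # the usual gene count of animal mtDNA, as stated in the text
total = count_signed_circular_arrangements(n)
print(f"{n} genes: about {total:.2e} possible arrangements")
```

Even this crude count shows that two lineages arriving independently at the same complete arrangement by chance would be extraordinarily unlikely, although real rearrangement mechanisms do not sample these states uniformly.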

3. Nuclear genomes, a treasure trove of phylogenetic characters

By a great margin, more DNA sequence is being generated than ever before. The facilities built and the techniques developed for sequencing the human genome are now focusing on many other organisms. The nine largest genome sequencing centres (table 1) collectively can now produce well over 170 billion nucleotides of DNA sequence per year, which would be approximately 57-fold coverage of the human genome. Imminently, there will be complete genomes of at least draft quality for many dozens of animals representing a phylogenetically diverse sample and including several equivocally placed lineages (figure 1; table 2).
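The coverage figure can be checked with simple arithmetic. The sketch below assumes a haploid human genome of roughly 3 billion nucleotides; that value is our assumption, since the text gives only the annual output and the resulting fold coverage.

```python
def fold_coverage(bases_sequenced: float, genome_size: float) -> float:
    """Average number of times each position of a genome is covered."""
    return bases_sequenced / genome_size


ANNUAL_OUTPUT = 170e9      # nucleotides per year, from the text
HUMAN_GENOME_SIZE = 3.0e9  # assumed haploid genome size (not from the text)

coverage = fold_coverage(ANNUAL_OUTPUT, HUMAN_GENOME_SIZE)
print(f"approximately {coverage:.0f}-fold coverage")
```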

Table 1

URLs for the largest public DNA sequencing centres

Figure 1

This reconstruction of the major branches of animal evolution is used to plot the numbers of taxa with complete genome sequences done and underway. The taxonomic ranks shown are arbitrary, split for illustration, but not meant to be consistent among the major groups, and the taxa listed do not comprehensively cover all of life. Branch lengths hold no meaning. While opinions may differ on particular genomes as to whether they are complete versus needing more work, and whether they are well enough along to consider them ‘underway’, it is clear that soon there will be a large and phylogenetically broad sampling of genome sequences.

Table 2

Complete nuclear genome sequencing projects done and underway as summarized in figure 1. (Asterisks indicate genomes currently funded to only low coverage.)

In these genomic data are many higher-order features, beyond the linear sequences, that constitute genome-level characters that are potentially useful for phylogenetic reconstruction, including: (i) gene content, including components of multiunit complexes such as the ribosome, spliceosome, DNA replication machinery, or oxidative phosphorylation enzymes and the presence versus the absence of particular biochemical pathways (e.g. de Rosa et al. 1999; Fitz-Gibbon & House 1999; Snel et al. 1999, 2005; House & Fitz-Gibbon 2002; Huson & Steel 2004); (ii) the relative arrangements of genes (Boore & Brown 1998); (iii) movements of genes among intracellular compartments (i.e. plastid, mitochondrion, nucleus; e.g. Nugent & Palmer 1991); (iv) insertions of segments of DNA, including transposons and numts (Fukuda et al. 1985; Richly & Leister 2004); (v) variation in intron positions (e.g. Qiu et al. 1998); (vi) secondary structures of rRNAs or tRNAs (e.g. Murrell et al. 2003); (vii) details of genome-level processes, such as the rearrangements that generate antibody diversity (Frieder et al. 2006) and (viii) deviations from the ‘universal’ genetic code (Telford et al. 2000; Santos et al. 2004). Many others are likely to be found.
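One way such features might be prepared for analysis is as a presence/absence matrix, from which the potentially informative characters can be picked out. The sketch below is a minimal illustration and not a method taken from the studies cited: the taxon and character names are hypothetical placeholders, the data are invented purely to exercise the code, and the filter applies the standard parsimony criterion for binary characters (each state must occur in at least two taxa).

```python
from typing import Dict, List

# Hypothetical toy matrix: 1 = feature present, 0 = feature absent.
TOY_MATRIX: Dict[str, Dict[str, int]] = {
    "taxon_A": {"gene_1": 1, "gene_2": 1, "gene_3": 1, "gene_4": 0},
    "taxon_B": {"gene_1": 1, "gene_2": 1, "gene_3": 0, "gene_4": 0},
    "taxon_C": {"gene_1": 1, "gene_2": 0, "gene_3": 0, "gene_4": 0},
    "taxon_D": {"gene_1": 1, "gene_2": 0, "gene_3": 0, "gene_4": 0},
}


def parsimony_informative(matrix: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the binary characters whose two states each occur in at
    least two taxa. Assumes every taxon is scored for every character."""
    characters = sorted({c for states in matrix.values() for c in states})
    informative = []
    for character in characters:
        present = sum(states[character] for states in matrix.values())
        absent = len(matrix) - present
        if present >= 2 and absent >= 2:
            informative.append(character)
    return informative


print(parsimony_informative(TOY_MATRIX))
```

In this toy matrix only gene_2 groups taxa; gene_1 is constant, gene_3 is unique to one taxon and gene_4 is absent everywhere. Deciding which state is derived still requires out-group comparison, which this sketch does not attempt.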

Of course, the reliability of these features can only be assessed by the study of their consistency with other characters, and several are already suspect. For example, convergent gene losses may be common as organisms independently evolve smaller genomes or no longer experience selection for maintaining a particular biochemical pathway; in contrast, convergent gain of genes seems much less likely. Independent evolution of smaller genomes may also lead to parallel losses of the most expendable structures in the RNA or protein genes. There is a certain time horizon that limits the usefulness of any particular type of character; for example, once retroelements degrade in the sequence beyond the point where the insertion can be reliably inferred to be of single origin, the insertion is no longer useful as a phylogenetic character. Certain changes in the genetic code and in the tRNA secondary structures of mitochondria are known to have occurred convergently (although occasional homoplasy has not disqualified the use of either morphological characters or molecular sequence comparisons). There is also a problem in the case of closely spaced sequential internodes where random partitioning of polymorphisms, including those of genome-level characters, can lead to incorrect inference of phylogeny (e.g. Salem et al. 2003; see Boore (2006) for additional caveats and precautions).

Already there have been important insights gained from comparing such features, including: (i) tarsiers have been shown to be the sister group to the clade of monkeys and apes rather than the prosimians based on the patterns of SINE element integration (Schmitz et al. 2001); (ii) patterns of SINE and LINE insertions have also supported the monophyly of toothed plus baleen whales, the position of hippopotamuses as the sister group to cetaceans, the position of camels as the most basal cetartiodactyls (Nikaido et al. 1999), and the paraphyly of river dolphins (Nikaido et al. 2001); (iii) animal interphylum relationships have been clarified by the comparisons of the gene membership within Hox clusters (de Rosa et al. 1999) and (iv) a study of the presence of spliceosomal introns supports the monophyly of Actinopterygia and clarifies several relationships within the group, including the basal position of bichirs (Venkatesh et al. 1999). For further discussion, see Murphy et al. (2004), Okada et al. (2004) and Boore (2006).

4. What are the advantages of using these genome-level characters?

In general, these types of features would be expected to change in a saltatory, non-clocklike manner. This may seem, at first, to be wrong-headed, since great effort has been expended in many studies to identify clocklike characters, to enable accurate molecular clock estimates of time of divergence. But it is this aspect that makes these genome-level characters especially useful for addressing the most difficult branch points, those with a short time of shared history followed by a long period of divergence, as mentioned above. It is for resolving these relationships that clocklike behaviour guarantees failure, since the ratio of signal to noise will closely match the ratio of the two time periods. Rather, it is the least clocklike characters that are expected to prevail, where an occasional and abrupt change may have occurred and then remained (figure 2). Admittedly, the concomitant disadvantage is that, typically, many such characters must be examined in order to find those that happened to have changed during the period of shared ancestry and so mark the relationship (see Boore (2006) for further analysis and discussion).

Figure 2

Illustration of why clocklike characters (a) may be less informative than non-clocklike characters (b) when the internode between the subsequent lineage splits is short. Each of the four shapes is meant to be a character with states indicated by patterning. In (a), the circles and triangles are not informative and the squares and pentagons are homoplasious. The two changes accumulated in the common ancestor of taxa 1 and 2 (for the pentagons and circles), which were at one point synapomorphies, have been erased by the subsequent changes. In (b), the changes are rarer and saltatory. The pentagons and triangles are not informative and the circles are constant, but the squares are informative for uniting taxa 1 and 2.
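The erasure half of this argument can be made concrete with a toy calculation. The sketch below is our own simplification and not a model from the sources cited: it assumes a two-state character that flips at a constant rate, and asks how likely it is that a state gained on the internode is still visible in both descendant taxa after a long terminal branch. It does not model saltatory change itself; it only shows how quickly a frequently changing character loses an earlier synapomorphy. The rates and branch length are arbitrary illustrative values.

```python
import math


def net_change_probability(rate: float, time: float) -> float:
    """Probability that a symmetric two-state character ends in a different
    state from the one it started in, given flips at `rate` per unit time."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * time))


def synapomorphy_retention(rate: float, t_terminal: float) -> float:
    """Probability that a state gained on the internode is unchanged at the
    tips of both descendant branches, each of length t_terminal."""
    keep = 1.0 - net_change_probability(rate, t_terminal)
    return keep * keep


# Arbitrary illustrative values: one long terminal branch length, four rates.
for rate in (0.001, 0.01, 0.1, 1.0):
    retained = synapomorphy_retention(rate, t_terminal=100.0)
    print(f"rate {rate}: retention probability {retained:.3f}")
```

Under these assumptions retention falls towards 0.25, the value expected when both tips are effectively random, as the rate rises, while a rarely changing character keeps most of whatever signal it acquired.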

5. What about clades without representative genome sequences?

This enormous dataset provides a new class of characters that could lead to definitive resolution of some branches of the tree of life, not only for these taxa but also for others where targeted study for identified characters could be fruitful. As shown in figure 1, whole-genome sampling will include many major lineages, but not all. It seems unlikely that a whole-genome sequence of a gastrotrich or a loriciferan, for example, will be available soon. Fortunately, we can use the genomes in hand to identify sets of genome-level characters that can be diagnostic for the relationships of related groups without genome projects. One could, for example, then determine the gene order using Southern hybridization or probe a large DNA insert library (i.e. in BAC or fosmid vectors) to find a clone to sequence for the region of interest of the genome. Gene rearrangements, losses and duplications can also be identified using comparative genomic hybridization (CGH) chips with tiled large-insert clones, as has been done for a sampling of diverse human populations (Sharp et al. 2005) and more broadly across the great apes (Locke et al. 2003), or using arrays of oligonucleotides (representational oligonucleotide microarray analysis, ROMA; Sebat et al. 2004).

6. What are the main challenges that are before us?

First, we must increase the representation of the understudied groups of animals for large-scale genomic sequencing. There is no reason to believe that taxa that have been traditionally studied intensively, i.e. those with higher species richness, greater breadth of niche occupation, more important roles in pathogenesis or amenability to laboratory experimentation, will be more informative towards the goals of understanding broad patterns of the evolution of animals and their genomes. Second, we need a codification of nomenclature for the genes that is based on the assessment of orthology (Dehal & Boore 2006). The renaming of genes to indicate orthology is not feasible because it would render large bodies of literature difficult to interpret and because scientists who study the model organisms, and who have largely done the naming, are invested in their parochial nomenclature. Thus, the solution must be a lexicon superimposed on the names already in place. Third, a system must be devised for codifying the genome-level characters themselves for entry into the databases and matrices for broad comparisons. Finally, we need the community to devise standards of interpretation and analysis, such as the use of cladistic reasoning rather than associating taxa by similarity alone (Boore 2006). Then, it seems probable that the genome-level characters will provide the best dataset for convincingly reconstructing relationships for some of the most hotly contended nodes in the tree of life and establishing a framework for all organismal relationships.

Footnotes

  • One contribution of 17 to a Discussion Meeting Issue ‘Evolution of the animals: a Linnean tercentenary celebration’.
