The origin of the eukaryotic cell is considered one of the major evolutionary transitions in the history of life. Current evidence strongly supports a scenario of eukaryotic origin in which two prokaryotes, an archaebacterial host and an α-proteobacterium (the free-living ancestor of the mitochondrion), entered a stable symbiotic relationship. The establishment of this relationship was associated with a process of chimerization, whereby a large number of genes from the α-proteobacterial symbiont were transferred to the host nucleus. A general framework allowing the conceptualization of eukaryogenesis from a genomic perspective has long been lacking. Recent studies suggest that the origins of several archaebacterial phyla were coincident with massive imports of eubacterial genes. Although this does not indicate that these phyla originated through the same process that led to the origin of Eukaryota, it suggests that Archaebacteria might have had a general propensity to integrate into their genomes large amounts of eubacterial DNA. We suggest that this propensity provides a framework in which eukaryogenesis can be understood and studied in the light of archaebacterial ecology. We applied a recently developed supertree method to a genomic dataset composed of 392 eubacterial and 51 archaebacterial genera to test whether large numbers of genes flowing from Eubacteria are indeed coincident with the origin of major archaebacterial clades. In addition, we identified two potential large-scale transfers of uncertain directionality at the base of the archaebacterial tree. Our results are consistent with previous findings and seem to indicate that eubacterial gene imports (particularly from δ-Proteobacteria, Clostridia and Actinobacteria) were an important factor in archaebacterial history. 
Archaebacteria seem to have long relied on Eubacteria as a source of genetic diversity, and while the precise mechanism that allowed these imports is unknown, we suggest that our results support the view that processes comparable to those through which eukaryotes emerged might have been common in archaebacterial history.
Over the past 100 years, a multitude of hypotheses have been proposed to explain eukaryogenesis. These hypotheses can be considered as variants of two main models, autogenous and symbiotic. The autogenous model proposes that all eukaryotic membranes and their membrane-bound organelles (mitochondria and plastids) evolved through a process of compartmentalization and infolding of plasma membranes [1,2]. However, the results of empirical studies based on phylogenetics [3–9], cell biology [10,11], bioenergetics , as well as considerations of the Archaean fossil record [13,14] and the absence of primitively amitochondriate eukaryotes  overwhelmingly support a symbiotic origin, where the mitochondria and the plastids are the descendants of free-living organisms, and did not evolve autogenously (reviewed in ). Of the many symbiotic scenarios that have been proposed (e.g. [16–20]; see  for a recent review), current evidence favours a single endosymbiotic event in which the ancestor of the mitochondrion (an α-proteobacterium) and the host cell (an archaebacterium) merged to become the first eukaryote. This hypothesis, generally referred to as the ‘ring of life’ hypothesis , has its roots in the eocyte hypothesis that was first introduced by Lake [3,22], who defined the unknown archaebacterial sister group of the eukaryotes as the ‘eocyte’. Initially, phylogenetic analyses suggested that the eocyte was most likely the sister group of the Crenarchaeota . However, the most recent and sophisticated studies carried out to address this problem point towards the Thaumarchaeota, Aigarchaeota, Crenarchaeota, Korarchaeota group  as the most likely closest relative of the eocyte [8,9,24]. Under this well-supported scenario, the emergence of the first eukaryote must have post-dated the origins and initial radiations of both the α-Proteobacteria and the Archaebacteria. 
As a consequence, and despite the radically different cellular organization and subsequent ecological success of the eukaryotes, Eukaryota is younger than Archaebacteria and Eubacteria, and thus it cannot have been one of the primary lineages of life [8,14].
The symbiotic hypothesis for the origin of the eukaryotes implies that at least one extinct archaebacterium (the eocyte) had phagocytic abilities and could integrate the genome of another prokaryote to establish a stable symbiotic relationship. The greatest perceived weakness of the symbiotic hypothesis is that the ability to engulf another prokaryote is unknown in Archaebacteria and has only been reported in a eubacterium (a β-proteobacterium [24,25]). Thus, the symbiotic theory has sometimes been referred to as the ‘fateful encounter’ hypothesis  because it seems to depend upon a rare and improbable event. Here, we ask whether there is evidence for an alternative view that ancestral Archaebacteria could have been broadly capable of engaging in processes of phagocytosis, cell fusions and foreign-genome integration, all of which were likely prerequisites to the establishment of stable symbiosis.
2. The eocyte had the potential to enter into a relationship of symbiosis
Studies of archaebacterial genomes have recently demonstrated the presence of actin-like proteins in Archaebacteria [27–29]. These proteins are related to those found in Eukaryota and could have allowed ancestral Archaebacteria to create branched filamentous structures and networks that could have facilitated particle engulfment [26,27,30]. An argument that was frequently used against an eocyte ancestry of the eukaryotes is that archaebacterial membranes use glycerol-1-phosphate lipids, while eukaryotic and eubacterial membranes use glycerol-3-phosphate lipids, and that the evolution of eukaryotic membranes through intermediates composed of both lipids would have been ‘selectively disfavoured’ . Yet, recent experiments have shown that heterochiral hybrid membranes consisting of a mixture of glycerol-1- and glycerol-3-phosphate lipids can be synthesized and are stable . Archaebacteria with eubacterial ectosymbionts have been discovered , and more recently, it has been shown that at least some archaebacterial species (Haloferax volcanii and H. mediterranei) can engage in processes of cell fusion that have as a consequence the generation of recombinant heterodiploid chromosomes [34,35]. Lastly, the Lokiarchaeota, an archaebacterial phylum with sophisticated membrane remodelling capabilities and possessing a multitude of proteins that in eukaryotes are involved in phagocytosis, has recently been discovered . Overall, this evidence suggests that, in principle, ancient Archaebacteria could have been capable of engulfing other prokaryotic cells, establishing stable symbiotic relationships with them, and integrating the foreign genomes with their own. What is unclear is how frequently Archaebacteria were involved in the above-mentioned processes. If these processes were frequent, eukaryogenesis would have just been an accident waiting to happen: a consequence of archaebacterial ecology.
3. Evidence for ancient gene flows and genome chimerization in Archaebacteria
Nelson-Sathi et al. [36,37] recently presented results suggesting that the emergence of several extant archaebacterial lineages correlates with several large inflows of genes acquired through massive, horizontal gene transfers (HGTs) from eubacterial donors (i.e. imports). These imports (from Eubacteria to specific archaebacterial ancestors) were massive and may constitute signatures of ancient chromosomal recombination events. In the case of the Haloarchaea, Nelson-Sathi et al. [36,37] concluded that these eubacterial genes were mainly imported from Actinobacteria. However, for other archaebacterial groups (Thermoproteales, Desulfurococcales, Methanobacteriales, Methanococcales and Methanosarcinales), the origins of which seem to have been preceded by extensive imports of eubacterial genes , a specific donor lineage could not be defined. This might be because HGT-based prokaryotic recombination, as opposed to sex-based eukaryotic recombination, leads to chimeric pangenomes where individual genes frequently have different phylogenetic histories. That is, the eubacterial partners in these putative, ancient, hybridization events would have been chimerical organisms to start with [38,39].
If the results of Nelson-Sathi and co-workers could be confirmed, their impact on our understanding of eukaryogenesis would be dramatic, as we should conclude that large-scale gene flows from the Eubacteria were common in archaebacterial history. This would provide a general framework for understanding eukaryogenesis in the context of archaebacterial ecology. Accordingly, while eukaryogenesis will still be a momentous singular event in the history of life, we would now be able to understand and explain it as a consequence of archaebacterial ecology.
4. Using supertrees to test hypotheses of symbiogenesis and large-scale gene flows
Supertree methods are general tools that can be used to amalgamate trees on overlapping leaf sets, with the standard consensus methods, e.g. the majority-rule consensus method , representing special cases where all the input trees have the same leaf set .
Supertree methods can be used in genomics to combine partially overlapping gene trees to make inferences about the species phylogeny and/or to investigate patterns of congruence and incongruence between realized supertrees and specific sets of gene trees. The latter has been used previously to test hypotheses of eukaryogenesis and to demonstrate the chimeric nature of eukaryotic genomes . Using this approach, Pisani and co-workers were able to find genome-wide evidence for evolutionary relationships between chloroplasts and the Cyanobacteria, mitochondria and the α-Proteobacteria, and the eukaryotic nucleus and the Archaebacteria. Pisani et al.  also built a supertree including only Archaebacteria and Eubacteria and found no support for chimerism in archaebacterial genomes. Instead, they found maximal bootstrap support for the separation of Eubacteria and Archaebacteria. These results are incompatible with those of Nelson-Sathi et al. [36,37], which predict that supertree analyses would partition the Archaebacteria into multiple groups scattered across the Eubacteria. However, the work of Pisani et al.  had limitations: it used a much smaller number of genomes than those available to Nelson-Sathi et al.  and relied upon a parsimony-based supertree method with undesirable properties [42–46].
Akanni et al.  recently implemented and tested a new Bayesian supertree method based on Steel & Rodrigo's  maximum-likelihood (ML) supertree computation. Here, we have improved our supertree implementation by correcting the likelihood calculations following the results of Bryant & Steel . This new supertree method was implemented in the phylogenetic package P4  and here we use this method to test Nelson-Sathi et al.'s  results with an independent methodological approach and a different dataset composed of 392 eubacterial and 51 archaebacterial taxa.
5. Material and methods
(a) Defining the dataset
A dataset composed of 834 genomes (including multiple species per genus and in some cases multiple strains per species—and representing all prokaryotic taxa for which a complete genome was available in the NCBI database in early 2013) was assembled and distilled into a dataset of 392 eubacterial and 51 archaebacterial genera (see §6 for details). We deem this dataset large enough to allow a robust test of the results of Nelson-Sathi et al.  while maintaining tractability within the context of a supertree-based phylogenomic analysis. Supertrees were generated using three datasets. The first dataset, hereafter referred to as PROK, is composed of gene trees derived from gene families assembled from the complete set of 51 archaebacterial and 392 eubacterial genera. The second dataset was derived by pruning all of the archaebacterial sequences from the gene trees in PROK. This dataset includes only sequences from the 392 considered eubacterial genera and was named EUBAC. The third dataset was generated by pruning all eubacterial sequences from PROK; it was named ARC, and contains only sequences for the 51 considered Archaebacteria. If Nelson-Sathi et al.  are correct, the PROK supertree should not recover a monophyletic Archaebacteria. Instead, various archaebacterial clades should be scattered across Eubacteria because of the strongly asymmetrical pattern of gene imports (from Eubacteria to Archaebacteria) underpinning the origin of multiple archaebacterial clades . Furthermore, because Archaebacteria to Archaebacteria HGTs do not seem to have significantly impacted archaebacterial evolution , we would expect the emergence of vertical signal in ARC, leading to the recovery of a generally accepted archaebacterial tree of life, with the traditionally recognized archaebacterial phyla and superphyla well supported and arranged as in trees derived from the analysis of ribosomal proteins only [51–55].
If Nelson-Sathi et al.  are incorrect, the analysis of PROK should recover a well-supported monophyletic Archaebacteria emerging as the sister group of an equally well-supported monophyletic Eubacteria. It should be noted that even in this case a tree broadly consistent with the generally accepted archaebacterial tree of life should emerge from the analysis of ARC. Accordingly, the ARC supertree will be used as a benchmark to confirm that our novel supertree implementation performs well. Recovering a scrambled ARC phylogeny should warn us that our software might contain errors, or that the method we implemented has inherent biases or weaknesses.
Finally, inspection of the EUBAC supertree should inform us about the extent to which large-scale, directional, Eubacteria to Eubacteria transfers affected eubacterial evolution. If such events were irrelevant in eubacterial evolution, EUBAC would be expected to return a tree with a topology consistent with that of the generally accepted eubacterial tree of life (e.g. ). On the contrary, if these events were important in eubacterial evolution, eubacterial clades would be scrambled and directionality of transfers (something we do not investigate here) could be investigated through the interpretation of proximity relationships in the EUBAC supertree.
(b) Data acquisition and processing
All prokaryotic proteomes available from the NCBI database in early 2013 (a total of 834 including multiple species across genera and in some cases multiple strains) were downloaded and merged into the PROK database (which included 2 727 153 protein sequences). An all-versus-all BLAST search was performed (with an e-value cut-off of 10 × 10⁻⁸) using BLAST 2.2.19 . Homologous protein families, tribes sensu , were then identified using the Markov Cluster algorithm, MCL . The MCL analysis of PROK (granularity parameter = 1.4) returned 386 576 gene families of which 82 844 included four or more sequences. Families including fewer than four sequences were discarded, as they are not amenable to phylogenetic analysis. The 82 844 retained gene families comprised 47 725 single gene families (scoring only orthologues, if one assumes no hidden paralogy) and 35 119 multi-gene families (including both orthologues and paralogues).
Examination of the MCL families showed that some of the 35 119 multi-gene families obtained from the MCL analyses included many paralogy groups, which could have been split into orthology sets and used for supertree reconstruction. We further partitioned these multi-gene families using the ‘Randomblast’ algorithm  implemented using a Perl script written by James Cotton (Wellcome Trust Sanger Institute). The Randomblast algorithm works by iteratively choosing a sequence randomly, blasting it against all the other sequences and removing those with a significant hit until all sequences are removed. This method has previously been shown to work well for defining sets of orthologues for supertree reconstruction , as it efficiently breaks multi-gene families into their paralogy groups. In the Randomblast analysis, smaller e-values will break each multi-gene family into progressively more numerous families of progressively more closely related taxa. An e-value that is too small would generate very small sets of orthologues that would not be adequate to reconstruct prokaryotic supertrees. Alternative e-values were tested and an e-value of 10 × 10⁻¹⁶ was deemed suitable for this specific dataset. It partitioned the 35 119 multi-gene families that were generated using MCL into 69 070 families, of which 4734 were single gene families including more than four species. These 4734 single gene families were added to the 47 725 single gene families from the MCL analysis to generate a total of 52 459 single gene families.
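The iterative partitioning described above can be illustrated with a minimal Python sketch. It is not the Perl script used in the study: the function name `randomblast_partition` and the `is_significant_hit` callback (which stands in for a BLAST search at the chosen e-value cut-off) are hypothetical, and sequence identifiers are assumed to be unique.

```python
import random


def randomblast_partition(sequences, is_significant_hit, rng=None):
    """Split a multi-gene family into groups by iterative random seeding.

    `is_significant_hit(seed, other)` is a hypothetical stand-in for a BLAST
    comparison at a fixed e-value cut-off.
    """
    rng = rng or random.Random()
    remaining = list(sequences)
    groups = []
    while remaining:
        # Pick a random seed sequence from those not yet assigned.
        seed = rng.choice(remaining)
        # The seed joins its own group; others join on a significant hit.
        group = [s for s in remaining
                 if s == seed or is_significant_hit(seed, s)]
        groups.append(group)
        # Remove the whole group and repeat until nothing is left.
        members = set(group)
        remaining = [s for s in remaining if s not in members]
    return groups
```

With a stricter `is_significant_hit` (the analogue of a smaller e-value), each pass removes fewer sequences, so the family is broken into more numerous and smaller groups, as noted in the text.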
(c) Building gene trees
To infer gene trees, all 52 459 single gene families were aligned with PRANK . The multiple sequence alignments were curated with Gblocks  using the following parameters: allow gaps in all positions; maximum number of contiguous non-conserved positions = 15 and minimum length of a block = 8. After the Gblocks step, all gene families that were composed of fewer than 100 amino acid positions were discarded as likely to be too short to allow the generation of reliable phylogenetic trees (see also ). Absence of putative phylogenetic signal in the data was tested in the remaining gene families using the permutation tail probability (PTP) test [62,63], significance level p = 0.05—as implemented in PAUP v. 4b10 . All supertree analyses were run at the genus level by retaining only one species of each included genus in each gene family, resulting in a reduction in the number of considered taxa to 392 eubacterial genera and 51 archaebacterial genera (see also §5a). If more than one species belonging to the same genus was present in a given gene family, the retained species was randomly selected. This makes the strong assumption of monophyly of genera that is necessary for improving taxonomic overlap between input trees. All gene families that passed the PTP test (p < 0.05) were used to infer ML trees in RAxML . The GTR + Gamma + F model was used for all alignments longer than 200 amino acids. To avoid overparametrization, the LG + Gamma + F model was used for alignments shorter than 200 amino acids. More parameter-rich models that can account for compositional heterogeneities in the data [50,66] were not used and this is an important limitation of our study because incongruence among gene trees could have been exacerbated by the use of compositionally homogeneous models (LG and GTR) that might not have a good fit to the data. 
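The genus-level reduction described above (one randomly retained species per genus in each gene family) can be sketched as follows. This is only an illustration under stated assumptions: the function name `reduce_to_genus_level` is hypothetical, a gene family is represented as a mapping from species name to sequence, and the genus is assumed to be the first word of the species name.

```python
import random


def reduce_to_genus_level(family, rng=None):
    """Keep one randomly chosen species per genus in a gene family.

    `family` maps species name -> sequence. Assumption for this sketch:
    the genus is the first word of the species name.
    """
    rng = rng or random.Random()
    by_genus = {}
    for species in family:
        genus = species.split()[0]
        by_genus.setdefault(genus, []).append(species)
    reduced = {}
    for genus, species_list in by_genus.items():
        # Random choice among congeneric species present in this family.
        chosen = rng.choice(sorted(species_list))
        reduced[genus] = family[chosen]
    return reduced
```

Relabelling the retained sequence with its genus name is what increases taxonomic overlap between input trees, at the cost of assuming that each genus is monophyletic.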
A total of 16 463 partially overlapping gene trees were generated using the above-described strategy and these gene trees constitute the trees in the PROK dataset. Two thousand eight hundred and eighty-seven gene trees contained at least one archaebacterium and 1512 gene trees showed Archaebacteria clustering with Eubacteria. PROK was then used to create EUBAC and ARC by pruning all archaebacterial and eubacterial taxa, respectively, from the trees in PROK. Because some gene trees included only Archaebacteria or only Eubacteria, and because some gene trees were left with fewer than four taxa after pruning, the EUBAC and ARC datasets include, respectively, 14 558 gene trees spanning a total of 392 taxa, and 1776 gene trees spanning a total of 51 taxa.
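The pruning step used to derive EUBAC and ARC from PROK can be sketched as below. This is a simplified illustration, not the code used in the study: trees are represented as nested tuples with leaf names as strings, branch lengths are ignored, and the function names are hypothetical.

```python
def prune(tree, keep):
    """Restrict `tree` to the leaves in `keep`; return None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree)
                if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]  # collapse a node left with a single child
    return tuple(children)


def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def derive_dataset(gene_trees, keep, min_taxa=4):
    """Prune every gene tree and drop those left with fewer than `min_taxa`."""
    pruned = (prune(t, keep) for t in gene_trees)
    return [t for t in pruned
            if t is not None and len(leaves(t)) >= min_taxa]
```

Calling `derive_dataset` with the set of eubacterial genera would give an EUBAC-like set of trees, and calling it with the archaebacterial genera would give an ARC-like set.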
(d) Identification of unstable taxa
Taxa that are under-represented in gene trees (generally because they have a reduced genome) might be unstable in supertree analyses, artificially increasing the perceived incongruence among gene trees (e.g. ). Here, the concatabomination approach  was used to identify and remove taxa that were likely to be unstable because of poor taxonomic coverage in gene trees. Two more datasets PROK-minus and EUBAC-minus were created where all taxa identified as unstable because of poor taxonomic overlap were pruned from all gene trees. No unstable taxa were identified in ARC (even though Nanoarchaeum was unstable in the context of the PROK dataset). Accordingly, we did not have to create an ARC-minus dataset. Because we wanted to avoid the negative effects of unstable taxa on our results, only the PROK-minus, EUBAC-minus and ARC datasets were subjected to further analyses.
(e) Supertree analyses
The gene trees in PROK-minus, EUBAC-minus and ARC were used as input to Bayesian supertree analyses performed in p4 . All Bayesian analyses were run with two parallel independent chains and with the model parameter set to implement the likelihood model of Steel & Rodrigo , with the normalizing alpha parameter approximated as in  and the beta parameter, a dataset-specific value that reflects concordance among the input trees, set to be a free parameter estimated during tree search. This is different and represents a significant improvement over the Bayesian supertree implementation of Akanni et al.  that was based on the original method of Steel & Rodrigo . All analyses were run until convergence was achieved while sampling every 5000 iterations, see §6 for details referring to each specific analysis. Convergence between the two independent Markov chain Monte Carlo (MCMC) chains was monitored by plotting the sampled trees' likelihood values, and the total number of trees retained post-burn-in varied with analyses. The chains were stopped after they reached convergence and majority-rule consensus trees with minority components were generated from the trees sampled after convergence to generate our Bayesian supertrees. Support for internal branches was estimated with reference to the posterior probabilities (PP) of the recovered splits.
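To make the roles of the alpha and beta parameters concrete, the sketch below assumes the exponential form of the Steel & Rodrigo model, in which the probability of an input tree given the supertree is alpha × exp(−beta × d), where d is a tree-to-tree distance between the input tree and the supertree pruned to the same leaf set. The function name is hypothetical, the distances are taken as precomputed inputs, and the per-tree log normalizing terms are simply supplied by the caller rather than approximated as in the study.

```python
def supertree_log_likelihood(distances, beta, log_alphas=None):
    """Log-likelihood of a supertree under an exponential distance model.

    `distances[i]` is the distance between input tree i and the supertree
    restricted to that tree's leaf set; `log_alphas[i]` is the log of the
    normalizing term for tree i (taken as 0.0 if not supplied, which is a
    simplification made for this sketch only).
    """
    if log_alphas is None:
        log_alphas = [0.0] * len(distances)
    return sum(la - beta * d for la, d in zip(log_alphas, distances))
```

Under this form, a larger beta penalizes disagreement between input trees and the supertree more heavily, which is why beta can be read as reflecting concordance among the input trees.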
(f) Comparisons with the generally accepted topology of the tree of life
We first tested whether the PROK-minus supertree was significantly better than random using the YAPTP test . To implement the YAPTP test, we generated 100 random trees on the same leaf set as PROK-minus in PAUP v. 4b10. The likelihood of each random tree and of the PROK-minus supertree was obtained using L.U.St.  recoded to implement the ML method of Steel & Rodrigo  as modified in . The latest implementation of L.U.St. can be downloaded from bitbucket (https://email@example.com/afro-juju/l.u.st.git). The distribution of likelihood scores for the random trees and for the PROK-minus supertree were plotted in R to reveal whether the likelihood of PROK-minus was significantly better than that of the random trees. To test alternative hypotheses about the tree of life, a supertree-based version  of the approximately unbiased (AU)  test was used to compare the PROK-minus supertree against the generally accepted topology for the tree of life. The latter was obtained by modifying the tree of Ciccarelli et al. , which has arguably become the most widely used reference topology for the tree of life in both textbooks and the scientific literature, to include all and only the species considered in our study. Given that eukaryotes are not included in our dataset, the fact that the Ciccarelli et al.  tree is outdated (in that it does not display the eocyte topology) is not a problem for our analyses. The supertree-based AU test was calculated using L.U.St.  to obtain input-treewise likelihood values for all gene trees under both compared supertrees. These values were then used as input for CONSEL  that was used to perform the AU test.
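The YAPTP comparison described above amounts to asking where the supertree's likelihood falls relative to the likelihoods of random trees on the same leaf set. The study assessed this by plotting the distributions in R; the Python sketch below instead summarizes the comparison numerically. The function name and the add-one empirical p-value are assumptions of this sketch, not part of the published procedure.

```python
def yaptp_summary(supertree_lnl, random_lnls):
    """Compare a supertree's log-likelihood with those of random trees."""
    # Count random trees that fit the input trees at least as well.
    n_as_good = sum(1 for x in random_lnls if x >= supertree_lnl)
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (n_as_good + 1) / (len(random_lnls) + 1)
    return {
        "best_random": max(random_lnls),
        "n_as_good": n_as_good,
        "p_value": p_value,
    }
```

With 100 random trees, the smallest p-value this summary can report is 1/101, so it only shows that the supertree is better than every sampled random tree, not by how much.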
(g) Identification of directional Eubacteria to Archaebacteria gene imports
All gene trees in PROK-minus that included at least one archaebacterium (2887 trees) were visually inspected, and the same strategy as used by Pisani et al.  to identify prokaryotic outgroups of eukaryotic genes was used to identify eubacterial outgroups of archaebacterial genes. To root the gene trees, we assumed the topology of the standard tree of life  to be correct. A directional HGT (from Eubacteria to Archaebacteria) was assumed in all instances where a gene was found to have a widespread distribution in Eubacteria but a very limited distribution in (specific to a phylum or to a few related taxa) Archaebacteria. We acknowledge that such gene trees could also be the result of multiple (independent) lineage-specific gene losses; however, such a scenario would be significantly less parsimonious than one assuming a single HGT. In many cases, the direction of transfer could be unambiguously identified: in cases where a gene tree could not be rooted on the Eubacteria–Archaebacteria split while at the same time resolving: (i) Archaebacteria and Eubacteria as monophyletic and (ii) the generally accepted relationships within Archaebacteria and Eubacteria. To clarify, an example would be a tree including an archaebacterial phylum (say the Haloarchaea), and two eubacterial lineages, say Actinobacteria and Proteobacteria. Such a tree will unambiguously support a directional transfer from Eubacteria to Archaebacteria if, when rooted on the Archaebacteria–Eubacteria split, it would display Actinobacteria and Proteobacteria as paraphyletic with reference to each other. On the contrary, a tree where the transfer is most parsimonious but not unambiguous would be one where the rooted tree defined on the Archaebacteria–Eubacteria split is one where Actinobacteria and Proteobacteria emerge as monophyletic. 
Genes with a broad distribution in Archaebacteria and Eubacteria were assumed to have been vertically inherited, and gene trees where clear monophyletic or paraphyletic groups could not be defined (e.g. where Archaebacteria known to belong to the same phylum were scattered across Eubacteria) were considered ambiguous and not included in our counts. As we did not use trees that could not be clearly interpreted based on current phylogenetic knowledge, our estimated number of imports should be considered conservative. Numbers of imported genes (from Eubacteria) were transformed into proportions of the total number of imports observed to better compare the relevance of imports from different eubacterial groups. For each considered archaebacterial lineage, the average number of transfers across all donors was calculated. The mean number of imports indicates the number of transfers that would be expected from each donor if HGT were randomly distributed. Median, standard deviation and quartiles were calculated, and donors that contributed an anomalous (significantly high) number of genes to a specific archaebacterial group were identified. Significantly high imports were identified in two different ways. Firstly, for each considered archaebacterial group, a standard Shapiro–Wilk test  was performed (in R) to evaluate whether it was possible to reject the hypothesis that the distribution of imports across donors was normally distributed. If the hypothesis of normality could not be rejected, donors with a significantly high proportion of imports were identified as those falling outside the 95% confidence interval of the considered distribution. If the distribution was not normal, donors that provided an anomalously high number of genes were identified as those falling beyond the third quartile + 1.5 times the interquartile range (IQR). These donors are those that would be identified as falling outside the box and whiskers in a standard Tukey's boxplot. 
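The non-parametric criterion described above (third quartile + 1.5 × IQR) can be sketched in Python as follows. The study performed these calculations in R; the function name here is hypothetical, and the quartiles are computed with the "inclusive" method of the Python standard library, which may differ slightly from the hinges used by R's boxplot for small samples.

```python
import statistics


def tukey_high_outliers(imports_by_donor):
    """Return donors whose import count exceeds Q3 + 1.5 * IQR."""
    values = list(imports_by_donor.values())
    # Quartiles of the distribution of imports across donors.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    return sorted(donor for donor, value in imports_by_donor.items()
                  if value > upper_fence)
```

Only the upper fence is checked because the analysis is concerned with donors that contributed anomalously many genes, not anomalously few.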
Finally, for all considered archaebacterial groups, the distribution of imports across all donors was visually represented using boxplots. Because imports from two eubacterial lineages (Clostridia and δ-Proteobacteria) were significantly high across many archaebacterial groups and generally high across all Archaebacteria, the above-mentioned approach was repeated twice, once including all imports across all archaebacterial lineages, and once after having excluded Clostridia and δ-Proteobacteria.
Poor taxonomic overlap is a known source of instability in supertree analyses (e.g. ), and it can significantly reduce the branch support and resolution of a supertree. However, lack of resolution in a supertree can also be caused by important biological factors (e.g. HGTs and the signature of symbioses) and it is key to eliminate the effect of unstable taxa if we are to understand the relative strengths of vertical and horizontal signals in the data. Using the concatabomination approach , we identified the genome of Ureaplasma, a member of the Mollicutes (electronic supplementary material, figure S1a), to be the most unstable genus in PROK. Fifteen more unstable taxa were identified (electronic supplementary material, figure S1b–d), only one of which was an archaebacterium (Nanoarchaeum). Exclusion of all these taxa eliminates the instability caused by poor taxonomic overlap across gene trees in PROK. Notably, Nanoarchaeum caused instability in PROK but not in ARC so it was not excluded from the latter. Through the exclusion of unstable taxa in PROK and EUBAC, we generated the PROK-minus and EUBAC-minus datasets. These two datasets, together with ARC, were used for all subsequent analyses.
The PROK-minus Bayesian supertree analysis reached convergence at 1.05 million iterations and a total of 780 trees were sampled from the post-burn-in MCMC chains. The majority-rule consensus with minority components obtained from the sampled trees is our PROK-minus supertree and is presented in figure 1. It has many poorly supported groups (PP < 0.5), indicated by dotted lines in figure 1, and if these were suppressed it would be very poorly resolved. Of the 25 prokaryotic phyla represented in this tree by more than one genus only a few (Deferribacteres, Deinococcus/Thermus, Chlorobi, Fusobacteria, Planctomycetes, Thaumarchaeota, Aquificae and Thermotogae) appear monophyletic. The PROK-minus tree is generally better supported closer to the tips, with deeper nodes poorly supported. This is in line with the results of the previous supertree studies of Creevey et al.  and Pisani et al. , which found that relatively strong vertical signal exists only towards the tips of the prokaryotic tree. Signal erosion in datasets intended to be used to resolve the relationships among the primary lineages of life is in part a consequence of the complexity of trying to infer ancient divergences using limited amounts of often substitutionally saturated sequence data. However, we suggest that in our supertree analysis, poor resolution is primarily a consequence of the signal associated with vertical inheritance not being the principal determinant of prokaryotic evolution. Analyses of PROK-minus failed to recover a supertree that could be rooted in such a way as to make Archaebacteria and Eubacteria monophyletic. In figure 1, this tree has been arbitrarily rooted only for visualization purposes, as an unrooted representation would have been impractical with this number of taxa. Clades cannot be defined on an unrooted tree so groups in this tree should be considered clans (sensu ). 
In this tree, the clans corresponding to the Methanobacteriales, Methanococcales, Thermococcales and Methanopyrales are interspersed across the Actinobacteria and Bacteroidetes clans . The Methanomicrobiales clan emerges within a clan mostly composed of δ-Proteobacteria. The Archaeoglobales, Thermoplasmatales and Aciduliprofundum emerge in a clan with β-Proteobacteria. Halobacteria, Methanocellales and Methanosarcinales form clans that also include γ-Proteobacteria. Sulfolobales form a clan of their own, while Desulfurococcales are interspersed across γ- and α-Proteobacteria. Finally, Thaumarchaeota and Thermoproteales nest in a clan including δ-Proteobacteria, Planctomycetes, Cyanobacteria and Chlamydiae/Verrucomicrobia. Despite its unconventional topology, the YAPTP test showed that the PROK-minus tree is not random (figure 2), and the AU test showed that it fits our dataset significantly better than a tree displaying the generally accepted topology for the tree of life (p = 1.00 × 10⁻¹¹²). Note, however, that most of the above-mentioned relationships have PP < 0.5 and should not be interpreted as sister-group relationships between the considered taxa. Rather, we suggest that our results should be taken to indicate that there are multiple, contradictory, vertical and horizontal signals in the data.
The EUBAC-minus Bayesian analysis reached convergence at 2.3 million iterations, and a total of 680 trees were sampled from the post-burn-in MCMC chains and summarized using the majority-rule consensus method with minority components to derive the EUBAC-minus supertree (figure 3). Like the PROK-minus supertree, the EUBAC-minus supertree was arbitrarily rooted; it has to be considered an unrooted tree, and groups in this tree should be considered clans rather than clades (see above). Eubacterial relationships inferred from the EUBAC-minus supertree do not represent a significant improvement over those in the PROK-minus supertree of figure 1.
Proportions and origins of archaebacterial genes with horizontal history are reported in table 1, electronic supplementary material, table S1, and figure 4 together with descriptive statistics. Our results suggest that there is evidence that relatively large numbers of genes of eubacterial origin have entered specific archaebacterial groups independently. The average proportions of imports in the two tables indicate the expected gene flows under the assumption that imports are randomly distributed across all donors. Figure 4 is a boxplot representation of the data in electronic supplementary material, table S1, and it helps identify eubacterial taxa that seem to have contributed significant numbers of genes to specific archaebacterial groups. Of all considered eubacterial lineages, only three (Actinobacteria, Clostridia and δ-Proteobacteria) show significantly high exports towards the Archaebacteria when all donors are included in the analysis (electronic supplementary material, table S1). These three lineages are not the most highly represented in our dataset and thus these results do not seem to be dependent on eubacterial sampling density. In detail, an anomalously high number of imports can be observed from Clostridia and δ-Proteobacteria into most archaebacterial groups (electronic supplementary material, table S1, and figure 4), and from Actinobacteria into Thermoplasmatales (electronic supplementary material, table S1). A high, even if not significant, number of imports from Actinobacteria is also observed into Haloarchaea, Sulfolobales and Thermoproteales.
Repeating the analyses after excluding Clostridia and the δ-Proteobacteria (table 1) showed that once these ‘outliers’ are removed, other significant donors emerge. In particular, Actinobacteria now emerge as having donated significantly high proportions of genes to Sulfolobales, Thermoproteales, Thermoplasmatales and Desulfurococcales, with the proportion for Haloarchaea still being high but not significant. β-Proteobacteria seem to have significantly contributed to the Acidolobales, and γ-Proteobacteria to the Archaeoglobales.
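The logic of screening donors, removing the strongest ones and screening again can be sketched as follows. This is an illustration under stated assumptions, not the test used in this study: the gene counts are invented, the function names are hypothetical, and Tukey's upper fence (third quartile plus 1.5 times the interquartile range, the conventional boxplot outlier rule) stands in for whatever significance criterion underlies table 1 and electronic supplementary material, table S1.

```python
from statistics import quantiles


def import_proportions(counts):
    """Convert per-donor counts of imported genes into proportions."""
    total = sum(counts.values())
    return {donor: n / total for donor, n in counts.items()}


def high_outliers(proportions, k=1.5):
    """Return donors whose proportion lies above Tukey's upper fence.

    Assumption: the boxplot outlier rule is used as a stand-in for a
    formal significance test.
    """
    q1, _, q3 = quantiles(list(proportions.values()), n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return sorted(d for d, p in proportions.items() if p > fence)


# Invented counts of genes imported by one archaebacterial recipient.
counts = {
    "Bacteroidetes": 12, "Cyanobacteria": 10, "Chlorobi": 11,
    "Aquificae": 9, "Thermotogae": 13, "Planctomycetes": 10,
    "Actinobacteria": 20, "Clostridia": 60, "delta-Proteobacteria": 55,
}

# Expected proportion per donor if imports were spread evenly across donors.
expected = 1 / len(counts)

# First pass: all donors included.
first_pass = high_outliers(import_proportions(counts))

# Second pass: drop the donors flagged above and recompute proportions.
reduced = {d: n for d, n in counts.items() if d not in first_pass}
second_pass = high_outliers(import_proportions(reduced))
```

With these invented counts, the two dominant donors mask a weaker excess in the first pass; once they are removed, the quartiles tighten and the weaker donor exceeds the fence, which mirrors the pattern described above for Actinobacteria.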
The ARC Bayesian analysis reached convergence after 500 000 iterations and 600 post-burn-in trees were used to build the ARC supertree (figure 5). In contrast to the analyses of PROK-minus and EUBAC-minus, the ARC analysis (figure 5) returned a tree that is in excellent agreement with those recovered from studies based on ribosomal proteins only (e.g. [51–55]). Accordingly, this tree was rooted following previous studies in archaebacterial evolution, and the groups in this tree, contrary to the case of PROK-minus and EUBAC-minus, represent clades, not clans. In this tree, Haloarchaea emerges from the methanogens, and Crenarchaeota can be seen as the sister group of the Thaumarchaeota. In addition to having a topology comparable to that of other archaebacterial phylogenies, the ARC supertree is also ‘perfectly’ supported, that is, all splits in this tree have PP = 1.
Our analyses did not recover a tree for Archaebacteria and Eubacteria that reflects the relationships expected according to the generally accepted topology of the tree of life. However, the results of the YAPTP test and the AU test indicate that our analyses found a tree that is not random and fits our data better than the generally accepted tree of life. These results might seem counterintuitive but are not. The methods implemented in our analysis are bound to return a tree based on the strongest signal in the data. Because we used genes sampled from across all genomes rather than a subset of functionally and evolutionarily related proteins cleaned of all suspected HGTs, as was done in Ciccarelli et al., for example, the supertree is a composite derived from the interactions of vertical and horizontal signals. When seen in this way, our results indicate that there are congruent horizontal signals in the data that are strong enough to eclipse the vertical signals. We conjecture that, as suggested by Nelson-Sathi et al., this is probably because Eubacteria to Archaebacteria imports are not randomly distributed. Rather, specific archaebacterial lineages mostly imported genes from well-defined eubacterial donors (e.g. δ-Proteobacteria, Clostridia and Actinobacteria; see electronic supplementary material, table S1; table 1 and figure 4). As a consequence of having imported large numbers of eubacterial genes from multiple sources, Archaebacteria are scattered across Eubacteria in the PROK-minus supertree. When Eubacteria are excluded from the analyses (i.e. when ARC is analysed), we obtain very strong support for the generally accepted archaebacterial tree, with PP = 1 across all nodes. We suggest that this result confirms that the unusual topology of PROK-minus is a consequence of large imports of genes by Archaebacteria. Overall, we suggest that our results should be interpreted as supporting the hypothesis of Nelson-Sathi et al. that, while massive gene flows from Eubacteria are concomitant with the origin of archaebacterial clades, Archaebacteria to Archaebacteria transfers and exports from Archaebacteria to Eubacteria have been significantly less common throughout the history of life.
Ancestral Archaebacteria seem to have integrated large numbers of genes primarily from Eubacteria. Such large directional influxes of genes to well-defined archaebacterial recipients are consistent with the idea that a single eubacterial donor might have been in some way engulfed by the archaebacterial recipient, passing its genes en masse to the recipient. Perhaps this happened through processes of phagocytosis followed by chromosomal recombination, or through processes of endosymbiosis, whereby the symbiont was progressively simplified and its genes transferred to the host nucleus. Similar simplification processes are known to have happened in eukaryotic organelles, nucleomorphs, and in extant animal symbionts like the Blochmannia floridanus symbionts of ants.
Nelson-Sathi et al. identified six archaebacterial groups the origins of which seem to have been coincident with large-scale imports from Eubacteria. These are Thermoproteales, Desulfurococcales, Methanobacteriales, Methanococcales, Methanosarcinales and Haloarchaea. For the last of these groups, these authors were able to identify Actinobacteria as the primary source of eubacterial genes [36,37]. Here, we have corroborated that result, even though the actinobacterial import into Haloarchaea, while high, is not statistically significant. In line with the results of Nelson-Sathi and co-workers, we identified further imports from Actinobacteria into Thermoproteales and Desulfurococcales. However, we also identified significant actinobacterial imports into Sulfolobales and Thermoplasmatales (table 1). Because Thermoproteales, Sulfolobales and Desulfurococcales form a clade in our archaebacterial tree (figure 5), and given that Acidolobales also displayed a relatively high proportion of actinobacterial genes (electronic supplementary material, table S1; table 1), this result is suggestive of a single chimerization event that involved an actinobacterium and a common ancestor of these phyla (figure 5). We further found that almost all archaebacterial lineages have an often-significant excess of genes shared with δ-Proteobacteria and Clostridia (electronic supplementary material, table S1; figure 4). Because genes shared with δ-Proteobacteria and Clostridia have a broad archaebacterial distribution, these genes are suggestive of either two more chimerization events that happened in the archaebacterial stem lineage, or of two large-scale transfers from Archaebacteria to Eubacteria (figure 5). Finally, significantly higher imports from γ-Proteobacteria into Archaeoglobales and from β-Proteobacteria into Acidolobales were identified, suggestive of two more chimerization events.
The EUBAC analysis, by failing to identify traditional eubacterial phyla as potentially monophyletic, indicates that directional Eubacteria to Eubacteria transfers might have been common in eubacterial evolution. For example, the Coriobacteriaceae seem to have been involved in directional HGT from Bacteroidetes, Ehrlichiaceae from the γ-Proteobacteria, and the Rickettsiaceae from the Nitrosomonadales (γ-Proteobacteria). Overall, the complexity of the EUBAC tree indicates that HGT had a greater impact in eubacterial evolution, and much more detailed analyses would be necessary to better understand directional patterns and the magnitude of HGTs in this primary lineage of life.
We performed an updated supertree analysis of the eubacterial and archaebacterial lineages using a recently developed, and seemingly well-founded, Bayesian supertree method. Our analyses could not recover a monophyletic Archaebacteria when all eubacterial and archaebacterial genomes were considered simultaneously. These results disagree with a previous supertree study. Although we did not investigate the cause of this discrepancy in detail, the two studies differ in the supertree method used and in the number of genomes considered, and the discrepancy is most likely a consequence of one or both of these factors.
How to interpret the generally accepted tree of life in light of HGTs and symbiotic events has long been debated [6,14,21,22,27,37,76,77], and the traditional interpretation of the ‘tree of life’, as representing the major determinant of the evolutionary processes that underpinned the origin and early diversification of life on Earth, has become obsolete. Yet Puigbò and co-workers suggested that the tree topology represented by the canonical tree of life should still be seen as a statistical central tendency: a tree topology embedded in a phylogenetic network of life. This is because, according to these authors, this is the only tree topology that is broadly agreed upon by at least a subset of genes across all of life: the NUTs (nearly universal trees) of Puigbò et al. Our supertrees partially reject the view of Puigbò et al. This is because, while a central tendency can be defined using large numbers of genes, a non-random supertree is recovered from the analyses of the PROK dataset with a topology that is different from, and in substantial disagreement with, Puigbò's NUTs and the generally accepted small-subunit rRNA tree of life [79,80]. At the same time, our results suggest that the traditional archaebacterial tree, as recovered from small-subunit rRNA and from various datasets composed of ribosomal proteins, is indicative of a real evolutionary pattern, as this tree topology can be recovered from the ARC dataset (i.e. when eubacterial lineages are pruned out from PROK). Indeed, in the previous supertree study of Pisani et al., nodes in Archaebacteria also had higher support than nodes in Eubacteria, and Nelson-Sathi et al. [36,37] pointed out that Archaebacteria are less prone than Eubacteria to engage in HGTs. However, how to interpret the tree in figure 5 in light of the tendency of Archaebacteria to engage in large-scale transfers from Eubacteria (and perhaps to Eubacteria) is far from obvious.
Certainly, the tree of figure 5 cannot be interpreted, in isolation, as representing the principal determinant of archaebacterial evolution, or as the complete evolutionary history of the archaebacterial genomes, as it does not describe the large-scale imports that seem to have shaped archaebacterial genome evolution. It does, however, seem to indicate that vertical evolutionary processes are more important in Archaebacteria than they are in Eubacteria.
Our results are consistent with recent findings [14,36,37] suggesting that the origin of major archaebacterial lineages was coincident with large-scale gene imports into Archaebacteria. However, in addition, we also identified two large-scale transfers (not necessarily imports) at the base of the Archaebacteria. These might be the first evidence for large-scale directional transfers from Archaebacteria to Eubacteria, but further tests would be necessary to better understand this result. Together with the absence of primitively amitochondriate eukaryotes, recent discoveries of the existence of giant Archaebacteria with eubacterial ectosymbionts, the presence of eukaryotic-like actin genes across Archaebacteria, biotechnological evidence indicating that Archaebacteria can undergo cell fusion followed by the generation of recombinant chromosomes [34,35], evidence that heterochiral hybrid membranes consisting of a mixture of glycerol-1- and glycerol-3-phosphate lipids can be synthesized and are stable, and the recent discovery of the Lokiarchaeota, with their sophisticated membrane remodelling capabilities and large repertoire of genes that in eukaryotes are related to phagocytosis, our results reinforce evidence in favour of a symbiotic origin of the Eukaryota.
The encounter between the eocyte and the α-proteobacterial mitochondrial ancestor was a momentous event in the history of life, and most likely it was an obvious consequence of archaebacterial ecology.
All data have been deposited into the Dryad dataset: http://dx.doi.org/10.5061/dryad.2r732.
D.P., W.A.A. and J.O.M. designed the experimental protocol. W.A.A. ran all the analyses. W.A.A. and D.P. created the figures and tables. W.A.A., P.G.F., C.C. and M.W. implemented the methods. K.S. ran the concatabomination analysis. All authors contributed to the writing of the manuscript.
We declare we have no competing interests.
W.A.K. and M.W. were supported by a BBSRC grant no. BB/K007440/. M.W. was additionally supported by Templeton Foundation grant no. 43915. D.P. and J.O.M.C.I. were supported by a Science Foundation Ireland grant no. RFP-EOB-3106, and by a Templeton Foundation grant no. 48177.
The authors thank Prof. Embley and Dr Williams for inviting us to contribute to this special issue of Philosophical Transactions of the Royal Society.
One contribution of 17 to a theme issue ‘Eukaryotic origins: progress and challenges’.
- Accepted July 9, 2015.
- © 2015 The Author(s)
Published by the Royal Society. All rights reserved.