Fast, accurate and simulation-free stochastic mapping

Vladimir N Minin, Marc A Suchard

Abstract

Mapping evolutionary trajectories of discrete traits onto phylogenies receives considerable attention in evolutionary biology. Given the trait observations at the tips of a phylogenetic tree, researchers are often interested where on the tree the trait changes its state and whether some changes are preferential in certain parts of the tree. In a model-based phylogenetic framework, such questions translate into characterizing probabilistic properties of evolutionary trajectories. Current methods of assessing these properties rely on computationally expensive simulations. In this paper, we present an efficient, simulation-free algorithm for computing two important and ubiquitous evolutionary trajectory properties. The first is the mean number of trait changes, where changes can be divided into classes of interest (e.g. synonymous/non-synonymous mutations). The mean evolutionary reward, accrued proportionally to the time a trait occupies each of its states, is the second property. To illustrate the usefulness of our results, we first employ our simulation-free stochastic mapping to execute a posterior predictive test of correlation between two evolutionary traits. We conclude by mapping synonymous and non-synonymous mutations onto branches of an HIV intrahost phylogenetic tree and comparing selection pressure on terminal and internal tree branches.

Keywords:

1. Introduction and background

Reconstructing evolutionary histories from present-day observations is a central problem in quantitative biology. Phylogenetic estimation is one example of such reconstruction. However, phylogenetic reconstruction alone does not provide a full picture of an evolutionary history, because evolutionary paths (mappings) describing trait states along the phylogenetic tree remain hidden. Although one is rarely interested in detailed reconstruction of such mappings, certain probabilistic properties of the paths are frequently used in evolutionary hypotheses testing (Nielsen 2002; Huelsenbeck et al. 2003; Leschen & Buckley 2007). For example, given a tree and a Markov model of amino acid evolution, one can compute the expected number of times a transition from a hydrophobic to a hydrophilic state occurs, conditional on the observed amino acid sequence alignment. Such expectations can inform researchers about model adequacy and provide insight into the features of the evolutionary process overlooked by standard phylogenetic techniques (Dimmic et al. 2005).

Nielsen (2002) introduced stochastic mapping of trait states on trees and employed this new technique in a model-based evolutionary hypothesis testing context. The author starts with a discrete evolutionary trait X that attains m states. He further assumes that this trait evolves according to an evolutionary model described by a parameter vector θ, where θ consists of a tree τ with n tips and branch lengths Embedded Image, root distribution Embedded Image and a continuous-time Markov chain (CTMC) generator Q={qij} for Embedded Image. Let mapping Embedded Image be a collection of CTMC trajectories along all branches of τ and H(Mθ) be a real-valued summary of Mθ. Clearly, even when parameters θ are fixed, h(Mθ) remains a random variable. Nielsen (2002) proposed to test evolutionary hypotheses using prior and posterior expectations Embedded Image and Embedded Image, where Embedded Image are trait values observed at the n tips of τ. Since these expectations are deterministic functions of θ and D, they can be used as discrepancy measures for posterior predictive p-value calculations (Meng 1994; Gelman et al. 1996).

A major advantage of Nielsen's stochastic mapping framework is its ability to account for uncertainty in model parameters, including phylogenies. A major limitation of stochastic mapping is its current implementation that relies on time-consuming simulations. In describing his method for calculating Embedded Image and Embedded Image, Nielsen (2002) wrote ‘In general, we can not evaluate sums in equations 5 and 6 directly, because the set [of all possible mappings] is not of finite size.’ However, the infinite number of possible mappings does not prevent one from explicitly calculating Embedded Image for some choices of H. For example, ifEmbedded Image(1.1)then Embedded Image, the familiar phylogenetic likelihood that can be evaluated without simulations (Felsenstein 2004). Therefore, hope remains that other choices of H may also permit evaluation of Embedded Image and Embedded Image without simulations.

In this paper, we consider a class of additive mapping summaries of the formEmbedded Image(1.2)where Embedded Image is a one-dimensional summary of the Markov chain path along branch b and Ω is an arbitrary subset of all branches of τ. Moreover, we restrict our attention to the two most popular choices of function h. Let Embedded Image be a set of ordered index pairs that label transitions of trait X. For each Markov path {Xt} and interval [0,t), we count the number of labelled transitions in this interval and arrive atEmbedded Image(1.3)where we omit dependence on θ and t for brevity. Although our second choice of h is more abstract, it is motivated by Huelsenbeck et al. (2003), who used Nielsen's stochastic mapping algorithm to calculate the mean dwelling time of a trait in a particular state. Let Embedded Image be a set of rewards assigned to each trait state. Trait X is ‘rewarded’ the amount t×wi for spending time t in state i. We obtain the total reward of Markov path {Xt} by summing up all rewards that X accumulates during interval [0,t),Embedded Image(1.4)To obtain dwelling times of X in a predefined set of trait states, we set wi=1 if i belongs to the set of interest and wi=0 otherwise.

For these two choices of function h, we provide an algorithm for exact, simulation-free computation of Embedded Image and Embedded Image. Similar to phylogenetic likelihood calculations of Pr(D), this algorithm relies on the eigen decomposition of Q and requires traversing τ. Our work generalizes and unifies exact calculations previously developed for stochastic mapping (Holmes & Rubin 2002; Guindon et al. 2004; Dutheil et al. 2005). Despite the restricted form of the considered mapping summaries, our results cover nearly all current applications of stochastic mapping. We conclude with two applications of stochastic mapping illustrating the capabilities of exact computation. In our first example, we examine coevolution of two binary traits and demonstrate that a previously developed simulation-based test of independent evolution can be executed without simulations. We then turn to a large codon Markov state space, on which simulation-based stochastic mapping generally experiences severe computational limitations. Using our exact computations, we study temporal patterns of synonymous and non-synonymous mutations in intrahost HIV evolution.

2. Local, one branch calculations

In this section, we provide mathematical details needed for calculating expectations of stochastic mapping summaries on one branch of a phylogenetic tree. We first motivate the need for such local computations by making further analogies between calculations of Pr(D) and expectations of stochastic mapping summaries. The additive form of H reduces calculation of Embedded Image and Embedded Image to compute branch-specific expectations Embedded Image and Embedded Image. Recall that according to the most phylogenetic models, trait X evolves independently on each branch of τ, conditional on trait states at all internal nodes of τ. This conditional independence is the key behind the dynamic programming algorithm that allows for efficient calculation of Pr(D) (Felsenstein 1981). For this likelihood calculation algorithm, it suffices to compute finite-time transition probabilities Embedded Image, whereEmbedded Image(2.1)for arbitrary branch length t. Similarly, to obtain Embedded Image and Embedded Image, we require means of computing local expectations Embedded Image, whereEmbedded Image(2.2)and 1{.} is the indicator function. After illustrating how to compute Embedded Image and Embedded Image without resorting to simulations, we provide an algorithm that efficiently propagates local expectations E(h, t) and finite-time transition probabilities P(t) along τ to arrive at Embedded Image and Embedded Image.

(a) Expected number of labelled Markov transitions

Abstracting from phylogenetics, let Embedded Image count the number of labelled transitions of a CTMC {Xt} during time interval [0,t). It follows from the theory of Markov chain-induced counting processes thatEmbedded Image(2.3)where Embedded Image (Ball & Milne 2005). Since most evolutionary models are locally reversible, we can safely assume that Q is diagonalizable with eigen decomposition Embedded Image, where eigenvectors of Q form the columns of U, Embedded Image are the real eigenvalues of Q, and Embedded Image is a diagonal matrix with elements Embedded Image on its main diagonal. Such analytic or numeric diagonalization procedure permits calculation of finite-time transition probabilities Embedded Image, needed for likelihood calculations (Lange 2004). Minin & Suchard (2008) have shown that one can use the same eigen decomposition of Q to calculate local expectationsEmbedded Image(2.4)where Embedded Image, Ei is a matrix with zero entries everywhere except at the ii-th entry, which is one, andEmbedded Image(2.5)

(b) Expected Markov rewards

For the reward process Rw(t), we define a matrix cumulative distribution function Embedded Image, whereEmbedded Image(2.6)Neuts (1995, ch. 5) demonstrated that local reward expectations can be expressed asEmbedded Image(2.7)whereEmbedded Image(2.8)is the Laplace–Stieltjes transform of V(x, t). It is easy to see that the matrix exponential in equation (2.8) satisfies the following differential equation:Embedded Image(2.9)Differentiating this matrix differential equation with respect to s, exchanging the order of differentiation and evaluating both sides of the resulting equation at s=0, we arrive at the differential equation for local expectationsEmbedded Image(2.10)where E(Rw, 0) is the m×m zero matrix. Multiplication of both sides of equation (2.10) by integrating factor Embedded Image from the right and integration with respect to t produces the solution,Embedded Image(2.11)Similarity between equations (2.3) and (2.11) invites calculation of the expected Markov rewards via spectral decomposition of Q,Embedded Image(2.12)In summary, formulae (2.4) and (2.12) provide a recipe for exact calculations of local expectations for the number of labelled transitions and rewards.

3. Assembling pieces together over a tree

(a) Notation for tree traversal

Let us label the internal nodes of τ with integers {1, …, n−1} starting from the root of the tree. Recall that we have already arbitrarily labelled the tips of τ with integers {1, …, n}. Let Embedded Image be the set of internal branches and Embedded Image be the set of terminal branches of τ. For each branch bEmbedded Image, we denote the internal node labels of the parent and child of branch b by p(b) and c(b), respectively. We use the same notation for each terminal branch b except p(b), which is an internal node index, while c(b) is a tip index. Let Embedded Image denote the internal node trait states. Then, the complete likelihood of unobserved internal node states and the observed states at the tips of τ isEmbedded Image(3.1)We form the likelihood of the observed data by summing over all possible states of internal nodes,Embedded Image(3.2)Clearly, when data on the tips are not observed, the prior distribution of internal nodes becomesEmbedded Image(3.3)

(b) Posterior expectations of mapping summaries

Consider an arbitrary branch b* connecting parent internal node p(b*) to its child c(b*). First, we introduce restricted moments,Embedded Image(3.4)The expectation (3.4) integrates over all evolutionary mappings consistent with D on the tips of τ. Invoking the law of total expectation and the definition of conditional probability, we deduceEmbedded Image(3.5)The last expression in derivation (3.5) illustrates that in order to calculate the posterior restricted moment (3.4) along branch Embedded Image, we merely need to replace finite-time transition probability Embedded Image with local expectation Embedded Image in the likelihood formula (3.2). Similarly, if Embedded Image, we substitute Embedded Image for Embedded Image in (3.2). Given matrices Embedded Image for Embedded Image and Embedded Image, we can sum over internal node states using Felsenstein's pruning algorithm to arrive at the restricted mean Embedded Image and then divide this quantity by Pr(D) to obtain Embedded Image.

This procedure is efficient for calculating the posterior expectations of mapping summaries for one branch of τ. However, in practice, we need to calculate the mapping expectations over many branches and consequently execute the computationally intensive pruning algorithm many times. A similar problem is encountered in maximum-likelihood phylogenetic estimation during differentiation of the likelihood with respect to branch lengths. An algorithm, reviewed by Schadt et al. (1998), allows for computationally efficient, repeated replacement of one of the finite-time transition probabilities with an arbitrary function of the corresponding branch length in equation (3.2). This algorithm finds informal use since the 1980s in pedigree analysis (Cannings et al. 1980) and is implemented in most maximum-likelihood phylogenetic reconstruction software such as PAUP (J. Huelsenbeck 2007, personal communication).

Let Embedded Image be the vector of forward, often called partial or fractional, likelihoods at node u. Element Fui is the probability of the observed data at only the tips that descend from the node u, given that the state of u is i. If u is a tip, then we initialize partial likelihoods via Embedded Image. In case of missing or ambiguous data, Du denotes the subset of possible trait states, and forward likelihoods are set to Embedded Image. During the first, upward traversal of τ, we compute forward likelihoods for each internal node u using the recursionEmbedded Image(3.6)where b1 and b2 are indices of the branches descending from node u and c(b1) and c(b2) are the corresponding children of u. We denote the directional likelihoods in square brackets of equation (3.6) by Embedded Image and Embedded Image and record Embedded Image together with Fu. Finally, we define backward likelihoods Embedded Image, where Gui is the probability of observing state i at node u together with other tip states on the subtree of τ obtained by removing all lineages downstream of node u. A second, downward traversal of τ yields Gu given the precomputed Sb.

For each branch b*, we can sandwich Embedded Image among the forward, directional and backward likelihoods and writeEmbedded Image(3.7)where b′ is the second branch descending from the parent node p(b*). Therefore, with Fu, Sb and Gu precomputed for all nodes and branches of τ, we can replace Embedded Image with any other quantity for any arbitrary branch b without repeatedly traversing τ. In particular, the posterior restricted moment for branch b* can be expressed asEmbedded Image(3.8)In figure 1, we use an example tree to illustrate the correspondence between each quantity in sandwich formula (3.8) and the part of the tree involved in this quantity computation. We summarize all steps that lead to the computation of the global mean Embedded Image in algorithm 1. Note that Pr(D), needed to transition between conditional and restricted expectations in formula (3.4), is computed with virtually no additional cost in step (iv) of the algorithm.

Figure 1

Sandwich formula illustration. (a) An example phylogenetic tree in which we label internal nodes numerically and two branches b* and b′. We break this tree at nodes 3 and 4 into the subtrees shown in (b). Assuming that the trait states are i and j at nodes 3 and 4, respectively, we mark each subtree by the corresponding quantity needed for calculating the posterior expectation of a mapping summary on branch b*.

View this table:
Algorithm 1

Calculating posterior expectations E[H(Mθ)|D].

(c) Pulley principle for evolutionary mappings

Suppose that we are interested in a mean mapping summary Embedded Image, obtained as a sum of local mapping summaries over all branches of the phylogenetic tree τ. We would like to know whether quantity Embedded Image changes when we move the root of τ to a different location.

Recall that reversibility of the Markov chain {Xt} makes Pr(D) invariant to the root placement in τ if the root distribution π is the stationary distribution of {Xt} (Felsenstein 1981). Felsenstein's pulley principle rests on the detailed balance conditionEmbedded Image(3.9)and Chapman–Kolmogorov relationshipEmbedded Image(3.10)Applying both formulae (3.9) and (3.10) once in equation (3.2) allows one to move the root to any position along the two root branches without changing Pr(D). Therefore, Pr(D) is invariant to moving the root to any position on any branch of τ since we can repeatedly apply the detailed balance condition and the Chapman–Kolmogorov equation.

Invariance of Pr(D) to root placement together with formula (3.4) suggests that root position invariance of conditional expectations Embedded Image holds if and only if invariance of joint expectations Embedded Image is satisfied. Consider a two-tip phylogeny with branches of length t1 and t2 leading to observed trait states D1 and D2, respectively. According to formulae (1.2) and (3.5), we may expect thatEmbedded Image(3.11)depends only on the sum t1+t2. Therefore, we can move the root anywhere on this phylogeny without altering the expectations if {Xt} is reversible. It is easy to see that repeated application of derivation (3.11) readily allows for extension of the root invariance principle to n-tip phylogenies.

In derivation (3.11), we use identitiesEmbedded Image(3.12)andEmbedded Image(3.13)Equation (3.12) splits computation of the expected summary on interval [0, t1+t2) into calculations bound to intervals [0,t1) and [t1,t1+t2) with the help of the total expectation law and Markov property. This derivation parallels that of Chapman–Kolmogorov equation (3.10). Identity (3.13) requires more care as it does not hold for all choices of function h. Using equation (2.12), detailed balance condition (3.13) holds for h2=Rw. However, equation (2.3) suggests that we only can guarantee the detailed balanced condition (3.13) for Embedded Image when Embedded Image if and only if Embedded Image.

(d) Prior expectations of mapping summaries

In many applications of stochastic mapping, one wishes to compare the prior to posterior expectations of summaries (Nielsen 2002). In this section, we derive the formulae necessary for computing prior expectations. Similar to our derivation of posterior expectations, we begin by considering an arbitrary branch b* and use the law of total expectation to arrive atEmbedded Image(3.14)where Embedded Image is the marginal local expectation of the mapping summary.

Identity Embedded Image allows us to eliminate summation over some internal node states in formula (3.14) and consider only those internal nodes that lie on the path connecting the root of τ and c(b*). If π is the stationary distribution of {Xt}, then formula (3.14) together with identities Embedded Image and Embedded Image simplifies prior local expectations even further,Embedded Image(3.15)

The fact that prior local expectations at stationarity compute with virtually no additional burden has immediate practical implications. In the context of the posterior predictive model checking, researchers often need to simulate L independent and identically distributed (iid) realizations Embedded Image of data at the tips of τ and then calculate the summary-based discrepancy measure Embedded Image (Nielsen 2002). Since we know the ‘true’ model under simulation, we can approximate this discrepancy measure with the prior expectationEmbedded Image(3.16)where D* ranges over all possible trait values at the tips of τ. In molecular evolution applications, L is of the order of 103–105, and hence approximation (3.16) should work well.

(e) Comparison with simulation-based stochastic mapping

We comment earlier that the Monte Carlo algorithm for stochastic mapping is an alternative and a very popular way to compute expectations of mapping summaries. This algorithm consists of two major steps. In the first step, the tree is traversed once to compute Fu for each node. In the second step, internal node states i are simulated conditional on D (Pagel 1999). Then, conditional on i, one simulates CTMC trajectories on each branch of the tree and computes summaries of interest. This second step is repeated N times producing a Monte Carlo sample of mapping summaries whose averages approximate the branch-specific expectations.

The running time of the algorithm depends on N and the computational efficiency of generating CTMC trajectories. The number of samples N, required for accurate estimation, varies with Q and T. Unfortunately, this aspect of stochastic mapping is largely ignored in the literature and by practitioners. Although more than one way exists to simulate trajectories conditional on starting and ending states (Rodrigue et al. 2008), rejection sampling is the most common (Nielsen 2002). Assessing the efficiency of rejection sampling is complicated, because the efficiency depends not only on the choice of CTMC parameters, but also on the observed data patterns D. We illustrate these difficulties using a CTMC on a state space of 64 codon triplets. We take two sites from an alignment of 129 HIV sequences that we discuss later in one of our examples. We call the first site slow evolving as it obtains only three different codons. The second, fast evolving site emits nine different codons. We take one-parameter slice of our Markov chain Monte Carlo (MCMC) sample and run rejection sampling to estimate the expected number of mutations for all branches of τ. For each trial, we record the total number of rejections on all branches of τ and the absolute errors of the estimated mean number of synonymous mutations summed over all branches. We summarize the results of this experiment in table 1. We see a 400× increase in the number of rejections required to simulate trajectories conditional on the fast site compared to equivalent simulations based on the slow site. The Monte Carlo error decreases as the number of Monte Carlo iterations increases, but not as fast as one would hope.

View this table:
Table 1

Efficiency and accuracy of stochastic mapping. (For each number of iterations, we report the median number of rejected CTMC trajectories over the entire phylogenetic tree per iteration and the sum of absolute errors (SAE) of simulation-based estimates of the mean number of synonymous mutations along branches of the phylogenetic tree.)

In summary, simulation-based stochastic mapping requires simulation of CTMC trajectories; this is not a trivial computational task. Assessing accuracy of methods is cumbersome and difficult to automate. Our algorithm 1 replaces both simulation components from stochastic mapping calculations and therefore should be a preferred way of calculating expectations of mapping summaries.

4. Examples

(a) Detecting coevolution via dwelling times

In this section, we reformulate a previously developed simulation-based method for detection of correlated evolution (Huelsenbeck et al. 2003) in terms of a Markov reward process. We consider two primate evolutionary traits, oestrus advertisement (EA) and multimale mating system (MS), analysed by Pagel & Meade (2006). These authors first use cytochrome b molecular sequences to estimate the posterior distribution of the phylogenetic relationship among 60 Old World monkeys and ape species. Using 500 MCMC samples from the posterior distribution of phylogenetic trees, Pagel and Meade run another MCMC chain, this time with a reversible jump component that explores a number of CTMC models involving EA and MS traits and assess the models' posterior probabilities to learn about EA/MS coevolution. The authors find support in favour of a hypothesis stating that EA presence correlates with MS presence. The trait data are shown in figure 2 together with a phylogenetic tree, randomly chosen from the posterior sample. While this is not the case for Pagel & Meade (2006), reversible jump MCMC for model selection can be difficult to implement, especially as the number of trait states grows. Methods that simply require data fitting under the null model are warranted. Consequentially, we revisit this dataset and apply posterior predictive model diagnostics to test the hypothesis of independent evolution between EA and MS traits.

Figure 2

Primate trait data. We plot a phylogenetic tree, randomly chosen from the posterior sample, of 60 primate species. Branches of the sampled tree are not drawn to scale, nor is the tree ultrametric. Taxa names and trait values (‘0’, absence; ‘1’, presence; ‘−’, missing) for oestrus advertisement (EA) and multimale mating system (MS) are depicted at the tips of the tree.

Our null model assumes that EA and MS evolve independently as 2 two-state Markov chains Embedded Image, Embedded Image, where 0 and 1, respectively, stand for trait absence and presence. Let the infinitesimal generators of the EA and MS CTMCs beEmbedded Image(4.1)We form a product Markov chain Embedded Image on the state space Embedded Image that keeps track of presence/absence of the two traits simultaneously assuming that they evolve independently. The generator of the product chain is obtained via the Kronecker sum (⊕),Embedded Image(4.2)The Kronecker sum representation extends to general finite state-space Markov chains and to an arbitrary number of independently evolving traits (Neuts 1995). Computationally, this representation is advantageous, because eigenvalues and eigenvectors of a potentially high-dimensional product chain generator derive analytically from eigenvalues and eigenvectors of low-dimensional individual generators (Laub 2004).

To test the independent evolution model fit via posterior predictive diagnostics, we need a discrepancy measure (Meng 1994). Following Huelsenbeck et al. (2003), we employ mean dwelling times to form a discrepancy measure. Let Embedded Image be the tree length of τ. We define the mean dwelling times Embedded Image and Embedded Image of traits Embedded Image and Embedded Image in state i, and the mean dwelling time Zij of the product chain Yt in state (i, j) for i, j=0, 1. More formally, we setEmbedded Image(4.3)where Embedded Image, Embedded Image, Embedded Image, Embedded Image, Embedded Image, Embedded Image and D(1), D(2) are observations of the two traits on the tips of τ.

Using the dwelling times, we define ‘expected’ ϵij and ‘observed’ ηij fractions of time the two traits spend in states i and j,Embedded Image(4.4)We use quotation marks because both quantities are not observed. To quantify the deviation from the null hypothesis of independence seen in the data, we introduce the discrepancy measure,Embedded Image(4.5)This measure returns small values under the null and implicitly depends on τ and branch lengths T. We account for this dependence and phylogenetic uncertainty by averaging our results over a finite sample from the posterior distribution of τ and T, obtained from the molecular sequence data.

We use the software package BayesTraits to accomplish this averaging and to produce a MCMC sample from the posterior distribution of QEA and QMS, assuming the null model of independent evolution of the two traits (Pagel et al. 2004). Each iteration sample from the output of BayesTraits consists of Embedded Image drawn from their posterior distribution. Given these model parameters, we generate a new dataset Drep and compute the observed Embedded Image and predicted Embedded Image discrepancies for each iteration. We then compare their marginal distributions by plotting their corresponding histograms (figure 3). In figure 3, we also plot the observed against predicted discrepancies to display the correlation between these two random variables. The apparent disagreement between observed and predicted discrepancies is a manifestation of poor fit of the independent model of evolution. The observed discrepancy consistently exceeds the predicted quantity. To illustrate the performance of predictive diagnostics when the independent model fits data well, we simulate trait data under this model along one of the 500 a posteriori supported phylogenetic trees. Figure 3c,d depicts the result of the posterior model diagnostics applied to the simulated data. In contrast to the observed primate data, the simulated data do not exhibit disagreement between the observed and predicted discrepancies.

Figure 3

Testing coevolution. (a,c) Plots depicting observed (white bars) and predicted (grey bars) distributions of the discrepancy measure for the (a,b) primate and (c,d) simulated data. (b,d) The scatter plots of these distributions.

Disagreement between the observed and predicted discrepancies can be quantified using a tail probability, called a posterior predictive p value,Embedded Image(4.6)where the tail probability is taken over the posterior distribution of the independent model. In practice, given N MCMC iterations, one estimates posterior predictive p values viaEmbedded Image(4.7)where Embedded Image are the parameter values realized at iteration g, and Drep.g is a dataset simulated using these parameter values. Following this recipe, we estimate ppp for the primate data and the artificial data. The disagreement between the ‘observed’ and ‘predicted’ discrepancies is reflected in a low ppp=0.0128. By contrast, the ppp=0.3139, for the simulated data, supports agreement between the observed and predicted distributions of Δ.

(b) Mapping synonymous and non-synonymous mutations

In this section, we consider the important task of mapping synonymous and non-synonymous mutations onto branches of a phylogenetic tree. Our point of departure is a recent ambitious analysis of HIV intrahost evolution by Lemey et al. (2007), who use sequence data originally reported by Shankarappa et al. (1999). The authors attempt to estimate branch-specific synonymous and non-synonymous mutation rates and then project these measurements onto a time axis. This projection enables one to relate the time evolution of selection processes with clinical covariates. Lemey et al. (2007) find fitting codon models computationally prohibitive in this case. Instead, they first fit a DNA model to the intrahost HIV sequences, obtain a posterior sample of phylogenies with branch lengths, and then use these phylogenies to fit a codon model to the same DNA sequence alignment. Instead of fitting two different models to the data, we propose to use just a DNA model and exploit mapping summaries to predict synonymous and non-synonymous mutation rates.

Suppose we observe a multiple DNA sequence alignment Embedded Image of a protein-coding region with L sites and that all C=L/3 codons are aligned to each other such that the coding region starts at site 1 of D (in other words, there is no frameshift). Following the codon partitioning model of Yang (1996), we assume that sites corresponding to all first codon positions, Embedded Image, evolve according to a standard HKY model with generator Embedded Image, where κ1 is the transition–transversion ratio and π1 is the stationary distribution, both just for the first codon position appropriately constrained. Similarly, we define CTMC generators Embedded Image and Embedded Image for the other two codon positions with independent parameters. Assuming that all L nucleotide sites in D evolve independently together with the three codon position HKY models induces a product Markov chain model on the space of codons Embedded Image, where codons are arranged in lexicographic order with respect to our nucleotide order Embedded Image, the generator of this product CTMC isEmbedded Image(4.8)

With this Markov chain on the codon space, we define a labelling Embedded Image(s) that contains all possible pairs of codons that translate into the same amino acid. All other codon pairs are collected into a labelling set Embedded Image(n). Clearly, transitions between elements of Embedded Image(s) constitute synonymous mutations, and non-synonymous mutations are represented by transitions between elements of Embedded Image(n). In this manner, counting processes map synonymous and non-synonymous mutations onto specific branches of τ. We consider HIV sequences from patient 1 of Shankarappa et al. (1999) and approximate the posterior distribution of our DNA model parameters Embedded Image using MCMC sampling implemented in the software package BEAST (Drummond & Rambaut 2007). The serially sampled HIV sequences permit us to estimate the branch lengths T in units of clock time, months in this case. For each saved MCMC sample, we compute branch-specific rates of synonymous and non-synonymous mutations,Embedded Image(4.9)where we denote data at codon site c by Embedded Image. We also record the fraction Embedded Image of non-synonymous mutations. Similar to Lemey et al. (2007), we summarize these measurements by projecting them on the time axis. To this end, we form a finite time grid and produce a density profile of the synonymous and non-synonymous rates, and of the non-synonymous mutation fractions for each time interval between grid points (figure 4). Both synonymous and non-synonymous rate density profiles are consistently bimodal across time. Interestingly, the modes also stay appreciably constant. The density profile of the non-synonymous mutation fractions is multimodal and fairly complex. There are a considerable number of branches that exhibit strong negative Embedded Image and positive Embedded Image selection. In general, when comparing branches that occur earlier in the tree to those that occur later, the non-synonymous mutation fraction has first a modest upward trend through time and then descends to lower values, consistent with other patterns of evolutionary diversity reported by Shankarappa et al. (1999).

Figure 4

Time evolution of synonymous and non-synonymous rates. (a) A representative phylogeny of 129 intrahost HIV sequences. The three heat maps depict the marginal posterior densities of the (b) synonymous and (c) non-synonymous rates, and (d) the proportion of non-synonymous mutations over time.

Intrigued by the multimodality observed in figure 4, we investigate this issue further. Lemey et al. (2007) consider several branch categories in their analysis, e.g. internal and terminal. We decide to test whether differences exist between selection forces acting on internal and terminal branches of τ. We define the fractions of non-synonymous mutations on internal and terminal branches asEmbedded Image(4.10)andEmbedded Image(4.11)We plot their posterior histograms in figure 5. These histograms do not overlap, suggesting different fractions of non-synonymous mutations for internal and external branches. To test this hypothesis more formally, we form a discrepancy measureEmbedded Image(4.12)As in our previous example, we compare the observed discrepancy Embedded Image with the expected discrepancy Embedded Image, where Drep is a multiple sequence alignment simulated under the codon partitioning model with parameters Qcodon, τ and T. Evoking approximation (3.16) and recalling that our model assumes the same substitution rates for each branch of τ, we deduce thatEmbedded Image(4.13)for all parameter values Qcodon, T and τ and replicated data Drep. Plugging in our new discrepancy measures into equation (4.7), we find that ppp<0.001. Therefore, our posterior predictive test suggests that there is a significant heterogeneity in selective pressure among the branches of τ.

Figure 5

Bimodality of the fraction of non-synonymous mutations. We plot the predicted fraction of non-synonymous mutations computed for terminal (grey bars) and internal (white bars) branches.

5. Discussion

In this paper, we develop a computationally efficient framework for mapping evolutionary trajectories onto phylogenies. Although we aim to keep this mathematical framework fairly general, our main interest with evolutionary mappings lies in computing the mean number of labelled trait transitions and the mean evolutionary reward that depends linearly on the time a trait occupies each of its states. These two mapping summaries have been the most promising building blocks for constructing statistical tests.

We build upon our earlier work involving single branch calculations for Markov-induced counting processes (Minin & Suchard 2008). In our extension, we introduce single branch calculations for evolutionary reward processes and devise algorithms to extend single branch calculations to mapping expectations of counting and reward processes onto branches across an entire phylogeny. Our main result generalizes Felsenstein's pruning algorithm that forms the workhorse of modern phylogenetic computation. The generalized pruning algorithm warrants two comments about its efficiency for performing simulation-free stochastic mapping. First, when the infinitesimal rates are unknown, a traditionally slow component of phylogenetic inference is the eigen decomposition of their matrix. Fortunately, this decomposition finds immediate reuse in our algorithm to calculate posterior expectations of mappings. Second, the algorithm requires only two traversals of the phylogenetic tree, and is therefore at most two times slower than the standard likelihood evaluation algorithm. In practice, we find our algorithm approximately 1.5 times slower than the likelihood calculation. We achieve this advantage because during the second traversal n terminal branches are not visited. Finally, the Felsenstein's algorithm analogy yields a pulley principle for stochastic mapping and reduction in computation for prior expectations.

Our examples demonstrate how our novel algorithm facilitates phylogenetic exploratory analysis and hypothesis testing. First, we use simulation-free stochastic mapping of occupancy times to re-implement Huelsenbeck et al.'s (2003) posterior predictive test of independent evolution. In our second example, we attempt to recover synonymous and non-synonymous mutation rates without resorting to codon models and instead use an independent codon partitioning model. We overcome this gross model misspecification with stochastic mapping, find intriguing multimodality of synonymous and non-synonymous rates, and use a posterior predictive model check to test differences in selection pressures between terminal and internal branches. We stress that our predictions are only as good as the model we use. For example, the terminal/internal branch differences may be due to a general bad fit of our purposely misspecified model to the intrahost HIV data. However, we find this scenario unlikely in light of the recent demonstration of the excellent performance of codon partitioning models during analyses of protein coding regions (Shapiro et al. 2006).

Our examples illustrate importance of hypothesis testing in statistical phylogenetics. In recent years, it has become clear that an evolutionary analysis almost never ends with tree estimation. Importantly, phylogenetic inference enables evolutionary biologists to tackle scientific hypotheses, appropriately accounting for ancestry-induced correlation in observed trait values (Huelsenbeck et al. 2000; Pagel & Lutzoni 2002). Several authors demonstrate that mapping evolutionary histories onto inferred phylogenies provides a convenient and probabilistically grounded basis for designing statistically rigorous tests of evolutionary hypotheses (Nielsen 2002; Huelsenbeck et al. 2003; Dimmic et al. 2005). Unfortunately, this important statistical technique has been hampered by the high computational cost of stochastic mapping. Our general mathematical framework and fast algorithms should secure a central place for stochastic mapping in the statistical toolbox of evolutionary biologists.

Acknowledgments

We would like to thank Philippe Lemey for sharing his HIV codon alignments. We are also grateful to Andrew Meade for his help on using the BayesTraits software and Ziheng Yang and two anonymous reviewers for their constructive criticism. M.A.S. is supported by an Alfred P. Sloan Research Fellowship, John Simon Guggenheim Memorial Fellowship and NIH R01 GM086887.

Footnotes

  • One contribution of 17 to a Discussion Meeting Issue ‘Statistical and computational challenges in molecular phylogenetics and evolution’.

References

View Abstract