## Abstract

We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second we allocate a Poisson distributed number of substitutions on the branches. The probability to place a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.

## 1. Introduction

Tree reconstruction is nowadays a matter of routinely applying the available programs, comparing the resulting trees and then concluding what might be the best tree (e.g. Hillis *et al.* 1996; Felsenstein 2004; Schmidt & von Haeseler 2008). With the advent of Baysian approaches, it is possible to model increasingly complex evolutionary scenarios (Huelsenbeck & Ronquist 2001). However, a detailed understanding about what can and cannot be inferred (Baake 1998; Mossel & Steel 2005) from a sequence alignment is still missing. In recent years, some theoretical insights were obtained by studying very simple models of sequence evolutions on binary sequence data (e.g. Erdös *et al.* 1999*a*,*b*). Although these models are not realistic in a biological sense, they have provided some profound insights in the reconstruction process *per se*.

Here, we will introduce another description of the evolutionary process on trees. More precisely, given a phylogenetic tree and an alignment that evolved along the tree, we now ask the following question: how does the alignment change if an additional substitution on an arbitrary branch of the tree takes place? This rather abstract question is motivated by the following biological problem. Consider a collection of morphological traits that are either in an ancestral (0) or derived (1) state. Each derived character state characterizes a monophyletic group and represents a cluster in the tree. For such a data matrix (or alignment), the tree reconstruction problem is easy. However, stochastic effects that act somewhere on the branches of the tree may disturb this signal. This noise is modelled by the assumption of throwing an arbitrary number of changes on the tree and measuring their impact on the otherwise perfect data matrix. To this end, we construct a one-step mutation (OSM) matrix. The matrix description allows a linear algebra view of evolution and comprises distance methods, maximum parsimony as well as maximum likelihood. Moreover, the description reveals a surprising connection to Hadamard matrices that were employed for phylogenetic questions (Hendy & Penny 1989; Steel *et al.* 1998). An immediate application of OSM matrices is the analytical computation of posterior probabilities that count the number of evolutionary changes on a tree. So far, these posterior probabilities have been estimated using Bayesian simulation (Nielsen 2002; Huelsenbeck *et al.* 2003) or by applying the theory of counting processes (Minin & Suchard 2008).

In the following we refrain from most technicalities but rather outline the general ideas by discussing illustrative examples. The technical details and proofs will appear elsewhere.

## 2. The Binary model on an *n*-taxon tree

### (a) Notation

We consider a set of *n* taxa *S*={1, …, *n*}. With *S* comes some information about the common properties and differences of the taxa, typically displayed in an alignment. In the following a (sequence) *alignment* is an *n*×*ℓ*-array with entries either 0 or 1, where *ℓ* is the length of the alignment. Each of the *ℓ* columns (*sites*) *a*_{j} of the alignment represents a *pattern* of *n* homologous characters, where *a*_{ij}∈{0,1} is the state of character *j* in taxon *i*. For binary character states, 2^{n} patterns are possible.

We are interested in the evolution of such patterns along a (rooted) *tree T*=(*V*, *E*) with node set *V* and branch set *E*⊂*V*×*V* (Semple & Steel 2003). The node set *V* contains the taxon set *S* that forms the leaf set. Avoiding the technical details, each branch is uniquely encoded by the subsets *X* of *S* that originates from the branch. Such a set *X* specified by a branch will be called a *cluster*. A leaf is a trivial cluster.

Finally, we introduce a function *λ*:*E*→_{+}, such that *λ*(*e*)>0 represents the *length* of a branch *e*∈*E*. The *tree-length Λ*_{T} is the sum of the branch lengths. The *relative branch length*(2.1)denotes the probability that a substitution hits branch *e* of the rooted tree T.

### (b) The effect of substitutions on an alignment

We now describe how a single mutation on the tree changes the current character states at the leaves. Obviously the outcome will depend on the branch where the substitution occurred. Moreover, each of the 2^{n} possible patterns will be affected differently by such a substitution. Therefore we introduce a 2^{n}×2^{n} matrix that describes the action of a substitution on the patterns for a specific branch.

Figure 1 describes the model on an example tree T_{4} with four taxa. For instance, a substitution at the branch defined by cluster {A} changes the pattern 1011 to the pattern 1010 because only the character of taxon A is affected. Please note that order of taxa is (D, C, B, A) for each pattern. All possible changes between the patterns identified by a substitution on branch *e*_{A} are depicted in the substitution graph (figure 1*b*). The corresponding adjacency matrix *σ*_{A} is displayed in figure 1*c*, where a black square stands for one (the patterns are connected by an edge in the substitution graph) and a white square represents zero. The structure of matrix *σ*_{A} constitutes an example of the so-called *permutation matrices* with entries equal to one if the substitution converts one pattern into another (Bona 2004, p. 75).

For each branch we easily construct the corresponding permutation matrix. Without proof we state that each matrix is fix point free (a substitution changes every pattern) and has 2^{n−1} transpositions (applying a substitution twice returns the original pattern). Then it follows that this type of permutation matrix is self-inverse with respect to matrix multiplication and that they form an Abelian group if the identity matrix is included.

We point out that the permutation matrix for a non-trivial cluster is the product of the permutation matrices of its elements. In other words the action of one mutation on a branch *e* can be replaced by any partition of the cluster associated with *e*, such that each set of the partition is represented by a branch in the tree. For tree T_{4} we obtain six permutation matrices *σ*_{A}, *σ*_{B}, *σ*_{C}, *σ*_{D} and *σ*_{AB}=*σ*_{A}.*σ*_{B}, *σ*_{CD}=*σ*_{C}.*σ*_{D}.

Because we study symmetric permutation matrices, it is well known that the eigenvalues are either 1 or −1. Further, we observe that all those permutation matrices are diagonalized by the 2^{n}×2^{n}-dimensional Hadamard matrix (Hendy & Penny 1989; Hendy 2005). The 2×2 Hadamard matrix is defined as

The 2^{n}×2^{n} Hadamard matrix is thenwhere the symbol ⊗ represents the Kronecker product. Hence, for the cluster (*e*) associated with branch *e*,is the diagonal matrix of eigenvalues of .

To take the relative contribution of the branch lengths into account, we weight each permutation matrix with *p*_{e} as described in (2.1). Such matrices are a special case of the *generalized* permutation matrices. Then the so-called OSM matrix of the tree T_{4} is simply the following convex sum:Figure 2*c* shows the result of this computation. The substitution graph in figure 2*b* displays the effect of a substitution on all the branches of the tree on the patterns. Two patterns are connected by an edge if a substitution switches between the two patterns.

For an arbitrary phylogenetic tree T on *n* taxa, the OSM matrix is obtained by(2.2)where (*e*) is the cluster identified by branch *e*∈*E*. Owing to its construction, *M*_{T} is also diagonalized by the Hadamard matrix.

The entry *M*_{T}(** i**,

**) is positive if the tree T contains a cluster where a substitution on the corresponding branch implies that pattern**

*j***is changed to**

*i***. Hence, each row and each column has 2**

*j**n*−2 non-zero entries, one entry for each branch in the tree. Thus, the OSM matrix belongs to the class of doubly stochastic Markov transition matrices, where the relative branch lengths are represented exactly once in each row and each column. Consequently, the

*k*-th power provides the probabilities to move from one pattern to another in

*k*substitutions. Thus, the repeated application of

*M*

_{T}describes a

*random walk*on the state space of the 2

^{n}patterns. This random walk is very different from the standard random walk on the hypercube (Eigen

*et al.*1989). If

*k*is large, then will lose the phylogenetic information of the original alignment and will approximate the uniform distribution, that is each pattern occurs at the same frequency.

### (c) Poisson weights

Our setting does not yet assume a probability distribution for the number of substitutions on the tree. In molecular evolution, one typically assumes that the number of substitutions is Poisson distributed with parameter *Λ*_{T} (Uzzell & Corbin 1971). Under this assumption we can compute the average OSM matrix by(2.3)which is equivalent toThe exponential of the matrix *Λ*_{T}*M*_{T} is easy to compute, because *M*_{T} is a sum of generalized permutation matrices (2.2), which commute with respect to matrix multiplication. Thus, we obtain(2.4)

## 3. Relation to tree reconstruction

The OSM matrix leads to a very general description of character-based phylogenetic inference techniques. Moreover, the explicit model assumptions in maximum likelihood and the implicit assumptions in maximum parsimony are directly comparable.

The OSM matrix and its powers describe the substitution process between arbitrary patterns. However, in classical phylogeny the starting point of a substitution process is ancestral states on trees. In particular, one assumes a stationary distribution *π*=(*π*_{0}, *π*_{1}) of character states at the root, and the characters evolve along the tree according to a Markov transition matrix (Tavaré 1986). In our framework, this is equivalent to starting in the *constant* patterns **0**=(0, …, 0) or **1**=(1, …, 1) and letting it evolve according to the OSM matrix. This process has a non-stationary pattern distribution which starts at , i.e. with zero substitutions only constant patterns exist, and in each step the pattern distribution is given by . If the number of substitutions is not weighted as in equation (2.3), then will approach the uniform distribution as *k* goes to infinity. To overcome the loss of phylogenetic signal, we assume in the following that the number of substitutions on a tree is Poisson distributed with parameter *Λ*_{T}. Moreover, we assume that the substitution process is described by the symmetric Cavender–Farris–Neyman mutation model (CFN, Neyman 1971; Farris 1973; Cavender 1978). Under these assumptions, the probability of observing pattern ** a** when starting in a constant pattern is then calculated employing (2.3)(3.1)where

*π*

_{0}and

*π*

_{1}are taken from the stationary distribution of character states. The resulting probability distribution for all possible patterns is then identical to the standard way of computing the probabilities of pattern (Felsenstein 2004).

### (a) Distance approaches

Now, we briefly illustrate how to derive distance corrections from the OSM matrix. To this end, we consider the rooted tree with two leaves *A, B* and branch lengths *λ*_{A} and *λ*_{B}. Then the corresponding OSM matrix *M*_{2} has the following structure:where *p*_{A} and *p*_{B} are computed according to (2.1). From (2.4) it is straightforward to compute using (*).

The first and the last row of yield the probabilities to observe one of the four patterns. If evolution starts with character state 0 or 1 at the root of the tree and the character states are in equilibrium (*π*_{0}=*π*_{1}), then we quickly computeas the probability to observe a constant pattern in an alignment. Similarly we compute the probability to observe different character states between taxa A and B. From this it is straightforward to get the distance correction of the CFN model.

### (b) Maximum likelihood

The maximum-likelihood principle for an alignment and a tree T is easily formulated in terms of the OSM matrix. We introduce as parameter vector ** θ** the branch lengths of T. Then the probability of is given by(3.2)where the factors on the right-hand side are defined by equation (3.1). The parameter vector

**enters the equation via the OSM and in the obvious way. As usual, we want to find parameter assignments such that (3.2) is maximized.**

*θ*### (c) Maximum parsimony

We associate the adjacency matrix *A*_{OSM} (e.g. Cormen *et al.* 2001, §22.1), or simply ** A**, with the OSM matrix.

**is obtained as the unweighted sum of the permutation matrices . Hence, an entry**

*A*

*A*_{ij}is equal to one when there is a branch in the tree which changes pattern

*i*into pattern

*j*, and is zero otherwise. Finally, we note that

*A*^{k}(

*i*,

*j*) describes the number of paths of length

*k*between pattern

*i*and

*j*. Each path specifies a series of branches in the tree where a substitution occurred.

Now, fix a column *a*_{i} in alignment , and a tree T. We ask for the minimal number *k*_{min} such that or is greater than zero. In other words, for an alignment column *a*_{i}, the minimal number of mutations on T equalsThus the minimal number of mutations for an alignment equals(3.3)

This is another description of the maximum-parsimony principle.

## 4. Mapping substitutions

From the computation of the powers of the OSM matrix, it is possible to derive the (posterior) probability distribution ppdf(*k*|** x**) of the number of mutations that generated an observed pattern

**, when the process started in pattern**

*x***0**or

**1**. The posterior probabilities have been estimated before, employing Bayesian simulation methods (Nielsen 2002; Huelsenbeck

*et al.*2003; Minin & Suchard 2008), but an analytic approach has not previously been attempted.

In general, the posterior probabilities ppdf(*k*|** a**) for a pattern

**are calculated in the following way using (3.1):i.e. we compute for pattern**

*a***the proportion of its occurrence after**

*a**k*substitutions.

For the two-taxon case, we observe thatThus, applying hyperbolic identities, we end up withTaylor expansion of cosh(*x*) around *x*=0 then leads to the posterior probability

For the remaining patterns, we obtain the following ppdf:

Thus, on a two-taxon tree, an even number of substitutions leads to a constant pattern when starting in a constant pattern; whereas an odd number of substitutions will be reflected by the non-constant pattern 01 or 10.

Now, it is an easy calculation to obtain closed formulae for the posterior mean number *μ*(** a**) of substitutions for pattern

**by the following calculations:Only if**

*a**Λ*is large, then the posterior mean number of substitutions will approach the expected number of substitutions per site

*Λ*. For a constant pattern the posterior mean is always smaller than

*Λ*, and for non-constant patterns the posterior mean is larger than

*Λ*.

Similarly we extend the calculations to a four-taxon tree. For instance, consider the four-taxon tree T_{4} (figure 1*a*). This tree has two non-trivial clusters {A, B} and {C, D}. We want to compute the posterior probability of the number of substitutions if the constant pattern 0000 is observed. Let us assume that the two character states occur with uniform probability; then we can compute:where is the sum of the lengths of the interior branches. Now Taylor expansion leads to the desired posterior probability distribution.

Figure 3 shows the resulting posterior probability distributions for the 16 possible patterns, assuming branch lengths and . The symmetries in the CFN model are reflected in the symmetries of the posterior distributions. Complementary patterns (i.e. 0000 and 1111) show the same distribution. Because the tree is clock-like, the parsimonious uninformative patterns (0001, 0010, 0100, 1000) and their complements show identical distributions, as do the patterns that need at least two substitutions (0101, 0110, 1010, 1001) on T_{4}. Posterior probabilities may be used to compute, for instance, the number of unvaried sites (Fitch & Ayala 1994), which is exactly the proportion of the constant patterns with zero substitutions. In our example we expect approximately 42 per cent constant patterns of which approximately 90 per cent are unvaried. This is only one application for posterior probabilities of the number of substitutions.

As in the two taxon case, we compute the posterior mean of substitutions for pattern ** a** asFigure 4

*a*shows the posterior mean number of substitutions for the topology of T

_{4}with branch probabilities and for a constant pattern (0000), a pattern compatible with an interior branch (0011) and a pattern incompatible with the tree (0110). The difference between posterior mean and tree lengths is smaller than 0.01 if the tree lengths exceed 10 substitutions per site.

Figure 4*b* displays the posterior means for a tree with branch probabilities , and *p*_{AB}=*p*_{CD}=0.01. The proportion of *p*_{A} and *p*_{D} is so large that the incompatible pattern 0110 will be observed more often than the pattern 0011, which is compatible with a branch of the tree. Thus, this tree is an instance where maximum parsimony will reconstruct the wrong tree (Felsenstein 1978). The figure also shows that the compatible pattern 0011 has a lower posterior mean number of substitutions than 0110 for short tree lengths. However, if the tree length exceeds 1.64 substitutions per site, then the situation is reversed. The posterior mean of the incompatible pattern quickly approaches the tree length, whereas the mean posterior substitutions of the compatible pattern are only close to the tree length if *Λ*≥54 substitutions per site. In other words, if we observe a compatible pattern, then this pattern has typically experienced more substitutions than the incompatible pattern.

## 5. Summary and discussion

Here, we have presented an alternative description of how to model sequence evolution on a tree. Our approach lifts the commonly used stochastic models of sequence evolution that act on nucleotides to the set of all possible patterns for *n* taxa.

We have shown that available tree reconstruction principles are included in our description of the process. Moreover, the definition of the OSM matrix leads to analytical formulae to compute the posterior probability distribution of the number of substitutions for each pattern. From this distribution it is then straightforward to compute the posterior mean of the substitutions. The formulation of the substitution process as an OSM matrix leads to the introduction of the Hadamard matrix that allows an easy computation of matrix powers. Recently, Bryant (submitted) has presented a continuous version of the OSM approach and showed its connection to the Hadamard matrix (Hendy 2005).

While we have outlined only the simplest model of sequence evolution, several extensions are easily possible. The OSM approach can be augmented to the Kimura 3st model (Kimura 1981); see figure 5*a* for an illustration. In this framework every substitution class (transition, transversion 1 and transversion 2) uniquely generates a fix-point free 4^{n}×4^{n}-dimensional permutation matrix for each branch in a tree. Let *α*_{1}+*α*_{2}+*α*_{3}=1 denote the probabilities for the three substitution classes, then the OSM matrix for the Kimura 3st model is defined as(5.1)i.e. we look at the sum of generalized permutation matrices. Figure 5*a* shows an OSM Kimura matrix on a rooted triplet tree. Each row and each column contain 12=(number of branches)×(number of substitution classes) non-zero entries, where each entry is the product of a mutation class parameter *α*_{i} and a branch probability *p*_{e} (equation (5.1)). All the results for binary character state models can be expanded to the Kimura 3st model. If one wants to abandon the assumption that evolution proceeds along a tree, then this is also possible within the OSM framework. Consider a set of rooted trees that give rise to a collection of possibly conflicting clusters. The associated OSM matrix is then given bywhere is the normalized sum of branch lengths of those trees in which the branch depicting is existent. Here issues such as the meaning of the overall length of the cluster set or the meaning of a root in such sets need to be discussed. This extension bears some similarity to a maximum-likelihood reconstruction of networks (von Haeseler & Churchill 1993).

Another question concerns the Poisson weights for the number substitutions. Generally, the argument is that the process of distributing substitutions along a tree is memoryless and therefore the number of substitutions is Poisson distributed. Our framework permits a different probability distribution to be assigned to the substitution process. One possible weighting scheme could be a contagious distribution, which has been used to evaluate accident data (Kemp 1967). This approach might provide an alternative description of the evolutionary history of an alignment.

In summary, the OSM description offers a variety of potential applications in molecular systematics, which will be explored in the near future.

## Acknowledgements

We are grateful to a number of colleagues who helped to increase our understanding of the model by asking the small questions necessary to get that kind of understanding. Helpful comments by two anonymous reviewers are gratefully acknowledged. Support by the Wiener Wissenschafts-, Forschungs- und Technologiefonds (WWTF) is kindly acknowledged. T.G. and A.v.H. are also supported by the Austrian Ministry for Science and Research (GEN-AU Bioinformatics Integration Network II).

## Footnotes

One contribution of 17 to a Discussion Meeting Issue ‘Statistical and computational challenges in molecular phylogenetics and evolution’.

- © 2008 The Royal Society