## Abstract

Different types of nucleotide substitutions experience different patterns of rate change over time. We propose clustering context-dependent (or context-independent) nucleotide substitution types according to how their rates change and then using the grouping for divergence time estimation. With our models, relative rates among types that are in the same group are fixed, whereas absolute rates of the types within a group change over time according to a shared relaxed molecular clock. We illustrate our procedure by analysing a 0.15 Mb intergenic region to infer divergence times relating eight primates. The different groupings of substitution types that we explore have little effect on the posterior means of divergence times, but the widths of the credibility intervals decrease as the number of groups increases.

This article is part of the themed issue ‘Dating species divergences using rocks and clocks’.

## 1. Introduction

Divergence time estimation methods that accommodate rate variation over time are diverse [1–4] and are now referred to as relaxed clock procedures. More attention has been devoted to rate variation over time than to the fact that not all kinds of nucleotide substitutions experience the same pattern of rate variation. Most notably, the rate in mammals at which the CG dinucleotide changes to TG (conventionally referred to as CpG transitions) is high and, when compared with other substitution types, relatively constant [5–7]. The approximately clock-like behaviour of CpG transitions is hypothesized to stem from a spontaneous mutation mechanism that can operate during the majority of the mammalian lifecycle [6]. Presumably because of this mechanism, CpG transitions accumulate in proportion to the amount of elapsed chronological time. In contrast, other types of point mutations might be more strongly associated with meiosis and therefore be likely to occur in a relatively small period during each generation. Even if the rate per generation of these meiotic-dependent mutations is constant, the mutation rate per year will change as generation length changes.

To reflect variation among substitution types in how rates change, one possibility is each kind of nucleotide substitution having its own independent relaxed clock. A shortcoming of this is that some kinds of mutation will share mechanisms or environmental influences and therefore tend to have highly correlated patterns of rate change. In an earlier work [8], we analysed primate sequence data and allowed each of nine substitution types to have its own independent relaxed clock. We concluded that some of the credibility intervals for divergence times were overly narrow, possibly because of failure to account for correlated changes in evolutionary rate among substitution types.

Here, we modify our divergence time procedure in the hope that a more realistic treatment of relaxed clocks will emerge. The modification assumes that each type of substitution belongs to a group. A group may consist of only a single substitution type or it may encompass many. Substitution types within a group all change rate in the same way. Rates of substitution per year can vary over time, and rates of substitution types within the same group can differ, but relative rates of substitution for types belonging to the same group are assumed to be invariant. This means that the number of independent relaxed clocks is equal to the number of groups rather than the number of substitution types. Our task is to assign predefined substitution types to groups and then use a selected mapping of types to groups for divergence time estimation.

## 2. Methods

Consider *K* substitution types where each type *k* (*k* = 1, 2 , … , *K*) specifies the starting nucleotide, the context of the starting nucleotide and also the nucleotide that is introduced owing to substitution. For example, if CpG context is considered for a strand-symmetric model, then there are *K* = 9 substitution types (table 1). While strand symmetry is not necessary for our methods, we make this simplifying assumption about the substitution process here, because our prime motivation is to capture the effects of CpG context, and a CpG dinucleotide in one DNA strand is also a CpG dinucleotide in the complementary strand.

Let *g*(*k*) be the group number of substitution type *k* with where *N _{G}* is the number of groups and

*N*≤

_{G}*K*. The number of types in group

*g*will be

*S*. This means We cluster types so that each group has its own independent relaxed clock. Within a group, the relative rates of substitution types are fixed over time. Therefore, if the rate for the group changes by a factor of 2, then all substitution types within the group change by that factor.

_{g}With sequence data alone, the product of nucleotide substitution rate and time can be estimated, but rates and times cannot be disentangled. We refer to the product of time and rate of a type of nucleotide substitution on a branch as a ‘substitution length’ [8]. Rather than employing notation for the time duration of each branch, the confounding of rate and time makes it convenient to introduce the estimation of substitution lengths by (temporarily) treating the time duration of each branch to be 1, so that knowledge of the substitution length yields the substitution rate.

We write the substitution length of type *k* on branch *j* as where *c _{k}* is the relative rate of substitution type

*k*and is the substitution rate per site on branch

*j*for group

*g*(

*k*). We arbitrarily set the relative rate of one substitution type per group to 1, and thus, there are free parameters for relative rates. Suppose there are

*B*branches on a rooted phylogeny. With the most general case where the

*K*substitution rates evolve independently, there are

*K*×

*B*parameters for substitution lengths. If substitution types are clustered into

*N*<

_{G}*K*groups, then there are parameters for substitution lengths.

### (a) Augmented data likelihood

The complete history of which substitutions occurred at which times on a phylogeny is typically not directly observed, and instead, only the endpoints of evolution (i.e. the sequences at the tips of a phylogeny) are typically available. However, inferences about the substitution process are often more straightforward when the complete history is directly observed. We therefore introduce our approach for the situation where the history is directly observed. Later, we note that the observed sequence information at the tips (endpoints) of a phylogeny can be augmented to yield a complete history, and we explain our procedure for treating the uncertainty associated with this augmentation.

We start by considering a single branch *j* on the rooted phylogeny. Assume the full substitution history on branch *j* is known, and we are interested in the substitution length of type *k* on this branch. Motivated by evidence that frequencies of CpG sites have decreased over time in some evolutionary lineages [9], we do not assume stationarity of the substitution process. To represent all the parameters for branch *j* and the values of all relative rates *c _{k}*,

*μ**will be used.*

_{j}Each of the *K* = 9 substitution types in table 1 specifies a beginning nucleotide and context. Let *b*(*k*) be the beginning nucleotide and context of type *k*. For instance, *b*(1) is a non-CpG site with starting nucleotide G or C. The three possible *b*(*k*) if CpG context is considered are: non-CpG G or C; non-CpG A or T and CpG C or G. Let be the proportion of time site *i* has context *b*(*k*) on branch *j*. We will refer to the as dwell proportions. Let *n _{kij}* be the number of type

*k*changes at site

*i*on branch

*j*. With substitution counts denoted and the summed dwell proportions denoted the log-likelihood for an observed history on

*j*is 2.1 where

**n**

*,*

_{j}

*ϕ**and*

_{j}**s**

_{j}_{0}are respectively vectors of the substitution counts

*n*, the dwell proportions and the initial states

_{kj}*s*

_{ij}_{0}.

Suppose we are interested in the substitution length of a type *k** that is in group *g*(*k**). Let *G** be the set of substitution types that are in group *g*(*k**). By setting the first derivative of the relevant log-likelihood equal to zero, the maximum-likelihood estimates of are
2.2
Because *c _{k}* is constant among branches, we need to gather relevant information for

*c*from all branches to solve for its maximum-likelihood estimate. Because the root is assumed to provide no information about substitution lengths, the log-likelihood for the completely observed history is proportional to 2.3

_{k}*N*of the

_{G}*c*s will be constrained to 1 (i.e. one per group). For the other

_{k}*c*s in each group, the maximum-likelihood estimate is 2.4

_{k}### (b) Parameter estimation

#### (i) Unweighted procedure

For augmented data, equations (2.2) and (2.4) show that maximum-likelihood estimates can be obtained when the *c _{k}* are known and maximum-likelihood estimates can be obtained when the are known. This suggests parameter estimation via successive substitution. Starting with initial guesses of the

*c*, equation (2.2) can be applied to estimate the . Then, the estimates substituted into equation (2.4) generate new values of the

_{k}*c*. This cycle of successive substitution can be continued until the values of both kinds of parameters converge.

_{k}The sufficient statistics in equations (2.2) and (2.4) are the dwell times *ϕ _{b}*

_{(k)j}and the substitution counts

*n*. Typically, neither is directly observed. Lee

_{kj}*et al*. [8] estimated these sufficient statistics by sampling endpoint-conditioned histories using context-independent substitution models and assuming these histories well approximated a sample of endpoint-conditioned histories from a context-dependent model. Simulation suggested that this approach was reasonable for the primate data analysed by Lee

*et al*. [8]. If we assume that endpoint-conditioned context-independent histories are a good approximation of context-dependent histories, then the sufficient statistics

*ϕ*

_{b}_{(k)j}and

*n*can be approximated by their sample means among histories. These sample histories can then be used in successive substitution in equations (2.2) and (2.4) to yield parameter estimates. Alternatively, the sufficient statistics from an individual sequence history together with successive substitution can yield substitution length estimates based only on that history. Then, the sample mean among histories can be reported. This latter procedure also facilitates approximation of the variances and covariances of substitution length estimates [8] and is what we refer to here as the ‘unweighted procedure’.

_{kj}#### (ii) Importance sampling procedure

To account for the possibility that endpoint-conditioned context-independent histories do not well approximate context-dependent ones, an importance sampling procedure could be employed along with an expectation-maximization (EM) algorithm. First, we outline the EM algorithm.

Let **X** be the DNA sequences observed at the tips of a tree with known rooted topology and **Z** be the (unobserved) substitution history that specifies exactly which sequence changes occurred at which times on the phylogeny. The sequence at the root will also be specified by **Z**. Because **X** represents a subset of the information provided by **Z**, the complete data are (**X**, **Z**) = **Z**. Because the complete-data log-likelihood in equation (2.3) is in the form of an exponential family, an EM algorithm for parameter estimation could have the form

(1) E-step: estimate the sufficient statistics

*T*=*T*(**X**,**Z**) by where represents values of the parameters (*μ*) at the*r*th iteration and*T*^{(r)}represents estimated sufficient statistics at the*r*th iteration.(2) M-step: solve for new by plugging

*T*^{(r)}into equations (2.2) and (2.4) via successive substitution.

We do not have analytical formula to compute expectations of the sufficient statistics in the E-step. Instead, we can approximate the expectations by averaging over a large number of sampled histories. To reduce the computational cost in this Monte Carlo EM (MCEM) algorithm, we can recycle the same sampled histories for each MCEM iteration via importance sampling (see [10]) by drawing a sample of approximately independent histories from a context-independent substitution model. At each iteration of the EM algorithm, instead of obtaining a new sample of histories using the most recent estimates of parameter values (substitution lengths), we apply importance weights to histories in the original context-independent sample.

Let *Z*^{(c)} be the *c*th substitution history (of *C* total) sampled from *Pr*(**Z**|**X**, *μ**), where *μ** are the parameters from the context-independent model. Let *μ* be all parameters in the context-dependent model of interest. The EM algorithm iterates between approximation of the expected complete-data likelihood
2.5
and maximization of the over *μ*. With importance sampling, the E-step becomes
2.6
with To calculate *w*_{c}, the times of all branches are set to 1 for the probability densities in both the numerator and denominator. This means the rates on branches are adjusted in order to specify the substitution lengths in the numerator and the branch lengths in the denominator.

With this MCEM approach, the variance–covariance matrix of estimates can be approximated by approximating the inverse observed information. Although we omit the details here, this can be done by adapting the procedure of Wei & Tanner [11] to incorporate the *w*_{c} values.

An overview of the unweighted and importance sampling procedures for parameter estimation is given in figure 1.

### (c) Approximating Akaike information criterion in the presence of incomplete data

For a given grouping of substitution types, MLEs can be obtained with the MCEM algorithm that we have described. To select among the alternative groupings, we can try to find the one with the minimum Akaike information criterion (AIC)
2.7
where *d* is the number of parameters in the grouping [12]. We can approximate the observed data likelihood,
2.8
Suppose we estimate and under two different models, model 1 and model 2, using the MCEM algorithm. Then we can evaluate these models by approximating the AIC difference
2.9
where *d*_{1} and *d*_{2} are respectively the numbers of parameters in models 1 and 2.

## 3. Primate data and analysis details

Lee *et al*. [8] employ multidivtime software (see [13]) and the rate evolution model of Kishino *et al*. [3] to estimate divergence times. Lee *et al*. [8] give each substitution its own independent relaxed clock, whereas multidivtime was originally developed so that each gene in a multigene dataset could have its own independent relaxed clock. Our extension also employs multidivtime, but has each group rather than each gene or each substitution type with its own independent relaxed clock.

For the sake of comparison with Lee *et al*. [8], we analyse the same data as did that study. Specifically, we use an approximately 0.15 Mb subset of data from Kim *et al*. [6] that corresponds to an intergenic part of human chromosome 16 with CpG islands masked by Repeatmasker [14]. The data represent nine primates related by a well-established topology that was treated as known (figure 2). One of the nine primates (bushbaby) is the outgroup and was only used in our analyses to root the ingroup.

To sample a sequence history (i.e. a substitution history on the tree for each site in the sequence), the context-independent GTR substitution model [17,18] and the unrooted version of the tree topology in figure 2 was used. Maximum-likelihood estimates of the GTR rates and branch lengths were obtained by PAMLX software [19]. Fixing the GTR rates and branch lengths at these estimates, a slightly modified version of the PhyloBayes-MPI software [20] was employed to generate 400 independently sampled endpoint-conditioned sequence histories. The endpoint-conditioned procedure implemented in PhyloBayes-MPI is the stochastic mapping procedure of Nielsen [21] with the exception that uniformization for a sequence site is employed if many consecutive attempts of Nielsen's procedure fail to match the observed endpoints for the site (see [22]).

The PhyloBayes histories are sampled so as to include the history on the path connecting the ingroup root to the outgroup. Because multidivtime does not use them, we do not attempt to estimate substitution lengths on this path or to carefully model it. Even when accounting for context-dependence on the rest of the tree, we assume that this path to the outgroup evolves via the GTR model when applying the importance sampling procedure or the AIC approximation. This assumption simplifies importance sampling because it means that terms pertaining to the path to the outgroup are identical in the numerator and denominator of the ratio that defines *w*_{c} (see equation (2.6)) and therefore these terms cancel.

Here, all 400 endpoint-conditioned sequence histories were sampled with the GTR model, homogeneity of rates among sites, and fixed GTR and branch length parameters. Adding gamma-distributed rate heterogeneity and other context-independent improvements to the substitution model has the potential to make paths sampled by a context-independent model better approximate the endpoint-conditioned distribution from a context-dependent model. To examine whether incorporating enhancements to the context-independent model would impact divergence time estimates, we also applied the ‘unweighted procedure’ to obtain substitution lengths from the 312 context-independent sequence paths that were sampled by Lee *et al*. [8]. These 312 paths were generated from the primate dataset with the CAT–GTR model and a four-category discretized gamma treatment of rate heterogeneity (see [8]).

For the divergence time analyses, priors are selected to correspond to those of Lee *et al*. [8] when possible. We tightly constrain the age of the ingroup root (node 14 in figure 2) to have a gamma prior distribution with mean 44.2 million and standard deviation 0.1 million years. This mean age is the value suggested by the TimeTree database [15,16]. A more elaborate analysis would carefully incorporate all pertinent fossil evidence, but we instead constrain node 14, so that the results here can be more easily compared with those of Lee *et al*. [8]. Again following Lee *et al*. [8], each group of substitution types has a lognormal relaxed clock with an autocorrelation parameter *v* that has a gamma prior with mean and standard deviation both 0.023 (the inverse of the prior mean of the root time). Lee *et al*. [8] used a gamma distribution with mean and standard deviation both 0.0012 substitutions per million years as the prior distribution for the substitution rate at the ingroup root. We use the same prior for the values at the ingroup root. Each multidivtime analysis was run 2 × 10^{7} proposal cycles with 50% of iterations treated as burn-in, and each set of conditions was run twice in order to check convergence.

We decided to explore five of the possible groupings of the *K* = 9 substitution types that are listed in table 1. These are (i) each of the types in its own group (the ‘nine-groups’ grouping); (ii) all types in the same group (the ‘one-group’ grouping); (iii) the ‘four-groups’ grouping that separates transitions from transversion and also separates CpG and non-CpG sites, so that the four groups place the substitution types from table 1 into subsets {1,2,3,4}, {5,6}, {7,8} and {9}; (iv) the ‘three-groups’ grouping that differs from ‘four-groups’, because all transversions are in the same subset {1,2,3,4,7,8}; and (v) the ‘two-groups’ grouping that has CpG transitions (i.e. type 9) in one group and the eight other substitution types in the only other group. For each group in each of the five possible groupings that were explored, the substitution type corresponding to the lowest number in table 1 had its *c _{k}* value constrained to 1.

## 4. Results and discussion

Each of the 400 sequence histories can be separately used to infer substitution lengths. When doing this, we find little variation among estimates and also observe that the history that achieves the maximum of the 400 *w*_{c} values does not tend to yield estimates that are outliers (figure 3). While histories at individual sequence sites may suggest substantially different substitution lengths among the 400 sampled site histories, the primate sequence alignments are long, so that the mean effect per site in a sequence history is likely to vary little among sequence histories.

Whereas substitution length estimates reflect the average behaviour of sites in a sequence history, the *w*_{c} values reflect the cumulative behaviour of sites. The variance of an average of random variables decreases as sample size increases, but the variance of a product can increase as sample size increases. We think this cumulative effect is why we observe that one of the 400 *w*_{c} values tends to greatly exceed all others. For example, after the EM algorithm for the nine-groups grouping converged, the ratio of the second biggest of the 400 *w*_{c} values to the biggest was about 0.003.

As expected, reducing the alignment length to 10^{4} or 10^{3} columns resulted in less disparity between the maximum *w*_{c} value and the others. However, even with 10^{4} or 10^{3} columns, the maximum tended to be much bigger than all others. With the reduced alignment lengths, different regions achieved maximum *w*_{c} values with different of the 400 histories. These observations indicate that our approach of weighting sequence histories does not approximate well the distribution of histories that would be obtained if all were directly sampled with a context-dependent model via Markov chain Monte Carlo or some other algorithm.

While these *w*_{c} results demonstrate that differences between context-dependent and context-independent history samples are easy to detect, they do not demonstrate that substantive problems will arise when using a sample of context-independent histories to infer divergence times with a context-dependent model. Although the context-independent model for sampling histories was different in the earlier study, the simulations of Lee *et al*. [8] suggest the difference in distributions of histories between a context-independent and a context-dependent model is not a major issue with the primate dataset. Still, the effect of using context-independent histories for context-dependent divergence time inference will be dataset-dependent. A better and more general procedure is needed.

Because of the extreme *w*_{c} values, the importance sampling procedure for estimating divergence times was not pursued for the primate data. In addition, failure to obtain a more satisfactory distribution of *w*_{c} values means that the outlined AIC procedure for approximating AIC differences is unreliable. This finding is reminiscent of the instability of the harmonic mean estimator for approximating Bayes factors (see [23]). The outlined procedure may be effective when histories are sampled from a probability density that is closer to the models being evaluated. For example, the procedure might succeed when histories are sampled according to one grouping of substitution types and that grouping is being compared with another.

We inferred divergence times for the five different groupings (one-group, two-groups, three-groups, four-groups and nine-groups) with the unweighted procedure. Table 2 summarizes divergence time posteriors that are based on the sample of 400 sequence histories according to the GTR model with rate homogeneity among sites. In table 2, credibility intervals become narrower as the number of relaxed clocks (i.e. groups) increases. This coincides with the pattern discussed by Zhu *et al*. [24] about credibility intervals shrinking with multilocus studies as the number of relaxed clocks increases. Our results suggest that attention should be paid to how correlated are patterns of rate change over time among substitution types.

The *c _{k}* values and inferred substitution lengths are included as the electronic supplementary material. If we compare the ratios of estimated

*c*values for two substitution types that are in the same group, we find that these ratios are not very sensitive to how the other substitution types are grouped. For example, substitution types 1 and 2 are in the same group for our one-group, two-groups, three-groups and four-groups models. In each of these models, the ratio of the estimated

_{k}*c*for type 2 relative to type 1 is close to 1.4. However, we find that inferred substitution lengths on a branch are less robust to how other substitution types are grouped. As expected, substitution lengths for

_{k}*CpG*transitions are substantially larger than substitution lengths for other substitution types.

Results from the 312 histories of Lee *et al*. [8] that relied on a comparatively realistic context-independent model again show that widths of credibility intervals shrink as the number of independent relaxed clocks increase. While the posterior means of divergence times from the two sets of samples differ somewhat (table 2), it is difficult to draw a general conclusion from the inferred times about the effects of improving the context-independent substitution model used for sampling. Presumably owing to the rate heterogeneity employed when sampling the 312 histories, we do observe that substitution lengths inferred from the set of 312 histories are larger than those from the set of 400 histories.

While the focus here has been on grouping substitution types, so that types in each group share an independent relaxed clock, an alternative would have relaxed clocks vary among substitution types, but not necessarily be independent of one another. This sort of model is reminiscent of the work by Lartillot & Delsuc [25], who have been exploring rate change via multivariate Brownian processes on trees (see also [26]). Even more advanced models could have the correlation structure among rates of substitution types change over time.

Divergence time estimation is one reason why understanding and handling rate variation over time is warranted. To accurately determine divergence times when fossil records are sparse, a satisfactory treatment of molecular evolution is particularly desirable. Beyond divergence time inference, the process by which genomes change is of fundamental biological interest and so is characterizing how this process itself evolves. A better understanding of how different kinds of substitutions change rate over time will facilitate this characterization.

## Data accessibility

The software developed for this research is freely available at https://github.com/HuiJieLee. The aligned sequence data used in this study are available at Dryad, http://doi.org/10.5061/DRYAD.2TS15.

## Authors' contributions

H.-J.L. and J.L.T. designed the study and wrote the manuscript with substantial contributions from H.K. and N.R. H.-J.L. and N.R. wrote the computer programs. H.-J. L. performed the analysis with substantial contributions from J.L.T. All authors gave final approval for publication.

## Competing interests

The authors declare no competing interests.

## Funding

H.-J.L. and J.L.T. were supported by N.I.H. grants GM090201 and GM070806. J.L.T. was also supported by N.I.H. grant GM118508. H.K. was supported by the Japanese Society for the Promotion of Science Grant Number 25280006. N.R. was supported by the Natural Science and Engineering Research Council of Canada.

## Acknowledgements

We thank Dr Mario dos Reis and an anonymous reviewer for their help.

## Footnotes

One contribution of 15 to a discussion meeting issue ‘Dating species divergences using rocks and clocks’.

- Accepted January 7, 2016.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.