## Abstract

RNA viruses account for numerous emerging and perennial infectious diseases, and are characterized by rapid rates of molecular evolution. The ecological dynamics of most emerging RNA viruses are still poorly understood and difficult to ascertain. The availability of genome sequence data for many RNA viruses, in principle, could be used to infer ecological dynamics if changes in population numbers produced a lasting signature within the pattern of genome evolution. As a result, the rapidly emerging phylogeographic structure of a pathogen, shaped by the rise and fall in the number of infections and their spatial distribution, could be used as a surrogate for direct ecological assessments. Based on rabies virus as our example, we use a model combining ecological and evolutionary processes to test whether variation in the rate of host movement results in predictive diagnostic patterns of pathogen genetic structure. We identify several linearizable relationships between host dispersal rate and measures of phylogenetic structure suggesting genetic information can be used to directly infer ecological process. We also find phylogenetic structure may be more revealing than demography for certain ecological processes. Our approach extends the reach of current analytic frameworks for infectious disease dynamics by linking phylogeography back to underlying ecological processes.

## 1. Introduction

Ecological and evolutionary processes are inevitably and intrinsically linked [1]. However, the importance of coupled ecological–evolutionary interaction is often underappreciated owing to a general perception that ecological and evolutionary processes operate on vastly different time-scales. Evolutionary processes are often thought of as occurring over centuries to millennia while ecological dynamics are measured in months or decades. The discordance of time-scale leads to a perception that ecological and evolutionary processes can be treated independently. However, theoretical results that embrace both sets of processes consistently point towards the importance of ecological and evolutionary interactions even when rates are dissimilar [2–5]. Several recent studies also suggest that comparable ecological and evolutionary rates for plants and animals may be more common than previously assumed [6–9].

Interactions between ecological and evolutionary processes are nowhere more evident than in the dynamics of infectious diseases. Pathogens can evolve rapidly under strong selective pressure to develop resistance to antimicrobial drugs and to avoid host immune system defences [10,11]. Accelerated rates of molecular substitution and neutral evolution are also observed, and are a result of several features of pathogen biology that enhance the emergence of new genotypes [12,13]. Many pathogens are characterized by generation times measurable in hours or days and by high rates of population growth and turnover. In addition, a rapid succession of population bottlenecks as infection spreads to new hosts and limited gene flow among demes can promote local genetic differentiation. Rates of molecular evolution are also accelerated, because of limited fidelity during genome replication. This situation is most prominent among single-stranded RNA viruses such as hepatitis C, measles and influenza. Such viruses lack a mechanism for correcting errors during replication, resulting in evolutionary rates ranging from 10^{−2} to 10^{−4} substitution per site per year, several orders of magnitude faster than observed rates in plants and animals [12,13].

Over the course of an epidemic, a rapid evolutionary rate will result in the emergence of multiple viral genotypes [14,15]. A phylogenetic tree inferred from these genotypes traces the pattern of ancestry among viral lineages and transmission between hosts. The structure of the phylogeny, and the timing and frequency of bifurcations are correlated with transmission events and the effective number of infections (*N*_{e}* τ*) [16–18]. Through the application of coalescent methods, such as skyline plots, the demographic history of a viral population can be reconstructed from the pathogen's phylogeny, linking genetic and ecological dynamics [18–20]. As an infection expands, the virus' genome is passed from infected to susceptible hosts, analogous to offspring receiving a genome from their parents. As a consequence, the skyline plot and the viral phylogeny are a reflection of the infection dynamics in the host population.

The linkage between ecological and evolutionary dynamics occurring on the same time-scale suggests the possibility that the evolutionary history of a virus can be used to distinguish among different ecological processes governing the spread of disease [17]. We suggest that different ecological processes will result in characteristic differences in evolutionary history making the former recoverable as distinct ‘signatures’ from virus sequence data. We also suggest these signatures may persist over time so that past ecological processes can be identified from viral sequence data collected years or decades after the fact.

One candidate process for establishing linkages between ecological and behavioural dynamics and the formation of lasting evolutionary genetic signatures is the dispersal of both host and pathogen. At present, we lack general approaches for relating spatial-genetic patterns seen in rapidly evolving pathogens back to underlying ecological and epidemiological processes [21]. Opportunities for spatial diffusion vary greatly between different types of virus, such that some are characterized by predominantly local contact (e.g. measles [22]), whereas long-distance translocation (LTD) has been highly influential in the spread of many other viral diseases of public health concern (e.g. monkeypox, severe acute respiratory syndrome, influenza, HIV and West Nile virus [23–25]). LDT facilitates the establishment of new centres of infection that are far from existing infected populations and can contribute to the rapid spread of disease. Individuals in many wildlife species are capable of traversing long distances, and in the process transporting their diseases [26,27]. LDT can also be the result of deliberate or unintentional human actions, as illustrated by the global dissemination of many human and animal diseases over the last few centuries [23,24,28].

We hypothesize that the presence of LDT during an epidemic, as a specific mechanism of movement that adds to the otherwise local processes of dispersal, will produce recognizable patterns in the viral phylogeny. Differences in phylogenetic branching patterns are expected as a result of the increased growth afforded by LDT. In the absence of LDT the advancing wave of infection is expected to travel at a constant velocity, *v* [29,30]. As a result we expect the area, and number of infected hosts, to grow in proportion to the square of time (*N*_{e}* τ* ∝

*(*

*π**vt*)

^{2}). On the other hand, we expect growth with LDT to be exponential. This follows from the exponential growth in the number of new foci of infection that emerge early in the epidemic. Considering a landscape composed of

*N*patches that are either infected or uninfected: at any time, the number of already infected hosts dispersing by LDT is proportional to the number of infected patches. LDT allows each disperser to potentially reach a large number of patches, and earlier in the epidemic most of these will still be uninfected. As a result, the rate of change in the number of infected patches is also proportional to the current number of infected patches. That is d

*N*/d

*t*=

*where*

*λ*N*is the growth rate in the number of infected patches. The net effect in this case is sustained exponential growth of the epidemic (*

*λ**N*

_{e}

*=*

*τ**Ce*). Are these two cases of epidemic expansion, with and without LDT, distinguishable from their resulting viral phylogenies?

^{λt}We also expect there to be detectable differences in the spatial arrangement of viral lineages associated with changes in the mode or rate of dispersal. One manifestation of these differences is a decline in isolation by distance (IBD). IBD is the amount of genetic variance between lineages that can be attributed to the geographical distance that separates those lineages [31]. Moreover, as the geographical distance between lineages increases, the genetic distance between lineages is expected to increase. Dispersal, on the other hand, is expected to decrease IBD because it tends to genetically homogenize populations and reduce the amount of between group variation. What remains unclear, however, is the specific form of the relationship between rates of movement and changes in IBD. We further expect the area occupied by a viral lineage, as well as its spatial overlap with other lineages, to increase with the rate of LDT.

Here, we explore the interaction between the molecular evolution of a virus and ecological processes that govern its spread, and how these interactions manifest themselves in the structure of the viral phylogeny. Our results are based on a spatially explicit SIR-type model [32] that includes the molecular evolution of the virus that occurs concurrently with the spread of infection. For illustrative purposes, we focus on the spatial spread of raccoon rabies and the concurrent molecular evolution of the rabies virus. Raccoon rabies is an excellent case study for exploring our hypothesis. The rabies virus genome consists of a single strand of RNA, eliminating the complicating effects of reassortment [33], shows little or no recombination, and has a rapid evolutionary rate (approx. 3 × 10^{−4} substitution per site per year) [34]. In addition, the rabies strain associated with raccoons is thought to be exclusively maintained and dispersed by this host species [33]. We compare the phylogenetic structure of the rabies virus that emerges under different estimated rates of LDT [35] with the following research aims: first, we use simulated data based on reasonable parameter values to determine whether the presence of LDT during an epidemic can be reliably detected from genetic data collected retrospectively. Second, we test whether evidence for LTD can be detected in empirical genetic data from a large raccoon rabies epidemic in the USA. Finally, we assess the spatial-genetic patterns predicted to arise from different levels of LTD. We discuss our findings in the context of a rising need for inference methods producing epidemiological insights from pathogen sequence data.

## 2. Material and methods

Our model is composed of three integrated components. The first is a discrete stochastic model of the spatial dynamics of the host population that emerge under the spread and persistence of an infectious pathogen. The second part of our model tracks the bifurcating tree of infections created as infection is spread from one host to another. Finally, our model simulates the molecular evolution of the virus along the branches and nodes of the infection tree. We use our model to simulate viral phylogeny that emerges under six different rates of LDT (*ϕ*_{LDT} = 0, 3.75 × 10^{−8}, 7.5 × 10^{−8}, 1.5 × 10^{−7}, 3 × 10^{−7}, 6 × 10^{−7}) that range from no LDT to the fastest rate based on the relationship between local and LDT.

For each rate of LDT, we simulated 40 replicate viral phylogenies shaped by 20 years of spatial disease spread. Each phylogeny is based on a random sample of 100 infected hosts sampled at the end of the 20-year period. All other model parameters are held constant among runs and differences that emerge in the viral phylogeny can be attributed to changes in the rate of LDT.

### (a) Model framework

Epidemic dynamics are described by a SIR-type model based on a system of ordinary differential equations (ODEs) [32]. The continuous ODE model is converted into a discrete stochastic model using Gillespie's method [36]. Our discrete stochastic model describes the local population dynamics within each of the patches that make up the larger landscape.

The equations for our ODE model are
2.1
2.2
2.3
2.4
2.5where *i* and *j* are a spatial indices, and *S _{i}*,

*E*and

_{i}*I*are the number of susceptible, exposed and infected individuals at location

_{i}*i*, respectively. The total population size in location

*i*is

*N*.

_{i}In the absence of infection, population growth is logistic with an intrinsic rate of increase given by *a,* and a density-dependent mortality rate, *b*. This results in a stable disease-free equilibrium at () *=* (*a/b*, 0, 0) in all locations *i*. Our estimate for the intrinsic rate of increase, *a*, is 3.67 × 10^{−3} day^{−1}, and is based on a mean birth rate of 1.33 kits per year [37–39]. Our estimate for the density-dependent death rate (*b* = 2.29375 × 10^{−7} day^{−1}) yields 16 000 raccoons per county, equivalent to a density of 10 raccoons km^{−2} [40,41]. Our estimated values for these and all other model parameters are summarized in table 1.

The infection of susceptible individuals occurs at a per capita rate of * βI_{i}* where

*is the transmission coefficient. Values for*

*β**in epidemiological models are often difficult to estimate, because infectious contacts are rarely observed [48]. Given the other model parameters, our estimate for*

*β**reproduces the observed 48-month cycle in infectious raccoons [42], and observed net reproductive rate,*

*β**R*

_{0}, for raccoon rabies of approximately 1.2 [34]. The transmission coefficient

*must also be scaled to match the spatial resolution of the model since the spread of infection between individuals in our model is a density-dependent process. In estimating values for the transmission rate and emigration rates, we applied a heuristic algorithm that evaluated*

*β**in increments of 10*

*β*^{−4}, while holding other model parameters fixed to find a value that minimized the absolute difference between the value of a statistic for the model and observed data.

Following infection, there is a latency period during which the virus spreads within a host from the point of infection, through peripheral nerve tissue and into the central nervous system [33]. This is included in the model by the term * σE_{i}* in equations (2.3) and (2.4), where 1/

*is the expected length of the latency period. The length of the latency period in raccoons is highly variable and ranges from 18 to 107 days with a mean of approximately 1/*

*σ**= 50 days [43–45]. The invasion of the central nervous system is rapidly followed by the migration of virus into the salivary glands and the shedding of virus in the saliva. At this point hosts are infectious, and this phase corresponds with the onset of the end-stage symptoms typical of rabies (e.g. hyper-aggression, hydrophobia, foaming saliva) [33]. Death follows quickly after the onset of symptoms and is included in the model through the term*

*σ**, where 1/*

*α*I_{i}*is the life expectancy of an infectious raccoon. The length of this period is extremely short, on the order of 14 days [33].*

*α*Our model includes both local dispersal and LTD of raccoons. Local dispersal is the movement of raccoons over the landscape in search of food, mates and shelter. Local dispersal of hosts is included in the model equations, where * ϕ* is the per capita dispersal rate. The rate of arrival from surrounding locations,

*j*, into location

*i*is governed by

*, where*

*ϕ*k_{ij}*k*is the fraction of dispersers from

_{ij}*j*that arrive in location

*i*. The

*k*s satisfy the condition that all emigrants from location

_{ij}*j*arrive in one of the surrounding areas (i.e.

*Σ*

*k*= 1).

_{ij}We use coefficients, *k _{ij}*, that restrict raccoon dispersal to the eight adjoining locations (a Moore neighbourhood) based on the spatial scale of the model (40 × 40 km

^{2}) and the relatively short dispersal distance reported for raccoons (0.5 to 10 km year

^{−1}) [40,49]. Our earlier analysis of raccoon rabies data estimated the speed of rabies spread to be between 34.6 and 42.2 km year

^{−1}[34,47]. The emigration rate,

*= 6 × 10*

*ϕ*^{−6}day

^{−1}, results in rabies spread of 38.4 km year

^{−1}. Detailed analysis of the raccoon rabies data for Connecticut indicates the rate of LTD is an order of magnitude smaller than the rate of local dispersal [35]. On this basis we have set the maximum rate of LTD at

*ϕ*_{LDT}= 6 × 10

^{−7}day

^{−1}.

Our model landscape is 600 × 1400 km divided into 525 (15 × 35) and 40 × 40 km patches. The total model area (840 000 km^{2}) and the patch size (1600 km^{2}) approximate the area affected by the mid-Atlantic raccoon rabies epidemic [50] and the average size of counties in the eastern USA (see the electronic supplementary material, figure S1). Patches within our landscape are connected by the local movement of raccoons between adjoining locales, and by LDT that allows movement across the entire landscape.

### (b) Stochastic model

Equations (2.1)–(2.5) are used to form a stochastic model following the method described by Gillespie [36]. Gillespie's method involves reconstituting the continuous dynamical model into a sequence of discrete events that represent the individual processes embodied within a system of differential equations. Gillespie's method provides a formal framework for determining the mean time-step between events, *r*, as well as the probability distribution of events that may occur at each step. Our equations include the effects of 13 distinct types of events: births (*aN _{i}*), density-dependent deaths arising from non-disease related processes (

*bN*,

_{i}S_{i}*bN*), infection of susceptible individuals (

_{i}E_{i}, bN_{i}I_{i}*), the progression of the host from latent incubation of the virus to infectiousness (*

*β*S_{i}I_{i}*), the death of infectious individuals of rabies (*

*σ*E_{i}*) and the local and LDT of individuals in each disease class (*

*α*I_{i}*,*

*ϕ*S_{i}*,*

*ϕ*E_{i,}*ϕ*I_{i}

*ϕ*_{LDT}

*S*,

_{i}

*ϕ*_{LDT}

*E*

_{i,}*ϕ*_{LDT}

*I*).

_{i}The sum of all the terms on the right-hand side of equations (2.1)–(2.3) is the total rate (*r*) at which events of all types occur within the population. That is, the total rate at which events occur is
2.5The inverse, 1/*r*, is the expected time between events, and is used to determine how much time elapses between events. The probability that a particular event occurs is given by the current rate for that event normalized by the total rate. For example, the probability that an infection will occur at the next event time is * βS_{i}I_{i}*/

*r*. The individual probabilities are combined to create a probability mass function of events (summarized in the electronic supplementary material, table S1) that determines the probability distribution of events.

The model dynamics are created by repeatedly computing the time to the next event, updating the model clock to the new time and choosing the type of the next event based on the probability mass function. At each step in the model, the expected time, 1/*r*, and the new probability mass function are recomputed from current values for the rates of each event type. The time to the next event is drawn at random from an exponential distribution with mean 1/*r*. Time is updated to the next event time and the type of event that occurs is determined from the probability mass function. The process repeats until the desired simulation length is reached.

### (c) Infection tree and molecular evolution

The link between the dynamics of infection spread and the viral phylogeny is the bifurcating tree of infection events that emerges as the pathogen spreads from host to host (see the electronic supplementary material, figure S2*a*). The second component of our model tracks the entire bifurcating tree of infections created as rabies infection is spread from one host to another. The infection of one host by another is represented by a fork, or node, in the tree. The two branches that extend from the node represent the source of infection and the newly infected individual. The branch that enters the node represents the originally infected individual. The length of each branch is measured in units of time and marks the interval between infection events. The root of the tree is the index case for the epidemic and the branch tips represent living hosts infected with rabies. The entire tree is a record of the spread of the virus between individual members of the population. It can be used to determine the chain of infectious antecedents that connect a particular individual to the index case or the infectious descendents that follow from a single infected individual.

As the disease spreads, each fork in the infection tree is a potential point of molecular diversification for the virus. The third component of our model simulates the molecular evolution of the virus along the branches and nodes of the infection tree. Once a new host has been infected, the processes of mutation and fixation occur independently in the two viral populations, and the substitutions that characterize each viral population accumulate independently (governed by a common evolutionary model and rate). The accumulated differences in the sequence are the basis for the molecular diversification of the viral phylogeny. Further, because of the rapid evolutionary rate of RNA viruses, the structure of viral phylogeny closely follows the structure of the infection tree (see the electronic supplementary material, figure S2).

### (d) Skyline plot analysis

We measure the overall growth of an epidemic based on the rate of increase in the effective number of infections (*N*_{e}* τ*) revealed through a skyline plot analysis [20]. For each replicate run, we extracted the infection tree that connected 100 infected raccoons selected at random, and simulated the molecular evolution of the virus along the nodes and branches of the infection tree to produce a viral phylogeny. We then computed a generalized skyline plot for each viral phylogeny using the R package ape [51] to estimate the change over time in the effective number of infections (

*N*

_{e}

*). Next, we used maximum-likelihood to fit both a time-squared (*

*τ**N*

_{e}

*(*

*τ*=*π**vt + b*)

^{2}) and exponential (

*N*

_{e}

*) model of growth to the log-transformed time-series of*

*τ*= Be^{λt}*N*

_{e}

*values produced by the skyline analysis. We used*

*τ**Δ*AIC to determine which of the two growth models was a better fit.

### (e) Spatial analysis

For each rate of LDT, we correlated the pairwise geographical and genetic distances to assess the magnitude of IBD. For each correlation we use the slope (*m*_{IBD}) and the correlation coefficient () of IBD to measure the strength of the relationship between geographical and genetic distance. The slope (*m*_{IBD}) of the correlation indicates the rate at which the genetic distance between samples increases with increasing geographical distance. The IBD correlation coefficient () is a measure of the amount of variation accounted for by a linear relationship. The slope and correlation coefficient together reflect the strength of the correlation and the degree of IBD.

The spatial spread of infection causes viral genotypes to occur in clusters that partition the landscape into patches (see the electronic supplementary material, figure S3) [34,52]. We computed the total area and the area of overlap of the geographical ranges encompassed by clusters of related viral genotypes. The area and overlap of genetic clusters explicitly includes the hierarchical details of the viral phylogeny. A single convex polygon (i.e. one with no concave features) that just encompasses a set of related viral genotypes can be used to define the area occupied by a viral lineage. The area of such a polygon is expected to increase as the rate of LDT is increased because related viral genotypes are more likely to be separated at higher rates of dispersal. The aggregation of genetic lineages into groups contains an element of arbitrariness. We address this issue by defining the endemic phase groups in terms of transition time *t _{T}*, at which the epidemiological dynamics shift from epidemic to endemic. Specifically, we are interested in those groups of viruses where all isolates share a common ancestor ‘a’ that is more recent than

*t*and that maximize the number of isolates within each group.

_{T}### (f) Sensitivity analysis

We performed a sensitivity analysis of our model to see if our results were robust to changes in the model parameters. We performed an additional 12 runs in which we either increased or decreased by 10 per cent the value of each of the six model parameters (*a*, *b*, * β*,

*,*

*σ**,*

*α**). We also considered an increased spatial domain for our model (50 × 30 cells) to explore the possibility that our results were scale-dependent or influenced by close proximity of the landscape boundaries along the long axis of the model domain.*

*ϕ*In contrast to our model, real-world landscapes are spatially heterogeneous which can introduce noise to biological processes that makes detecting patterns difficult. For example, random variations in habitat quality may result in spatial variation in the demography of the host population and may alter the pattern and tempo of an epidemic. We addressed the question whether such spatial heterogeneity could obscure the phylogenetic signatures associated with different rates of LDT by making the local carrying capacity of each cell and the rate of movement between cells spatially random variables.

To vary carrying capacity, we assigned a random value to the per capita birth rate, *a*, to each location in our simulated landscape. Random values were drawn from a lognormal distribution with a mean equal to the baseline birth rate (table 1). The variance for the distribution was chosen to place half of distribution area between ±10% of the mean value and half outside these limits (see the electronic supplementary material, figure S4). This resulted in the carrying capacity ranging between 7.5 and 13.3 raccoons km^{−2} at 95 per cent of locations. Spatial variation in the dispersal rates was created using the same method of assigning a random deviate to each location from a lognormal distribution centred at the mean local dispersal rate (* ϕ* = 6 × 10

^{−1}day

^{−1}).

For each replicate run of the model we created a new random landscape that was held constant over the course of each replicate. We considered scenarios that included stochastic variation in the carrying capacity only, variation in the dispersal rate only and a scenario in which spatial variation in both demographic parameters was combined.

## 3. Results

We find that there are unique signatures embedded within the structure of the viral phylogeny produced by different rates of LDT. Skyline plots reveal the effects of LDT as changes in the growth rate of the effective number of infections (*N*_{e}* τ*; figure 1

*a*). As we hypothesized, there are discernible and diagnostic differences in the pattern of growth characteristics of the epidemic with and without LDT. Our results, based on the sign of

*Δ*AIC and summarized in table 2, support our hypothesis that epidemics characterized by local dispersal (

*ϕ*_{LDT}

*=*0) should follow a time-squared pattern of growth while exponential growth will arise when spatial spread includes LDT. In 72.5 per cent of epidemics without LDT the time-squared model is a better fit to the skyline data (

*N*

_{e}

*). On the other hand, exponential growth is a better fit in 80–94.7% of epidemics when LDT is included as a mode of host dispersal (*

*τ*

*ϕ*_{LDT}≠

*0*). Increasing rates of LDT also produce differences in the phylogenetic structures that can be revealed as increasing rates of growth in the skyline analysis. Higher rates of LDT result in faster epidemic growth rates with a linear relationship between the rate of exponential growth and the non-zero rate of LDT (

*= 4.02 × 10*

*λ*^{3}

*ϕ*_{LDT}+ 2.1 × 10

^{−4},

*R*= 0.68,

^{2}*p*< 0.001) (figure 1

*b*).

Variation in the value of the model parameters produced results that were similar to those observed using the baseline parameter values. The relationship between the non-zero rates of LDT and the exponential growth rate of the epidemic (* λ*) remained positive and linear, with the slope of the relationship ranging between 1.9 × 10

^{3}and 7.5 × 10

^{3}(see the electronic supplementary material, figure S5). The time-squared function continued to be the best fit to the growth of

*N*

_{e}

*when*

*τ*

*ϕ*_{LDT}

*=*0, and the exponential function continued to be the best fit to growth when

*ϕ*_{LDT}≠ 0 (see the electronic supplementary material, table S2). Our results on a larger landscape were qualitatively similar to those produced on the original landscape (see the electronic supplementary material, table S2). The results of our sensitivity analysis are given in the electronic supplement.

Our results were also robust to landscape heterogeneity in the host carrying capacity and local dispersal rates. The time-squared model of growth was still a better fit when there was only local dispersal (*ϕ*_{LDT} = 0), and the exponential growth model was the better fit in the presence of LDT (see the electronic supplementary material, table S2). In addition, the positive relationship between the rate of LDT and the rate of exponential growth was nearly identical to the relationship obtained from the model without spatial heterogeneity.

We used our model to directly compare the true demographic process of disease spread to the estimate derived from the skyline plots. Such a comparison provides insight into the degree of correlation between skyline reconstruction and the population dynamics when there are multiple complicating factors such as spatial structure and demographic stochasticity. We found the effective number of infections (*N*_{e}* τ*) did not reflect the total number of infected hosts (

*I*

_{total}) when the two datasets were compared directly. Linear regressions of number of infected hosts on

*N*

_{e}

*had low coefficients of determination (*

*τ**R*

^{2}) and slopes that were not significantly different from zero. However, the lack of correspondence was in large part due to demographic stochasticity to which the skyline plots are not sensitive. On the other hand, we found a strong correspondence when we used the rate of exponential growth to compare the demographic data to the skyline reconstructions. A linear regression of the rates of exponential growth from both sources indicates a strong linear relationship between the two (figure 2).

We also found that the demographic data was not as reliable a source as the skyline plots for detecting the differences between time-squared and exponential growth. For each replicate run, we computed the daily total number of infected hosts, *I*_{total}, from the spatial dynamics. We then fitted models that assumed either time-squared or exponential growth during the period of spatial spread to the time-series of *I*_{total}. Maximum likelihood was used to estimate values for the growth model parameters and AIC was used to compare the relative fit of each model with the demographic data. Unlike our results for the skyline plot (table 2), the time-squared growth model was not the better fit to the demographic data when *ϕ*_{LDT} = 0 and instead the exponential model was a better fit to *I*_{total} both with and without LDT (see the electronic supplementary material, table S4). The difficulty in detecting time-squared growth arises because the demographic data is noisy as a result of demographic stochasticity. This tends to mask differences between time-squared and exponential growth.

We apply our model to explore the mode of epidemic growth exhibited by the mid-Atlantic raccoon rabies epidemic reported by Biek *et al*. [34]. Based on their analysis, Biek *et al*. [34] divided the raccoon rabies epidemic into three phases of epidemic expansion corresponding to different physical features of the landscape over which the epidemic spread. We have re-examined these three phases and performed a maximum-likelihood fit of a time-squared versus exponential model to the growth in each phase (phase 1: 1975 through 1982, figure 3*a*; phase 2: 1986 through 1989, figure 3*b*; phase 3: 1989 through 1992, figure 3*c*). As in the analysis of our model results, we use *Δ*AIC to summarize the relative fit of the two modes of epidemic growth. The procedure we followed to perform the maximum-likelihood fit are described in the electronic supplementary material.

We find that in each phase, the exponential model was a better fit to the Bayesian skyline data than the time-squared model (phase 1: *Δ*AIC = −65.2; phase 2: *Δ*AIC = −6.6; phase 3: *Δ*AIC = −9.6). Based on the classification derived from our model, these results suggest that LDT of infected hosts was not only occurring, but also had a pronounced effect during the mid-Atlantic spread of raccoon rabies. While we are not claiming this as definitive proof of a role of LDT, the exponential growth observed in the phylogenetic data is consistent with an analysis of the demographic data which found that LDT occurred frequently during the spread of raccoon rabies in Connecticut [35].

We are also able to use our model results to quantify the relationship between the spatial distribution of viral lineage and rate of LDT. We find that the degree of IBD and the rate of LDT follow a power law ( () (figure 4*a*). These results indicate that the strength of IBD will vanish as the rate of LDT increases. This is to be expected since faster rates of LDT will tend to homogenize the landscape, causing all viral genotypes to become evenly distributed in space. More specifically, the small value for the scaling factors linking *ϕ*_{LDT} and *m*_{IBD} suggest a persistent signature of IBD over a wide range of LDT rates. We find that the area encompassed by genetic groups and the area of overlap between groups increase as the rate of LDT increases (figure 4*b*). These changes reflect changes in the local spatial structure of the landscape under different rates of LDT. However, as the rate of LDT increases beyond a threshold, the area of groups and overlap reach an upper limit. The upper limit is a consequence of the finite landscape over which an infectious disease could spread.

## 4. Discussion

Our results suggest that differences in fundamental ecological processes, represented in our analysis by differences in the magnitude of LTD, may result in detectable differences in the topology and geographical pattern of pathogen phylogenies. These differences in genetic effects leave a persisting ‘signature’ indicative of specific ecological processes. Even when samples collected from an epidemic reflect many local processes interacting over space, different rates of LDT leave a detectable signature within a viral phylogeny.

Our model includes the demographic stochasticity of the host's birth/death process as well as the disease cycle of rabies from transmission, to incubation, infectiousness and finally death. We have also considered the effects of spatial heterogeneity in the demography of the host species. Despite the considerable noise these processes are expected to add, we are able to reveal several differences in the phylogeny of the virus produced by LDT. Using skyline plots to examine the sequence of coalescent events in the viral phylogeny, we can compare the rate of growth of the epidemic for different movement scenarios. In accordance with our expectations, we find that epidemic growth without LDT follows a time-squared function, whereas growth with LDT is exponential. Further, we find an increasing linear relationship between the rate of exponential growth and the log-transformed rate of LDT.

Our comparison of the skyline plots to the epidemic dynamics indicates that the skyline analysis can provide a reliable estimate of the true demographic changes in the host population. We found that the rate of exponential growth estimate from *N _{e}τ* was in close agreement with the rate estimate taken directly from the demographic data (figure 2).

Our results also point to the value of combining phylogenetic analysis with analysis of demographic data collected from an epidemic. We found that the mode of epidemic growth may be easier to detect from the signatures embedded in the viral phylogeny than from the analysis of the demographic data. Time-squared and exponential growth were distinguishable from the pattern of the viral phylogeny (table 1), but this was not the case using only demographic data (see the electronic supplementary material, table S4). Demographic stochasticity in the dynamics of an epidemic introduce noise into the demographic data which may tend to obscure differences between ecological processes such as the model of dispersal. On the other hand, while the rate of rabies virus evolution is fast, it is still not capable of tracking day-to-day changes in the size of an epidemic, and so may be providing a smoother picture of epidemic growth.

In an earlier analysis, solely based on ecological data, we established the importance of LDT during the spread of raccoon rabies in Connecticut over a five-year period of expansion [35]. Mathematical models simulating the dynamics of spread that included LDT as a distinct process result in a better agreement with the observed pattern of spread relative to models that did not include LDT. This earlier analysis used models and data calibrated at the spatial scale of townships, with a mean area of 85 km^{2}. Our current analysis, however, suggests that the effects of LDT are preserved across spatial scales. The spatial resolution of our analysis of the entire eastern seaboard epidemic is at the level of counties, with a mean area of 1600 km^{2}. The combined analysis reveals a process of expansion characterized by foci of secondary infection acting in advance of the encroaching epidemic wave at both local and regional spatial scales.

Our model results are also consistent with results suggesting that there may have been considerable heterogeneity in the rate of rabies spread in the eastern USA. Using a model that allowed for overdispersed variation in the rate of spread along different branches in the phylogeny (a relaxed random walk), Lemey *et al*. [54] document a better fit to the genetic data than a time-homogeneous Brownian diffusion model. As an important distinction, the approach by Lemey *et al*. is focused on revealing and quantifying heterogeneity in the spatial diffusion process from geo-referenced samples without being tied to any particular ecological mechanism. Variation in branch-specific diffusion rates in their case may thus occur for a variety of reasons, including variation in the rate of local spread, the existence of dispersal barriers as well as some contribution of long-distance dispersal. Here, we suggest a complementary approach in which genetic data is analysed with a model that makes explicit the contribution of different modes of dispersal and can be used to make predictions about spatial patterns of growth.

Our analysis based on skyline plots does not include spatial location information from our spatially explicit model. However, our results suggest that phylogenetic patterns associated with spatial processes can be detected even when reliable spatial data are not available. Collection of spatial data that span the period of an epidemic has been an important asset in reconstructing the geographical spread of several infectious diseases including foot and mouth disease [55], and rabies [34,50,54]. However, the collection of accurate and complete public health data is a significant challenge in many parts of the world [56–59]. Our results suggest that analysis of viral sequence data may be useful in overcoming some of these data deficiencies.

Our results are based on viral sequences collected 20 years after the start of the epidemic, when the disease is well into its endemic phase. We chose this retrospective sampling regime because long delays between when an epidemic starts and when it is first sampled are common in infectious disease studies. Wildlife diseases are notoriously difficult to detect because of insufficient surveillance [60]. Even in human populations, epidemics can go unnoticed for prolonged periods of time as highlighted by the discovery of HIV in the early 1980s nearly 30 years after the virus’ suspected emergence in Africa [61,62]. The late sampling date allows for prolonged action of stochastic processes, such as genetic drift and small numbers of infected hosts, which tend to obscure the effect ecological processes have on the viral phylogeny. Despite this, we find that differences in the mode and tempo of dispersal can be recovered from the viral phylogeny (figures 1,3 and 4) decades later. We expect alternative sampling procedures that include collection of some or all samples at times closer to the start of an epidemic to provide better estimates of past ecological processes.

Our results also indicate that at the spatial scale of US counties used in our model, the ability to detect the effects of LDT are limited if sampling were to occur immediately after the start of an epidemic. The skyline plots for different rates of LDT in figure 1*a* overlap and are essentially indistinguishable in the period immediately following the epidemic. As a result, it would be impossible to detect differences associated with different rates of LDT. This outcome is expected because early in the epidemic there will have been few or no opportunities for LTD events larger than a single county to occur. In such fortunate circumstances where an epidemic is identified before it has a chance to spread, direct monitoring of the epidemic would be a more efficacious approach to studying the disease dynamics. In addition, highly accurate geo-referenced genotype data may allow an earlier assessment of the role of LDT than we observe in our model. However, the chance to observe an epidemic from its very beginning is likely to be the exception. For many epidemics, in both human and wildlife populations, techniques and theory can help reveal unobserved epidemic dynamics.

As long as mutational events occur on similar time-scales as the generation time and the time-scales of spatial movement of the pathogen, this should result in the emergence of phylogeographic structure that can yield insights into the ecological processes that shape the spread of disease, here specifically dispersal. Systems this may apply to are other viruses in terrestrial wildlife hosts, such as hanta- or arenaviruses (potentially long infectious periods offset by limited dispersal ability in the host) or canine distemper, as well as certain plant viruses.

## Acknowledgements

The authors thank Vijay Panjeti and Derek Johnson for their comments. This research was supported by NIH grant no. RO1 AI047498 to L.A.R., and the RAPIDD Programme of the Science and Technology Directorate, The Department of Homeland Security and The Fogarty International Center, National Institutes of Health and the Louisiana Board of Regents grant no. LEQSF(2011-14-RD-A-27) to S.M.D. Computer simulations were performed on the Emory High Performance Computer Cluster and the Louisiana Optical Network Initiative computers.

## Footnotes

One contribution of 18 to a Discussion Meeting Issue ‘Next-generation molecular and evolutionary epidemiology of infectious disease’.

- © 2013 The Author(s) Published by the Royal Society. All rights reserved.