## Abstract

Mammalian DNA is littered with the signatures of past retroviral infections. For example, at least 8% of the human genome can be attributed to endogenous retroviruses (ERVs). We take a single-locus approach to develop a simple susceptible–infected–recovered model to investigate the circumstances under which a disease-causing retrovirus can become incorporated into the host genome and spread through the host population if it were to confer an immunological advantage. In the absence of any fitness benefit provided by the long terminal repeat (LTR), we conclude that signatures of ERVs are likely to go to fixation within a population when the probability of evolving cellular/humoral immunity to a related exogenous version of the virus is extremely small. We extend this model to examine whether changing the speed of the host life history influences the likelihood that an exogenous retrovirus will incorporate and spread to fixation. Our results reveal the parameter space under which incorporation of exogenous retroviruses into a host genome may be beneficial to the host. In our final model, we find that the likelihood of an LTR reaching fixation in a host population is not strongly affected by host life history.

## 1. Introduction

Mammalian genomes are littered with the signatures of ancient retroviral infections. Endogenous retroviruses (ERVs) are formed by the successful incorporation of an exogenous retrovirus into the germline and its subsequent vertical transmission. ERVs have been described from a wide range of vertebrate genomes, including mammals, fish, birds, reptiles and amphibians [1]. They constitute a significant proportion of host genomes: in humans, ERVs represent 8% of the genome and over 10% of the genome in mice [2,3]. Typically, an ERV consists of three genes (*gag*, *pol* and *env*) and two flanking non-coding long terminal repeats (LTRs), which are identical at the time of integration. Over time, these insertions will have accumulated mutations and deletions at the same rate as the mutation rate of the host genome [4], rendering them non-functional. ERVs may also be inactivated by recombinational deletion between the two flanking LTRs, which removes the internal coding region, leaving a solo LTR. Solo LTRs are 10–100 times more numerous than their full-length counterparts [5], and many of these insertions are fixed in the host population. In various mammal species, there are a few examples of intact, evolutionarily young ERVs that are polymorphic (presence/absence) in their host population [6–11]; whether this is persisting polymorphism maintained through various evolutionary forces or actual active ERVs remains to be established. However, some ERVs do appear to be replication competent [10,12–16].

Although most insertions are inactive remnants subject to forces such as genetic drift, some have maintained open reading frames (ORFs), exhibit low *d*_{N}/*d*_{S} ratios (non-synonymous–synonymous mutations), or have fewer stop codons than expected under neutrality, indicating purifying selection [17,18] and therefore functional constraint. There are numerous examples, in a range of species, where a retroviral insertion has been selected for as it confers a fitness advantage on the host, perhaps through co-option of a retroviral gene or through gene regulation [19–22]. There are also instances involving the host using the promoters in the LTRs of retroviruses to regulate expression of a host gene close to the point of retroviral insertion [21,23–25]. However, in most locations, these insertions have no known function. The increasing availability of genomic data has provided unique opportunities to study retroviral insertions, and recent advances have revealed that a number of ERVs have been co-opted as cellular genes which act as part of the immune system of the host, inhibiting viral infection [26]. This is a remarkable immune strategy in the evolutionary arms race between viruses and their hosts: the integration of viruses for use against themselves.

These observations raise some important questions: under what circumstances do ERVs successfully spread through host populations? Conventionally, it is thought that once inserted into the genome the insertions we see have drifted to fixation, but are there other mechanisms via which an active ERV could provide a sufficient fitness advantage to spread through a population, before losing its apparent advantage when it reaches fixation, such that it can be excised from the human genome? One plausible suggestion is that for some ERVs, incorporation into the genome provides the host with some immunity from related exogenous retroviruses [27,28] (described as *e*ndogenous viral element *d*erived *i*mmunity, EDI [26]). There are a number of examples of ERVs conferring some immunity through a variety of mechanisms [9,29,30]. Modelling provides a route to exploring this possibility.

Compartmental models offer a useful tool to examine both the ecological and evolutionary dynamics of ERVs [31]. Such models are widely used to study the dynamics of a range of pathogens [32], but have not been previously used to study the consequences of ERV ‘epidemics’ on host dynamics. Compartmental models structure a population by the pathogen status of the host. The simplest models have two states: those susceptible to infection (*S*) and those infected (*I*) with the pathogen. More complicated models include a recovered class (*R*) [31]. We develop a range of models that can include multiple infected and recovered states, and use these models to determine the conditions under which incorporation of retroviruses into the host genome benefits both the pathogen and the host.

We also investigate the impact of host life history on the probability that an LTR reaches fixation at an arbitrary, unspecified location in the host genome. The immune systems of short- and long-lived species can differ in many important ways, some of which may influence the fitness of a strategy that involves incorporation of an exogenous retrovirus into the host genome. There is evidence to suggest that immune defences may vary with the life history of the host [33]; that is, different immune response strategies may be used to a greater or lesser extent, depending on certain life-history traits. Mouse genomes, in particular, have many active ERVs, which contrasts strikingly with the human genome, where all ERVs are nearly extinct, with the possible exception of HERV-K [6]. This suggests that some aspect of the host, perhaps overall exposure to retroviruses, may influence whether ERVs remain active or become extinct. We extend the model to examine whether changing the speed of the host life history influences the likelihood that an exogenous retrovirus will incorporate and its inactive signature spread to fixation.

## 2. Modelling philosophy

Our aim is to describe the initial invasion of a genome by a retrovirus. We develop a discrete time ‘SIR’ model that we can use to examine the proportion of the population in each compartment after an infection has run its course. The models we construct consist of compartments, with parameters describing the rates of movement of individuals between classes (figure 1) [32]. We develop a range of models of increasing complexity, with the final model containing five compartments. This allowed us a detailed understanding of the predicted dynamics, even though we were unable to derive exact analytical solutions to the more complex models. Our use of compartmental models is novel as we are not interested *per se* in the dynamics of the infection; rather we are interested in the stable population structure after the epidemic has run its course, and particularly in the proportion of the population in the recovered state expressing a signature of the retroviral infection.

Before a population is exposed to an exogenous retrovirus, all individuals are in the susceptible compartment (*S*)—this means they are susceptible to becoming infected with the exogenous retrovirus. If they encounter an infected individual, then they have a probability of becoming infected. Once individuals are infected, then they can move into the infected with the exogenous virus compartment (*I*_{X}). Offspring born to individuals in this compartment are susceptible to infection, and they enter the susceptible compartment. Our first model is an SI model, consisting of just these two compartments (figure 1*a*). This model is useful because it is analytically tractable.

In our next model, we incorporate a second infected compartment for individuals that have incorporated the exogenous retrovirus into their germline (*I*_{E}). Individuals enter this class as offspring of parents that have incorporated the exogenous virus into the germline. These offspring have inherited the endogenous retroviral insertion, which is still producing infectious viral particles [34,35]. Offspring born of parents in this class will remain in the *I*_{E} compartment unless mutation deactivates the functional ERV, or recombinational deletion removes the ERV from their genome, in which case these individuals enter the susceptible compartment. For individuals in the *I*_{E} class, any such immunity provided by endogenization would be inherited. This model is an *SI*_{X}*I*_{E} model. We were unable to obtain any useful analytical solutions to this model, but its analysis is useful as it helps us identify the range of conditions under which a large proportion of a population ends up with incorporated retroviruses at a particular locus in their genome. Consequently, modelling the dynamics as whether incorporation occurs at a particular site in the genome allows us to understand the processes that can lead to an endogenous retroviral signature spreading through a population. We analyse this model graphically over the complete range of possible parameter values for a single, approximately human, life history.

There is another route to leaving the *I*_{E} and *I*_{X} compartments: through acquiring immunity to either the endogenous or exogenous retroviruses, and entering respectively the recovered with, and recovered without, an endogenous retroviral signature in their DNA (*R*_{LTR} and *R*_{N}; figure 1*c*). Individuals in the *R*_{N} class will have mounted a successful immune response to the exogenous retroviral infection, causing inactivation or clearance of the exogenous virus [36]. Consequently, their offspring will have no inherited immunity, and they return to the susceptible class. By the time individuals leave the *I*_{E} compartment, and the ERV insertion has undergone recombinational deletion to produce the solo LTR (i.e. individuals enter the *R*_{LTR} class), we assume the threat of the exogenous retrovirus has passed. Whatever immunity was offered by the ERV insertion is no longer required and thus the selection pressure on retaining the full-length ERV insertion is removed. Individuals cannot leave this recovered class. This class of model is an absorbing Markov chain. Because we were unable to obtain analytical solutions to the *SI*_{X}*I*_{E} model, we were unable to analyse this model analytically either. We consequently provide numerical results.

## 3. The models

### (a) A susceptible–infected model

We start with the simple *SI*_{X} model (figure 1*a*) of the dynamics of an exogenous retrovirus to which immunity cannot occur [37]. Susceptible individuals move into the infected state as a function of the proportion of the population in the susceptible and infected compartments and an infection rate, *β*. This simple model is given by equation (3.1), where *t* represents time, *τ* is the birth rate, which does not differ between compartments, *ϕ*_{S} is the survival rate of susceptible individuals and *ϕ*_{S}*ϕ*_{X} is the survival rate of individuals infected with the exogenous retrovirus. Consequently, 1 − *ϕ*_{X} is the mortality rate imposed by the exogenous virus. Individuals born to infected individuals enter the susceptible compartment, and *N*(*t*) = *S*(*t*) + *I*_{X}(*t*).

We are interested in the proportion of the population in each of the two compartments. It is straightforward to solve this equation analytically, which reveals the ratio of susceptible to infected individuals (equation (3.2)) and the proportion of the population in the infected compartment (equation (3.3)).
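The behaviour of the two-compartment model can be checked numerically. The sketch below is not the authors' code (their simulations were run in R) and the update rule is an assumption: it uses frequency-dependent transmission, *βSI*_{X}/*N*, places all newborns in *S*, and applies survival *ϕ*_{S} and *ϕ*_{S}*ϕ*_{X} to the two compartments. This form was chosen because it matches the verbal description and reproduces the no-mortality limit 1 − *τ*/*β* quoted in the results. The function names and the starting infected fraction are hypothetical.

```python
def simulate_si_x(beta, tau, phi_s, phi_x, steps=5000, i0=0.01):
    """Iterate an ASSUMED discrete-time SI_X update and return the final
    proportion of the population in the infected compartment."""
    s, i = 1.0 - i0, i0  # start with a small infected fraction (hypothetical)
    for _ in range(steps):
        n = s + i
        new_infections = beta * s * i / n   # assumed frequency-dependent mixing
        births = tau * n                    # all newborns enter S (from the text)
        s_next = phi_s * s - new_infections + births
        i_next = phi_s * phi_x * i + new_infections
        total = s_next + i_next
        # The update is homogeneous of degree one in (S, I), so rescaling to
        # proportions at each step does not change the proportional dynamics.
        s, i = s_next / total, i_next / total
    return i


def infected_equilibrium(beta, tau, phi_s, phi_x):
    """Closed-form infected proportion implied by the assumed update rule."""
    return 1.0 - tau / (beta + phi_s * (phi_x - 1.0))


if __name__ == "__main__":
    # Illustrative values only: phi_S and tau from the slow life history,
    # beta = 0.5 and a 0.03 mortality cost borrowed from the parameter section.
    print(simulate_si_x(0.5, 0.066, 0.95, 0.97))
    print(infected_equilibrium(0.5, 0.066, 0.95, 0.97))
```

Under this assumed form, the simulated proportion and the closed-form expression agree, and setting *ϕ*_{X} = 1 recovers 1 − *τ*/*β*. The sketch requires *β* ≤ *ϕ*_{S} so that the susceptible compartment stays non-negative.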

### (b) Adding a second infected compartment

From this starting point, we can introduce a little more complexity into the model by adding a second infected compartment (*I*_{E}) for infected individuals that have incorporated the virus into their genomes. For individuals to enter the *I*_{E} compartment, they must pass through the *I*_{X} compartment. We distinguish between the two infected states, and the parameters associated with each compartment, using the subscripts X (for exogenous) and E (for endogenous; figure 1*b*).

The *SI*_{X}*I*_{E} model is of the form given in equation (3.4). Here, the new parameters are *γ*, the rate at which exogenous retroviruses incorporate into the host genome, and *α*, the rate at which functioning ERVs are lost from the genome through mutation or recombinational deletion. We assume that individuals in the *I*_{E} compartment are less infectious than those in the *I*_{X} compartment (*β*_{E} < *β*_{X}) as exogenous retroviruses are typically more virulent than their endogenous counterparts. 1 − *ϕ*_{X} is the mortality rate of individuals in compartment *I*_{X}, and *N*(*t*) = *S*(*t*) + *I*_{X}(*t*) + *I*_{E}(*t*).

### (c) Incorporating recovered compartments

We now expand the *SI*_{X}*I*_{E} model to incorporate two recovered classes to derive an *SIR* model (specifically an *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model; figure 1*c*). Movement between the five compartments can be described mathematically with five equations (equation (3.5)), where *θ* is the rate at which immunity occurs in the exogenous compartment, *θ*′ is the rate at which immunity occurs in the endogenous compartment, and *N*(*t*) = *S*(*t*) + *I*_{X}(*t*) + *I*_{E}(*t*) + *R*_{N}(*t*) + *R*_{LTR}(*t*).

Our initial simulations indicated that the model structure described above was unrealistic for faster life histories in that LTRs did not reach high frequencies within the population, whereas there are many examples of species with faster life histories which display signatures of retroviral infection [3]. We therefore modified the model to make it more biologically realistic, and by relaxing some of the assumptions of the original *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model, reduced the flow of individuals from right to left (figure 1*d*). Vertical transmission of retroviruses from mother to offspring is quite likely [36,38], and we altered the model to reflect this. Offspring born to infected mothers (*I*_{X}) now enter the *I*_{X} compartment, and offspring born to mothers with a functioning version of the endogenous retrovirus (*I*_{E}) that have experienced mutation or recombinational deletion, before immunity (EDI) has arisen, enter the *I*_{X} compartment. The modified equations are given in equation (3.6).

### (d) Parameter values for the model

The survival and fertility rate parameters indirectly influence the rate of movement between all compartments—these values also determine the life history of the species [39]. Large survival rate parameters and small fertility rate parameters correspond to a species with a slow life history, whereas large fertility rate parameters and small survival rate parameters correspond to a species with a faster life history. The model can consequently be used to examine how life history influences the probability of incorporation of an exogenous retrovirus into the host genome.

The maximum population growth rate the population can achieve occurs when all individuals are in the *S* and *R*_{LTR} classes. When this occurs, the population growth rate is *ϕ*_{S} + *τ*. We set this to be greater than unity to prevent extinction. For all models and simulations we report, we constrain *ϕ*_{S} + *τ* = 1.016. We chose this constraint as extinction did not occur in our simulations for this value, but could occur if the value was smaller.

We alter the life history of the species by changing the values of *ϕ*_{S} and *τ* such that their sum always equals 1.016. As *τ* gets larger, the life history speeds up, and as *ϕ*_{S} increases the life history slows down. We vary *ϕ*_{S} and *τ* in increments of 0.01. For each life history, we then independently vary values of all other parameters between zero and unity (they are all rates). Simulations are run for 5000 time steps before we record the proportion of the population in the *R*_{LTR} compartment.
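The life-history sweep described above can be sketched as a simple grid of survival and fertility pairs. The following is an illustrative sketch, not the authors' R code. The constraint of 1.016 and the 0.01 increment come from the text; the function name and the choice to run *ϕ*_{S} from 0.01 to 0.99 (excluding the end points) are assumptions.

```python
GROWTH_CAP = 1.016  # phi_S + tau, the constraint stated in the text
STEP = 0.01         # increment for phi_S stated in the text


def life_history_grid(cap=GROWTH_CAP, step=STEP):
    """Return (phi_s, tau) pairs whose sum equals `cap`.

    Small phi_s / large tau is a fast life history; large phi_s / small tau
    is a slow one. The open interval 0 < phi_s < 1 is an assumption.
    """
    n_steps = int(round(1.0 / step))
    grid = []
    for k in range(1, n_steps):
        phi_s = round(k * step, 10)    # rounding avoids float drift
        tau = round(cap - phi_s, 10)   # fertility is fixed by the constraint
        grid.append((phi_s, tau))
    return grid


if __name__ == "__main__":
    for phi_s, tau in life_history_grid()[-5:]:
        print(phi_s, tau)
```

Each pair would then be combined with independently varied values of the remaining rate parameters, and each combination run for 5000 time steps.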

In our initial model, and in the absence of any published information on mortality rates attributable to ERVs, we conservatively impose a mortality rate of 0.03 attributable to the exogenous virus (1−*ϕ*_{X}), and a rate of 0.015 attributable to the endogenous virus (1−*ϕ*_{E}). We make the assumption that the mortality rate of the *I*_{E} group is less than that of the *I*_{X} compartment, because exogenous viruses typically kill host cells they infect, whereas ERVs do not. We impose a survival cost of incorporation of the virus into host DNA, because the host continues to produce virions. The *I*_{X} compartment is likely to produce more virions than the *I*_{E} group for the same reason that they have different mortality rates. For this reason, we also assume *I*_{X} individuals are more infectious; we set *β*_{E} = 0.4 and *β*_{X} = 0.5 in our initial model.

There is little available information about values for the other parameters in the literature. From one of the most studied genomes, the human genome, the rate of loss of an active ERV at a point in the genome via mutation or recombinational deletion is very low (in humans, the latest estimates of mutation rates are around 1.2 × 10^{−8} per site [40,41], and rates of recombinational deletion for HERVs are estimated at 4.3 × 10^{−7} to 8 × 10^{−5} [42]). Little is known about the rates of integration. For one of the youngest HERV families, HERV-K, many insertions are shared with chimpanzees. However, a number of HERV-K insertions are human specific (approx. 130; calculated from the reference sequence and available literature), having integrated after the divergence of humans and chimpanzees (approx. 7 Ma) [43]. These are clearly those that are assumed to have drifted to fixation; many other insertions would have been lost (due to drift). Assuming a generation time of 20 years and effective population size of 10 000, we can calculate that the probability of a new retroviral insert per genome per generation is 3.7 × 10^{−4}, which is again quite low. For our model, the key issue is that the rate of insertion per genome is greater than the rate of loss. We consequently assume *α* < *γ*. The specific values of *α* and *γ* determine the rates of spread of the epidemic through the population, but not the equilibrium outcome.

Little is known about the rates at which immunity would arise and so we arbitrarily set *θ* = 0.05 and *θ*′ = 0.01. Some of these values may seem rather high. However, analysis of the model reveals that the dynamics it predicts are relatively constant across a wide range of parameter values (see §4) and that one of our key results depends upon the ratio of *θ* to *θ*′ (§4). In other words, the absolute parameter values of *α*, *θ* and *θ*′ are relatively unimportant; it is their relative values that determine the dynamics. As the values of *α*, *θ* and *θ*′ get closer to zero, the epidemic lasts longer and simulations need to run for longer before the asymptotic equilibrium is achieved.

All simulations were conducted in R v. 2.15.1 [44].
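The per-generation insertion probability quoted in this section (3.7 × 10^{−4}) can be reproduced with a back-of-envelope calculation. The sketch below is one reading of how that figure may arise, not a calculation given in the source. It assumes the insertions are neutral, so the long-run rate at which insertions fix equals the rate at which new insertions arise per genome, and the effective population size cancels. The function and constant names are hypothetical.

```python
HUMAN_SPECIFIC_INSERTIONS = 130  # approx. count of human-specific HERV-K insertions (from the text)
DIVERGENCE_YEARS = 7e6           # human-chimpanzee divergence, approx. 7 Ma (from the text)
GENERATION_YEARS = 20            # generation time assumed in the text


def insertion_rate_per_generation(fixed_insertions, divergence_years, generation_years):
    """Fixed insertions per generation since divergence.

    ASSUMPTION: under neutrality the fixation rate equals the per-genome
    rate of new insertions, so no population-size term appears here.
    """
    generations = divergence_years / generation_years
    return fixed_insertions / generations


if __name__ == "__main__":
    rate = insertion_rate_per_generation(
        HUMAN_SPECIFIC_INSERTIONS, DIVERGENCE_YEARS, GENERATION_YEARS
    )
    print(f"{rate:.2e}")
```

With these inputs there are 350 000 generations since divergence, and 130 insertions over that span gives a rate consistent with the quoted value to two significant figures.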

## 4. Results

We start by focusing on a slow, approximately human life history, simply because more is known about human genomes than about those of any other species. We set *ϕ*_{S} = 0.95 and *τ* = 1.016 − 0.95 = 0.066. After considering a slow life history, we explore the dynamics for life histories of other speeds.

### (a) *SI*_{X} model for a human life history

The dynamics of the *SI*_{X} model are straightforward to understand, depending on only four parameters (equation (3.1)). The proportion of the population in the infected class decreases as the fertility rate, *τ*, increases. This is because the second term on the right-hand side of (3.1) increases in size as all newborns enter the susceptible class. As the infection rate, *β*, increases in relation to *ϕ*_{S}(*ϕ*_{X} − 1), then the proportion of infected individuals increases as the second term on the right-hand side of (3.1) becomes smaller. When the virus imposes no mortality cost, then the proportion of infected simplifies to 1 − *τ*/*β* for all values of *τ* < *β*.

### (b) *SI*_{X}*I*_{E} model for a human life history

Figure 2 provides complete results for the *SI*_{X}*I*_{E} model across the full range of parameter values for *α*, *γ*, *β*_{E} and *β*_{X}. We do not perturb *ϕ*_{S}, *ϕ*_{X}, *ϕ*_{E} or *τ*, as at this stage we focus on a species with a slow life history and we assume both the endogenous and exogenous retroviruses impose a mortality cost.

The frequency of the population in the *I*_{E} compartment increases as the rate of incorporation (*γ*) increases relative to the rate of loss (*α*), with the highest frequencies achieved when *γ* is high and *α* is low, a pattern observed in data (please refer to our reasoning for values of *α* and *γ* in §3*d*). In addition to the effects of *α* and *γ*, the rates of infection, *β*_{E} and *β*_{X}, also determine the frequency of the population in the *I*_{E} compartment. When *β*_{X} is small, few individuals enter the *I*_{X} compartment, with even fewer making it through to the *I*_{E} compartment.

The most likely region of the parameter space is highlighted in the bottom left-hand corner of figure 2, where rates of loss owing to mutation or recombinational deletion are close to zero and rates of incorporation are also small. This suggests that a large proportion of the population ends up in the *I*_{E} compartment under likely values of *α* and *γ*.

### (c) *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model for a human life history

This model differs from the previous model by containing two recovered classes: one for individuals who gained immunity (EDI) to the exogenous virus from an endogenous retroviral insertion in the genome, which has subsequently undergone recombinational deletion after the threat of the exogenous virus has passed, leaving the characteristic signature LTR (*R*_{LTR}); and one for those that mounted a successful immune response (cellular/humoral immunity) to the exogenous retrovirus without endogenization (*R*_{N}). The two additional rates included in this model are *θ* and *θ*′, which determine the rate at which immunity arises. These rates determine the rate of movement out of the *I*_{X} and *I*_{E} compartments, respectively. For individuals entering the *R*_{N} compartment, immunity is not inherited, and their offspring return to the susceptible compartment. The *R*_{LTR} compartment is an absorbing state: there is no way out. All parameters and definitions used are summarized in table 1.

The dynamics of the full model can be understood by building on insights gained from analysis of the *SI*_{X}*I*_{E} model (figure 2). A typical run of the model, where a high proportion of the population ends up in the *R*_{LTR} compartment, is displayed in figure 3. The exogenous and endogenous infections initially lead to a decrease in population size. This is because the population consists of a large proportion of infected individuals that have lower fitness than individuals in the susceptible or recovered compartments. As individuals transition into the *R*_{LTR} recovered state the population starts to grow again and the proportion of individuals in the *R*_{LTR} compartment converges to approximately 80%. When the *SI*_{X}*I*_{E} model predicts a large proportion of the population is in the *I*_{E} compartment at equilibrium, then a larger proportion of individuals will be expected to transition into the *R*_{LTR} class for a given value of *αθ*′ than when it predicts a smaller proportion. Similarly, a greater proportion of the population will end up in the *R*_{N} compartment and subsequently back in the *S* compartment for a given value of *θ* when the *SI*_{X}*I*_{E} model predicts a high proportion of individuals in the *I*_{X} class. Consequently, the dynamics of the full model are determined by the dynamics of the *SI*_{X}*I*_{E} model as well as the relative values of *θ* and *αθ*′. For an approximately human life history, we found that a large proportion of individuals ended up in the *R*_{LTR} compartment, regardless of the values of the transition rates (figure 4 for a value of *ϕ*_{S} = 0.95).

### (d) *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model for faster life histories

The model describes the flow of individuals between compartments. If a large proportion of the population is to end up in the *R*_{LTR} compartment, a greater proportion of individuals need to flow from left to right in figure 1*c* than vice versa. The primary process generating flow from right to left is reproduction, with newborn individuals from the *I*_{X} compartment entering the susceptible compartment, along with newborns from the *I*_{E} compartment that have not evolved immunity, but that have lost the active endogenous virus at the locus in question due to recombinational deletion or mutation, and newborn individuals from the *R*_{N} compartment. As we increase the speed of the life history, with reproduction (*τ*) being greater than survival (*ϕ*_{S}), we would expect to increase the flow of individuals from right to left. Figure 4 illustrates the effects of varying the parameter values of the *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model outlined in figure 1*c*. For fast life histories (where the birth rate (*τ*) is high and survival (*ϕ*_{S}) is low), the flow of individuals into the *S* compartment gets sufficiently high that the growth rate of the susceptible class dominates the dynamics and few individuals make it to the *R*_{LTR} class for nearly all values of parameters. Under these assumptions, it is apparently impossible to get a high proportion of the population in the *R*_{LTR} class. This is unlikely as there are examples of species with faster life histories, which display signatures of retroviral infection [3].

### (e) A modified *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model

Both *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} models predict a large proportion of the population in the *R*_{LTR} compartment for approximately human life histories (compare figures 4 and 6 for large values of *ϕ*_{S}). For slow life histories, the modified *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model produces dynamics that are very similar to those observed in the original *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model displayed in figure 3. However, the dynamics of the two models differ for faster life histories; compare, for example, figures 4 and 6 for values of *ϕ*_{S} < 0.9. In the modified *SI*_{X}*I*_{E}*R*_{N}*R*_{LTR} model (figure 1*d*), the only route back to the susceptible class occurs when *θ* is high, i.e. it is easy to acquire immunity to the exogenous retrovirus. This is illustrated in figure 5. For values of *θ* above 0.1, few individuals enter the *R*_{LTR} class, suggesting that if immunity to the exogenous virus can be acquired easily, we would not expect to see incorporation (assuming incorporation provides a route to immunity).

When *θ* is low (i.e. it is difficult to acquire immunity to the exogenous retrovirus), all individuals end up in the *R*_{LTR} class across a wide range of parameter values, even when *α* (rate of loss) and *γ* (rate of insertion) are small and equal. This result is unaffected by the life history of the host, except close to the limit (when *ϕ*_{S} is close to zero or close to unity). Figure 6 shows how parameter values have relatively little influence on the proportion of the population that ends up in the *R*_{LTR} compartment across the range of life histories.

## 5. Discussion

ERVs and their signatures represent a significant proportion of vertebrate genomes, yet the role of disease in driving endogenous retroviral signatures to high frequencies has not previously been explored. Through the novel application of epidemiological models, we show that endogenous retroviral signatures can reach high frequencies at a particular locus, if integration provides an immunological advantage and the rate at which immunity is acquired to the exogenous virus is much higher in individuals that have the incorporated retrovirus, than the rate at which immunity arises to the exogenous virus in those individuals without an incorporated virus. Numerical analysis of our model suggests that an LTR at a locus can reach high frequencies when the rate at which immunity arises to the exogenous retrovirus is lower than the rate at which immunity arises with the ERV. This result provides theoretical support for the idea that endogenization can help a host evolve immunity to a retroviral infection [12,13,27,45]. There are a number of examples of ERVs conferring some immunity to related exogenous retroviruses through a variety of mechanisms. It has been demonstrated that in mice, expression of the Env protein from an ERV binds to the same cell surface receptors as a related exogenous retrovirus, effectively blocking the entry to the cell of the exogenous form [29]. Similar examples also exist in chicken and sheep [9,13,30]. Other examples target virus replication such as the Fv1 locus in mice [46,47]. A recent review highlights more examples of co-option of ERVs as a host defence mechanism to retroviruses [26]; with the abundance of genomic data available, there is no doubt that more examples will be discovered.

Our model reveals the range of parameter values that allow an LTR at a single location in the genome to reach fixation within a host population. But why do we see so many LTRs in many vertebrate genomes? We propose a hypothesis that once the endogenous process starts, an exogenous retrovirus incorporates at numerous locations in the host genome. At the majority (potentially all) of these locations the insertion provides no host benefit via promoters in the LTR or the co-option of a viral gene. The ERV continues to produce virions until immunity (EDI) occurs or the exogenous counterpart is no longer a threat. At this point, mutation and recombinational deletion would build up at the loci where the virus had incorporated, leaving the LTR signatures of endogenous integration as the active virus is removed. If this process were to occur, then we would expect to see solo LTR formation occurring many times at the same locus, which has been previously described [8]. Additionally, if incorporation of a new retroviral family was rare, then we might expect to see signatures of bursts of retroviral endogenization that occur during a period, followed by periods of inactivity. Existing data appear consistent with such a hypothesis. For example, the most recently active retroviruses in the human genome belong to the HERV-K family. The oldest members of this family integrated before the divergence of Old World monkeys and apes (approx. 25 Ma), but HERV-K elements have undergone several periods of expansion throughout primate evolution, with a number of insertions unique to humans, some of which are polymorphic [8,48–51]. However, there is little evidence of currently active ERVs in screened individuals. It is possible that immunity to HERV-K has now evolved and spread through the population, and that mutation and recombinational deletion have removed functional ERVs, so that we see numerous LTR signatures of past infections.

It is likely that there are other forces which would influence the subsequent spread through a genome after the initial invasion event by EDIs, for which our model does not account. For example, there may be interference between multiple insertions in the genome, and it is unlikely to be a straightforward additive effect of immunity: more insertions may not necessarily mean more immunity. As the various mechanisms of EDIs are better understood, these factors can then be incorporated into the model.

Most ERV insertions are ancient, and consequently it is difficult to say why those insertions were so successful at invading the host genome. In the absence of exogenous viral counterparts to these older insertions, we can only speculate as to how many of these ERV insertions would fall into the category of EDIs. There are very few real-time genome invasions, with an exogenous viral counterpart, to test this hypothesis. In humans at least (for which most data exist), for most HERV lineages, the main method of proliferation through a genome appears to be by reinfection [52], which would be in line with what we propose. However, we acknowledge that there are also additional copying mechanisms that contribute to the spread of a few HERV families within the genome, such as retrotransposition [53], and then there are those ERVs that possess a degraded *env* gene and whose main method of proliferation now appears to be transposition [54], on which our model would clearly have no bearing.

It could be the case that the majority of ERVs present in genomes are neutral and have drifted to fixation, and there are a number of models that describe these cases [55–57]. Our results do not contradict genetic drift as the main evolutionary mechanism driving ERV insertions to fixation: we merely provide an alternative for consideration. In those instances where the insertion does confer immunity (EDI), our results suggest that retroviral signatures can reach high frequencies if immunity cannot be achieved in any other way (cellular or humoral). Realistically, even a tiny selective advantage that increases the frequency of these insertions in the population would greatly reduce the chances of these insertions being lost to drift. It could therefore be a combination of some immunological advantage and genetic drift that leads us to observe the numerous signatures of past retroviral infections for those particular ERV lineages involved in EDI.
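The interplay between a small selective advantage and drift can be illustrated with the classical diffusion approximation for the fixation probability of a new allele (Kimura's formula). The sketch below is not part of the epidemiological model presented here. It assumes a diploid, randomly mating population whose effective size equals its census size `n`, a single new copy at initial frequency 1/(2n), and a constant additive selection coefficient `s`. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Diffusion approximation for the fixation probability of one new copy.

    Assumptions (illustrative only): diploid population, effective size
    equal to census size n, additive selection coefficient s, initial
    frequency p0 = 1 / (2n).
    """
    p0 = 1.0 / (2 * n)
    if s == 0:
        # A neutral allele fixes with probability equal to its frequency.
        return p0
    # u(p0) = (1 - exp(-4 n s p0)) / (1 - exp(-4 n s)).
    # expm1 keeps precision when s is very small; the signs cancel.
    return math.expm1(-4 * n * s * p0) / math.expm1(-4 * n * s)


if __name__ == "__main__":
    n = 1000  # hypothetical host population size
    neutral = fixation_probability(0.0, n)
    for s in (0.0, 0.001, 0.01):
        p = fixation_probability(s, n)
        print(f"s={s}: P(fix)={p:.5f} ({p / neutral:.1f}x neutral)")
```

Under these assumptions, an advantage of s = 0.01 in a population of 1000 gives a fixation probability close to 2s, roughly 40 times the neutral value of 1/(2n), although most new copies are still lost by chance.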

All models are wrong; they are designed to be simplifying caricatures of reality and make assumptions. However, they can be useful in posing hypotheses to explain patterns in data. One process we do not include is that, upon initial invasion of a genome by a retrovirus, the retrovirus would be present in the heterozygous state. We do not account for differences between heterozygotes and homozygotes as we are looking at conditions for the invasion of the insertion, and our results reveal that a large proportion of the population ends up in the *R*_{LTR} compartment regardless of the value of *α*, the parameter that will change as a function of the proportion of the population in the heterozygous/homozygous state.

Our model suggests potentially fruitful areas of future research. First, relatively little is known outside of humans about rates of endogenization, rates of mutation and recombinational deletion, and rates of the acquisition of immunity. Genomic studies of species in which ERVs are still active should help in answering these questions [9,10,13], as will comparative cross-species analyses of vertebrate genomes.

There are various other modifications we could make to the model that we do not explore here. Some viral infections impact fertility rates rather than survival rates [58–61]. We could incorporate such infections into our models by including compartment-specific fertility rates, much in the same way as we currently have compartment-specific survival rates. We strongly suspect that if we were to do this our general conclusions would not change unless the choice of compartment-specific fertility rates substantially impacted the rate of flow of individuals from right to left in figure 1*d*. The models we have developed do not incorporate specific mechanisms. We hope to incorporate more detailed immunological mechanisms into our future models, but in order to do that a better understanding is required of how immunity to retroviruses evolves, and the various mechanisms via which endogenization can aid the evolution of immunity, which may also vary with life history.

Under the scenario of retroviral insertions providing immunity (EDI), our model reveals the conditions under which a provirus involved in such immunity would spread to fixation. Once the threat of the exogenous retroviral counterpart is removed, the selection pressure to retain the provirus is removed, and these insertions are then free to undergo recombinational deletion, leaving the LTR signatures of endogenous retroviral infection that are so frequently observed in host populations. Our findings suggest that ‘neutral’ insertions are likely to reach high frequencies, or fixation, when hosts can more easily evolve immunity to the exogenous type with an endogenous version of the retrovirus. If future research provides support for this hypothesis, it is possible that retroviruses have played a major role in the evolution of the vertebrate immune system with regard to this form of immunity.

## Acknowledgements

We would like to thank anonymous reviewers, Louise Johnson, Aris Katzourakis and Richard Nichols for comments and suggestions.

## Footnotes

One contribution of 13 to a Theme Issue ‘Paleovirology: insights from the genomic fossil record’.

- © 2013 The Author(s). Published by the Royal Society. All rights reserved.