## Abstract

Recently, several statistical methods for estimating fine-scale recombination rates using population samples have been developed. However, currently available methods that can be applied to large-scale data are limited to approximated likelihoods. Here, we developed a full-likelihood Markov chain Monte Carlo method for estimating recombination rate under a Bayesian framework. Genealogies underlying a sampling of chromosomes are effectively modelled by using marginal individual single nucleotide polymorphism genealogies related through an ancestral recombination graph. The method is compared with two existing composite-likelihood methods using simulated data. Simulation studies show that our method performs well for different simulation scenarios. The method is applied to two human population genetic variation datasets that have been studied by sperm typing. Our results are consistent with the estimates from sperm crossover analysis.

## 1. Introduction

Inferring how recombination rates vary across chromosomes is a fundamental problem in population genetics owing to its implications for evolutionary studies, disease association mapping and understanding the molecular basis of recombination. Pedigree-based estimates of recombination rate are typically at a coarse scale due to few informative meioses. Although sperm typing can provide fine-resolution estimates of recombination rate, the laborious nature of the experimental techniques limits its usefulness to comparisons on a small region (typically less than a megabase interval). Sperm typing is also restricted to estimates of male recombination. Recently, population-genetics-based methods have been applied to estimate sex-averaged recombination rates and to measure the changes in recombination rates over human genomes (reviewed in Hellenthal & Stephens 2006).

Although a number of population genetic models have been proposed based on the use of a coalescent process with recombination (Kingman 1982*a*,*b*; Hudson 1990), the inference methods that are currently widely used for inferring recombination rates are based on approximated likelihoods (typically a composite likelihood). Approximate-likelihood methods may provide consistent point estimates, but the likelihoods obtained do not have standard properties, and especially when the data contain limited information, the approximate-likelihood methods may not perform as well as methods that use the full likelihood. Full-likelihood methods have the advantage that they use all information contained in the data. However, existing full-likelihood methods (Griffiths & Marjoram 1996; Kuhner *et al*. 2000; Nielsen 2000; Fearnhead & Donnelly 2001) are too computationally expensive to be applied to large-scale genomic data (reviewed in Stumpf & McVean 2003).

The use of composite-likelihood methods in estimating *ρ* was first suggested by Hudson (2001), although similar ideas have previously been used in linkage disequilibrium (LD) mapping methods (reviewed by Rannala & Slatkin 2000). McVean *et al*. (2002) extended Hudson's (2001) composite-likelihood approach to allow recurrent mutation. Other approximate methods have also been developed. Li & Stephens (2003), for example, developed a method based on an approximation to the conditional likelihood, which they called the ‘product of approximate conditionals (PAC)’ likelihood. The two approximate-likelihood methods have recently been applied to study properties of recombination rate across human genomes (e.g. Myers *et al*. 2005; Graffelman *et al*. 2007).

Here, we develop a computationally tractable full-likelihood-based method in a Bayesian framework. In our method, the chromosomal intervals are treated as vectors of discrete points (corresponding to sampled marker sites), and the genealogy underlying a sample is effectively modelled by excluding non-ancestral recombination and lineages. We compare the performance of our method with the composite-likelihood method (implemented in `LDhat`) and the ‘PAC’ likelihood method (implemented in `PHASE`) using simulated data, and applied our method to two human population genetic variation datasets (Jeffreys *et al*. 2001, 2005).

## 2. Bayesian inference via single nucleotide polymorphism genealogies

Coalescent theory provides a general framework for population genetic inference (Kingman 1982*a*,*b*; Hudson 1990). We let *G* denote the genealogy underlying a sample of single nucleotide polymorphism (SNP) haplotypes or genotypes (denoted by ** X**). Let

**be a vector of parameters, including**

*Θ**θ*=4

*N*

_{e}

*μ*and

*ρ*=4

*N*

_{e}

*c*, where

*N*

_{e}is the effective population size;

*μ*is the site-specific mutation rate per generation; and

*c*is the recombination rate per generation in cM Mb

^{−1}. The posterior density of parameters,(2.1)can be estimated using Markov chain Monte Carlo (MCMC). However, developing inference methods based on the coalescent with recombination (ancestral recombination graph, ARG) is challenging because the problem is high dimensional (the probability of any one instance of the genealogy is small) and the dimension varies (due to the variable number of coalescent–recombination pairs). These factors can lead to inefficient parameter estimates because a large portion of the ARG is not related to the data and does not contribute to the likelihood calculations.

If ** X** represents the data for one marker site, the genealogy underlying the sample is a coalescent tree (e.g.

*τ*

_{0}). With

*k*linked markers (

*k*>1), the genealogy can be considered an array of correlated coalescent trees (e.g. ), which are jointly sufficient for the likelihood calculation. The ARG provides an indirect way to sample genealogies for each of the markers and is the biologically appropriate prior for the genealogies under a model of the coalescent with recombination.

We develop a method for estimating *ρ* using population samples of SNPs based on the ARG in a Bayesian framework. Both constant and variable recombination models are considered. We let *G*_{S} denote the joint multilocus SNP genealogy (described in §3) underlying the sample. Given *G*_{S}, the genealogical trees (** τ**) for each marker position can be obtained (figure 1). The posterior distribution of

**is(2.2)which can be numerically evaluated by MCMC.**

*ρ*### (a) Prior on genealogy (SNP genealogy, G_{S})

The most general representation of the ARG (Griffiths & Marjoram 1996) models the positions of recombinations as events on an interval of the real line. The probability density of recombination events on this interval depends on the spatial distribution of local recombination rates, hotspots, etc. Each recombination event separates a chromosomal interval into two disjoint intervals (split at the recombination point) that subsequently undergo independent coalescence and recombination processes. To analyse SNP markers, it is efficient to reduce this description to exclude chromosomal segments that are not ancestral to the sample and for which markers are not informative.

Let ** r**=[

*r*

_{0}, …,

*r*

_{k−1}] be a vector of the total recombination rate between adjacent marker sites for

*k*markers on an interval. Note that

*r*

_{j}=

*ρ*

_{j}×

*d*

_{j}, where

*d*

_{j}denotes the distance in Mb between markers

*j*and

*j*+1, and

*ρ*=

*ρ*

_{0}, …,

*ρ*

_{k−1}is a vector of the

*ρ*for each interval between markers, where

*ρ*

_{j}=4

*N*

_{e}

*c*

_{j}×0.01 is the rate in units of per cent recombination per Mb on the interval between markers

*j*and

*j*+1 (in time units of 2

*N*

_{e}generations). Let

*Z*_{i}={

*Z*

_{ij}} be a Boolean vector, such that

*Z*

_{ij}=1 if the

*j*th marker for the

*i*th lineage is ancestral to the sampled chromosomes and 0 otherwise. We refer to this as the marker ancestry (MA) vector.

If a recombination occurs on the interval between markers *j* and *j*+1, the two resulting lineages *p* and *q* receive MA vectors determined by taking the Hadamard products,(2.3)(2.4)where ** W** and

**are both vectors of dimension**

*V**k*with

*W*

_{l}=0 for all

*l*≤

*j*,

*W*

_{l}=1 for all

*l*>

*j*,

*V*

_{l}=1 for all

*l*≤

*j*and

*V*

_{l}=0 for all

*l*>

*j*. An example is illustrated in figure 2. A coalescence event involves Boolean vector addition of elements in the two descendent vectors (

*Z*_{p}and

*Z*_{q}),(2.5)where

*Z*_{A}is the MA vector for the ancestral lineage (A).

The total recombination rate at any point in time is the sum of recombination rates across lineages. Note that if a lineage *i* contains only one non-zero element in *Z*_{i}, the recombination rate becomes 0 (e.g. *Z*_{q} in figure 2). Consider the example illustrated in figure 2; the prior density of *G*_{S} at time *t*_{m+1} (in 2*N*_{e} generations) at which a recombination event occurred on lineage *i* between markers *x* and *x*+1, given *m* lineages at *t*_{m}, is(2.6)where *L* is the first marker at the left, with *Z*_{il}>0, and *R* is the last marker at the right, with *Z*_{il}>0, for lineage *i*, and indicates the recombination rate between markers *j* and *j*+1 on lineage *y*, obtained by multiplying *r*^{(y)} by *Z*_{y}. Equation (2.6) represents the joint probability of exponential waiting time and recombination breakpoint.

### (b) Prior on recombination rate

Currently, the pattern of recombination hotspots from several genomic regions has been studied by sperm typing (Arnheim *et al*. 2007). As was discovered from these studies, the majority of recombination clusters in narrow regions known as recombination hotspots and recombination only occasionally occurs in non-hotspot regions. Since only recombination rates between markers contribute to the probability of the genealogy, a general model of variable recombination is assumed such that *ρ*_{i} between markers *i* and *i*+1, , is independent with a common prior density,(2.7)where *f*_{H}(*ρ*_{i}) and represent the prior densities of *ρ* within a recombination hotspot and background recombination rate (non-hotspot regions), respectively. It is assumed that and , and *μ*_{H}=10, *σ*_{H}=1, and . The 0.025 and 0.975 quantiles for *f*_{H}(*ρ*_{i}) and are [3102.73,156 367.45] and [20.91,1053.60], respectively. Parameter *p*_{H} is integrated in the MCMC, since it is unknown for any given chromosomal intervals.

If assuming *ρ* is constant (e.g. across short intervals), since more data points (markers) contribute to the estimate of the parameter, a uniform distribution bounded at 0 is used as the prior on *ρ*. Similarly, the prior on *θ* is assumed to be uniform and bounded at 0. Although we try to use minimal prior information, the effect of different prior densities on *ρ* and *θ* needs to be further investigated.

By increasing the efficiency of computations on the ancestral graph using SNP genealogies to allow the use of a full-likelihood-based model, the accuracy of estimates of *ρ* can be improved by comparison with approximate likelihoods, since all of the information contained in the data is used for obtaining the estimates for full-likelihood methods. Prior independent knowledge of how recombination rate varies and how recombination hotspots are distributed across chromosomes obtained from other studies (e.g. by sperm typing; reviewed in Arnheim *et al*. 2007) could also be incorporated into the prior used in an analysis to make use of such information and obtain more refined estimates of recombination rate.

## 3. Simulation studies

Smith & Fearnhead (2005) performed a simulation study to evaluate the performance of three existing composite-likelihood methods (Hudson 2001; Fearnhead & Donnelly 2002; Li & Stephens 2003) for estimating population recombination rate using sequence data. Here, we focused on conducting simulation studies to examine the statistical performance of our method (implemented in the program `InferRho`) and to compare our method with two existing widely used approximate-likelihood methods implemented in `LDhat` (McVean *et al*. 2004) and `PHASE` (Li & Stephens 2003; Crawford *et al*. 2004). The first two simulation studies were aimed at examining the performance of the methods by assuming a constant recombination rate across intervals, and examining the effect of different SNP ascertainment criteria, using either haplotypes (simulation study 1) or genotypes (simulation study 2). The third simulation study examined the performance of the methods when allowing for variable recombination rates across intervals. For all three simulation studies, the sample size was set to be 50 chromosomes, and it was assumed that *N*_{e}=10^{4} and *μ*=10^{−8} per site per generation under a Jukes–Cantor model, both realistic values for humans. Three program packages were used for these comparisons: `InferRho` v. 1.0; `LDhat` v. 2.1; and `PHASE` v. 2.1.

### (a) Simulation study 1 (*S*_{1})

*S*

In the first simulation study, we generated 150 independent genealogies of 50 chromosomes and simulated complete sequences, each for an interval of length 45 kb. We further assumed a constant *c*=1.13 cM Mb^{−1}, which is the average recombination rate across the human genome estimated by segregation analysis on pedigrees (Kong *et al*. 2002). In order to conduct a realistic simulation study, given a simulated genealogy and samples, the first 10 markers that satisfied the SNP ascertainment criterion were taken as the sample for estimating *ρ*. Three marker ascertainment strategies were considered, including no ascertainment (no restriction on minor allele frequency (MAF), ), MAF≥0.05 () and MAF≥0.1 (). The length of the chromosome interval spanned by polymorphic markers is a random variable that varies across simulations, from a minimum of 2.15 kb to a maximum of 39.5 kb in *S*_{1}.

For analyses using `InferRho`, the mode of the posterior distribution was used as the point estimate and the highest posterior density region was used as the credible set. The number of iterations was set to be 10^{7}, and the parameters were sampled every 500 iterations using the last 5×10^{6} iterations. For analyses using `LDhat`, the program `pairwise` in the `LDhat` v. 2.1 package was used. The parameter *θ* per site was set to the default value provided by the program (a finite-sites version of Watterson's estimate), max 4*N*_{e}*r* was set to be 100 and the number of points on the grid was set to be 201. For analyses using `PHASE`, the flag `-MR3` was used to indicate a constant recombination rate. The `-X10` and `-k999` options were also used. The first is recommended to increase the number of iterations of the final run by 10, and the second is for indicating that haplotype phases in the sample are known. The number of iterations, thinning intervals and burn-in were set to be 10 000, 100 and 10 000, respectively, which are 100 times larger than the default settings. As was recommended, the median of the results from file `*.out_recom` was taken as the point estimate. The distributions of the point estimates of *ρ* using datasets , , obtained by use of the three programs are shown in figure 3.

### (b) Simulation study 2 (*S*_{2})

*S*

The second set of data (150 genealogies of 50 chromosomes) was simulated using the same simulation method and parameters as in *S*_{1}, except that haplotypes were randomly paired to create multilocus genotypes of individuals. The datasets are labelled as , , , corresponding to the three ascertainment criteria described above. In our method, we integrate over haplotypes in the MCMC and haplotype phases are jointly estimated. The posterior distribution of haplotypes for each genotype in the sample is reported.

Two parallel chains were run for the program `InferRho`, and the number of iterations, thinning interval and burn-in parameters were 3×10^{6}, 500 and 3×10^{6}, respectively. For `LDhat`, the program `complete` was used for generating the likelihood lookup table and `pairwise` was used for obtaining estimates of *ρ* for each sample. Because it is computationally expensive to obtain the likelihood table, the number of points on the grid was set to be 101, which is the default value, and other parameters are the same as in *S*_{1}. For `PHASE`, the `-k999` flag was removed but the remaining settings are the same as in *S*_{1}. The distributions of the point estimates obtained using the three programs are shown in figure 4. The average accuracy of haplotype phasing (the percentage of accurately phased genotypes in a dataset) across all three datasets obtained from `InferRho` and `PHASE` was 0.939 and 0.967, respectively.

### (c) Simulation study 3 (*S*_{3})

*S*

The third set of simulations assumes that one recombination hotspot exists near the centre of a 30 kb chromosomal interval, located at a position between 14 and 15.5 kb from the left of the interval. The strength of the hotspot is 25 cM Mb^{−1} and the background recombination rate is 0.5 cM Mb^{−1}. In total, 100 genealogies were simulated and polymorphic sites with MAF≥0.05 were used for the analysis. In this simulation study, all of the polymorphic markers spanning the region were used, so the number of markers in the sampled haplotypes varied (e.g. from 16 to 64 in *S*_{3}).

For analyses using `InferRho`, the number of chains, burn-in and number of iterations are the same as in *S*_{2}. The options `-MR0`, `-X10` and `-k999` were used for `PHASE`, and other iteration parameters are the same as in *S*_{2}. For analyses using `LDhat`, `complete` was used to generate the likelihood lookup table, and `interval` was used to estimate variable recombination rates. Default values for the iteration and penalty parameters `-its 10000000 -bpen 5 -samp 2000` were set for the program `interval`, and other parameters are the same as in *S*_{2}. The program `stat` in the `LDhat` package was used for summarizing the results with option `-burn 50` (corresponding to 10^{5} iterations) to obtain the point estimate and confidence interval of *ρ*.

The results from all three programs are summarized in table 1. Because the values of *ρ*_{i} vary between each marker, bias was estimated as(3.1)where *n* denotes the number of datasets (=100 in this study); *k*^{(i)} denotes the number of SNPs; and denotes the true *ρ* between markers *j* and *j*+1 in dataset *i*. Mean square error, coverage and the average width of 95 per cent credible set are calculated similarly. The estimates of *ρ*_{j} across the interval from the first 25 samples are plotted in figure 5.

## 4. Analysis of human leucocyte antigen and MS32 regions

We applied our method to two datasets from the human leucocyte antigen (HLA) and MS32 regions that have been previously studied by sperm typing (Jeffreys *et al*. 2001, 2005). The HLA dataset consists of 274 SNPs distributed across 0.216 Mb, sampled from 50 unrelated individuals. Six hotspots were revealed in the sperm-typing study (Jeffreys *et al*. 2001) and the data have been previously analysed using a composite-likelihood method (McVean *et al*. 2004). To reduce the computation time, the region is divided into sub-regions (each with 20 markers), although it is feasible to analyse the entire region simultaneously. The variable recombination model described above was used and the results obtained from `InferRho` are shown in figure 6. The estimates of the centres of the hotspots discovered by sperm typing are also indicated. Recombination rates obtained using population genetic data include both female and male rates, sex averaged over many generations, and only the population size-scaled recombination rate can be obtained (population size may vary across region of genomes due to the effects of selection, demographic events, etc.); we therefore would not expect the inferred intensities of hotspots from sperm-typing and population genetic inferences to agree completely. Nonetheless, the inferred locations of the hotspots obtained from sperm typing versus our method are very close. This is consistent with the findings from a recent study suggesting that recombination hotspot locations are similar between males and females (Coop *et al*. 2008).

Jeffreys *et al*. (2005) investigated recombination rates in the MS32 and surrounding region by both sperm-typing and coalescent analysis of genotypes (recombination rate estimated using `LDhat` and `PHASE`). The MS32 dataset consists of 206 SNPs sampled from 80 individuals and distributed across 0.206 Mb. For our analysis using `InferRho`, the region was again divided into sub-regions, each with 20 markers, and the results are shown in figure 7. The estimates are in general consistent with those obtained from sperm crossover analysis. In particular, hotspots MS32, MSTM1a, MSTM1b and MSTM2 discovered by sperm typing, which are only weakly evident from a `LDhat` and `PHASE` analysis (fig. 1*b* of Jeffreys *et al*. 2005), are very intense when these data are analysed by our method.

## 5. Discussion

Here, we present a new method for estimating recombination rate using a population sample of either haplotypes or genotypes. The genealogy of the sampled chromosomes is represented by a SNP genealogy, which is composed only of lineages carrying sites ancestral to the sample. The chromosome intervals in the SNP genealogies are represented by MA vectors. By using the SNP genealogies as a prior on the genealogy relating the sampled haplotypes, full-likelihood estimation of recombination becomes feasible. If little information is available about the recombination rate for the sampled interval, an uninformative prior on recombination rate can be used in the analysis. On the other hand, information on recombination rate obtained from other independent studies can be incorporated into the analysis through a more informative prior to obtain refined estimates of recombination rate, or possibly to address questions that cannot be obtained from either study alone (e.g. estimating female recombination rate by combining estimates of male recombination rate obtained from sperm typing and estimates of the sex-averaged recombination rate obtained by population genetic methods). The posterior distribution of recombination rates is approximated by a reversible-jump MCMC (RJMCMC). In the Metropolis–Hastings (MH) algorithm, proposed changes include modifying the SNP genealogy by changing a local topology or by adding (or removing) a pair of recombination and coalescent nodes, modifying ancestral alleles, modifying haplotypes (if the phase of the data is unknown), modifying alleles at sites with missing alleles in the sample and modifying the parameters *θ* and *ρ*. The method is implemented in the package `InferRho`.

Three simulation studies were conducted to evaluate our method and to compare its performance with two other approximate-likelihood methods, including a composite-likelihood method and a PAC likelihood method, implemented in packages `LDhat` and `PHASE`, respectively. The first (assuming haplotypes in the sample) and the second (assuming genotypes in the sample) simulation studies are aimed to examine the performance of three methods in the cases that the size of a chromosomal interval is relatively small and the recombination rate is constant across the interval. In this case, information in the data is limited due to the small interval size (or the small number of polymorphic sites), but all of the sites contribute to the estimate of *ρ* since a constant rate model is assumed. Three marker ascertainment criteria were considered. All three methods performed better as the number of informative markers increased (figures 3 and 4). As expected, the variance of the estimates of recombination rate obtained using genotypic data is larger than those obtained using haplotypes. The mode of the empirical distribution is centred on the true parameter value for our new method while the other methods have a mode near zero.

The third simulation study examined the performance of the methods when a recombination hotspot exists on an interval. The simulation scenario was chosen according to the recombination hotspot and background rate patterns obtained from previous sperm-typing studies. Because the number of markers on the haplotypes for each sample varies, samples with fewer polymorphic markers may contain less information about the recombination rate and hotspot. In general, more markers were more likely to yield accurate estimates of the location of a recombination hotspot. Table 1 shows the summary of the estimates of recombination rate across the region using all 100 simulated samples, and figure 5 shows the plots of recombination rate using the first 25 simulated samples obtained by use of the three methods. Our method has the smallest credible set with a coverage (0.935) close to the nominal value (0.95). In terms of MSE, `InferRho` and `PHASE` were very similar and both had MSE much lower than `LDhat`. The improved statistical performance of `InferRho` may be due to the fact that it uses the full likelihood and can therefore extract greater information from the data.

The fine-scale distribution of recombination rates in the HLA and MS32 regions has been studied previously by sperm typing. Although the sperm-typing studies only provide recombination estimates from males, the distribution of recombination hotspots obtained using population samples via our method is mostly consistent with those estimated by sperm typing. However, the intensity of the recombination hotspots obtained by the two different approaches is not completely in agreement. Reasons for this might be the differences in female and male recombination rates *per se*. Since only population size-scaled recombination rate (*ρ*=4*N*_{e}*c*×0.01) can be obtained from population genetic approaches, the difference in the strengths of recombination hotspots might also be due to the variation in *N*_{e} across the region owing to natural selection, migration, etc. This could lead to a larger or smaller *ρ* for population genetic inferences versus sperm typing even though *c* is identical in both cases.

For the simulation studies presented in this paper, a large number of iterations and multiple Metropolis-coupled MCMC (MCMCMC) chains were used to ensure convergence and to avoid the necessity of checking each run individually. The computational time needed depends on the data and the number of chains used. The use of MCMCMC notably improved the mixing of the chains. For the results presented in this paper, the data were analysed multiple times, and the results were highly consistent. The convergence was also analysed using the Bayesian output analysis (BOA) package of R (Smith 2005). As a general guide, 2×10^{6} burn-in and 2×10^{6} iterations can be used initially, although convergence should be examined using a post-chain analysis package such as BOA. Because our program allows the final state of the chain to be saved, more iterations can be added to previous runs if needed to ensure convergence. The program `InferRho` is available from http://rannala.org.

## Acknowledgments

This research was supported by NIH grant HG01988 to B.R.

## Footnotes

One contribution of 17 to a Discussion Meeting Issue ‘Statistical and computational challenges in molecular phylogenetics and evolution’.

- © 2008 The Royal Society