## Abstract

Hybrid zones are common in nature and can offer critical insights into the dynamics and components of reproductive isolation. Hybrids between diverged lineages are particularly informative about the genetic architecture of reproductive isolation, because introgression in an admixed population is a direct measure of isolation. In this paper, we combine simulations and a new statistical model to determine the extent to which different genetic architectures of isolation leave different signatures on genome-level patterns of introgression. We found that reproductive isolation caused by one or several loci of large effect caused greater heterogeneity in patterns of introgression than architectures involving many loci with small fitness effects, particularly when isolating factors were closely linked. The same conditions that led to heterogeneous introgression often resulted in a reasonable correspondence between outlier loci and the genetic loci that contributed to isolation. However, demographic conditions affected both of these results, highlighting potential limitations to the study of the speciation genomics. Further progress in understanding the genomics of speciation will require large-scale empirical studies of introgression in hybrid zones and model-based analyses, as well as more comprehensive modelling of the expected levels of isolation with different demographies and genetic architectures of isolation.

## 1. Introduction

Speciation is a fundamental evolutionary process that occurs by the evolution of reproductive isolation. There has been considerable recent progress documenting the genetics of reproductive isolation, and individual genes that contribute to isolation have been identified in several groups of organisms [1–3]. However, little is known regarding the genetics of speciation at the scale of a genome, including knowledge of the number and effect of loci that contribute to isolation, how these loci are distributed across the genome and how they contribute to genome-wide patterns of genetic variation. The study of hybrid zones is a powerful approach to address these questions.

Hybrid zones are common in nature, and offer direct observations of the evolutionary process of speciation and the genetic architecture of reproductive isolation [4–8]. When species diverge in allopatry and hybridize upon secondary contact, the products of meiosis and segregation in admixed individuals produce combinations of parental genotypes (or chromosomal blocks) that are tested by natural selection [9,10]. Hybrid zones offer tractable and important settings for the study of reproductive isolation because the introgression of foreign alleles is a direct measure of reproductive isolation. In contrast, population and species divergence often reflect the consequences of additional evolutionary processes, beyond those directly associated with reproductive isolation and speciation [11].

Introgression of loci will vary with their effect on fitness, and might take distinct forms reflecting the source and pattern of selection [9,12]. This includes variation in fitness caused by intrinsic hybrid incompatibilities or extrinsic selection. The geographical extent of introgression of loci (and linked genetic regions) that contribute to reproductive isolation should be reduced relative to the rest of the genome [4,9,13,14]. Conversely, introgression of loci that do not affect hybrid fitness (i.e. neutral loci) is affected by linkage disequilibrium, but should typically be more frequent and geographically extensive [9,14].

Introgression in hybrid zones has been quantified using geographical and genomic clines. Geographical clines describe the relationship between allele frequency or expected phenotype and geographical location [4,9,15–17]. Geographical clines are used to estimate the strength of the barrier to gene flow experienced by a neutral locus because of reduced hybrid fitness or other forms of selection, and allow inference of the number of genetic regions contributing to reproductive isolation [9,15,18]. Moreover, tests of geographical concordance and coincidence among genetic or phenotypic clines can detect variable introgression and have been used to identify loci potentially associated with reproductive isolation [19–21]. *Genomic clines* are mathematical functions that describe the probability of locus-specific ancestry along a gradient in genome-wide admixture or hybrid index, which is defined as the proportion of an admixed individual's genome inherited from one of two parental populations [12,22,23], with related models in earlier studies [15,24–26]. This is not a spatial gradient, rather nearly pure individuals of each parental type occupy opposite ends of the gradient. Thus, whereas geographical clines measure the movement of alleles across space, genomic clines measure the movement of ancestry blocks into different genomic backgrounds. Genomic cline models provide a way to generate a null-distribution for patterns of introgression that is conditioned on genome-wide admixture using a simple parametric model, a permutation method, or a hierarchical Bayesian framework [12,23]. This model framework can be used to identify outlier loci that are characterized by extreme introgression, and might be associated with adaptation or reproductive isolation.

Empirical studies of hybrid zones using these methods have repeatedly provided evidence for variable introgression among loci [18,21,24,27–29]. The observed variation among loci supports the hypothesis that reproductive isolation is an attribute of individual genomic regions rather than of the genome as a whole [30]. For example, sex chromosomes have been found to introgress relatively infrequently, consistent with theoretical expectations that such sex-linked loci should be important in speciation [18,19,31]. Both concordant and discordant patterns of locus-specific introgression in geographically separate hybrid zones have been reported [27,29,32]. Such comparisons allow analysis of how geographical, population and environmental variation contribute to the genetic architecture of isolation. Ultimately, comparisons of isolation in different settings and hybrid zones will enhance our understanding of the dynamics of species isolation and divergence, and the efficacy and polymorphism of isolating barriers [29,33,34].

Despite considerable progress in understanding the genetics of speciation, knowledge of the genome-level consequences and genome-wide signature of reproductive isolation is still in its infancy. In this paper, we begin to fill this gap by determining whether different genetic architectures of reproductive isolation lead to different and detectable patterns of introgression at a genome scale. To address this question, we first review and extend the recently developed Bayesian genomic cline model (described in §2*a*; [23]) to account for genomic autocorrelations in patterns of introgression for linked genetic regions. We then simulate admixture under different demographic conditions and a variety of genetic architectures of isolation, including reduced hybrid fitness owing to underdominance or epistasic incompatibilities, variation in the number and effect sizes of genes contributing to reproductive isolation, and variation in the genomic distribution of genetic factors affecting fitness. We analyse these simulations in the first place to determine the genetic outcomes of different models and summarize patterns of locus-specific introgression using the revised Bayesian genomic cline model. Finally, we discuss the implications for future genomic studies of the genetic architecture of isolation.

## 2. Model

### (a) Bayesian genomic cline model

The purpose of the genomic cline model is to quantify variable introgression of loci. With this model, we are specifically interested in the movement of parental ancestry segments into different hybrid backgrounds (measured by hybrid index) within an admixed population. Genomic clines are mathematical functions that quantify introgression of individual loci relative to a gradient in hybrid index. In other words, a genomic cline describes the probability of locus-specific ancestry given genome-wide ancestry or hybrid index. Gompert & Buerkle [23] recently described a Bayesian model for estimating genomic clines. This model contains two key genomic cline parameters, *α* and *β*, that are used to quantify variable patterns of introgression (figure 1). Additionally, these parameters form the basis for detecting outlier loci that might be associated with reproductive isolation. Genomic cline parameter *α* specifies an increase (*α* > 0) or decrease (*α* < 0) in the probability of locus-specific ancestry from parental population 1 and defines the centre of the cline (we assume two parental populations labelled as parental population 0 and 1). Genomic cline parameter *β* denotes the rate of change in the probability of ancestry along the genome-wide admixture gradient. *β* is the rate parameter and positive values of *β* specify a steeper cline, whereas negative values of *β* specify a wider cline (figure 1). Herein, we review the Bayesian genomic cline model before proposing an extension of this model for high-resolution, genome-wide genetic data. Readers interested in a more thorough description of this model should refer to Gompert & Buerkle [23].

The Bayesian genomic cline model assumes two parental populations are known *a priori*, and uses genetic data from these populations to estimate parental population allele frequencies. The model assumes Hardy–Weinberg and linkage equilibrium within parental populations (as in earlier studies [23,35] and similar models), but does not assume these populations are fixed for different alleles. Estimates of parental allele frequencies are necessary to estimate the ancestry for individuals from putatively admixed populations. Given the observed allele counts *ψ*_{0} and an uninformative Dirichlet prior, the posterior probability distribution for allele frequencies in parental population 0 (*π*_{0}) is:
where the product is taken across all *I* loci, and each Dirichlet distribution is parametrized by a vector of length *K*_{i} (*K*_{i} is the number of unique alleles observed for locus *i*). The posterior probability distribution for allele frequencies in parental population 1 is equivalently specified.

The genomic cline model assumes putatively admixed individuals are sampled randomly from each of *J* admixed populations, but no assumption is made about the geographical distribution of these populations. We model the ancestry of each allele copy in admixed individuals. Ancestry (**z**) simply denotes whether an allele was inherited from parental population 0 (*z* = 0) or parental population 1 (*z* = 1). The likelihood of the genetic data from the admixed populations (**x**) is conditional on ancestry and parental population allele frequencies:
2.1Here, *x*_{ijnak} is one if individual *n* has allelic state *k* for allele copy *a* at locus *i* (and diploid individuals have two allele copies at each locus), and *x*_{ijnak} is zero otherwise. When multiple admixed populations are included, *j* denotes the population from which individual *n* was sampled. The likelihood is a product across all loci, admixed populations (if more than one admixed population is included), individuals, and allele copies.

A genomic cline function specifies the prior probability of locus-specific ancestry from parental population 1, which is conditional on hybrid index and the locus-specific genomic cline parameters. This prior probability is denoted as *ϕ*. This probability is a simple function of an auxiliary variable *θ*, but has the imposed biologically meaningful constraints that *ϕ* must be bounded by zero and one and must be a monotonically increasing function of hybrid index. The mathematical details of the transformation between *ϕ* and *θ* can be found in Gompert & Buerkle [23]; here, we simply describe the polynomial function for the auxiliary variable *θ*. This function includes a term for hybrid index (**h**) and the genomic cline parameters *α* and *β* that were described previously (figure 1):
2.2Hybrid index is polarized such that an individual with *h*_{n} = 0 has ancestry only from parental population 0, and an individual with *h*_{n} = 1 has ancestry only from parental population 1.

The genomic cline parameters are allowed to vary by locus (*i*) and admixed population (*j*; if multiple admixed populations are analysed). Therefore, linear random-effect models are specified for *α*_{ij} and *β*_{ij}:
2.3and
2.4

Gompert & Buerkle [23] assumed hierarchical normal priors for the locus and nested population effects for the cline parameters: *γ* ∼ N(0,*τ*_{α}), *ζ* ∼ N(0,*τ*_{β}), *η*_{i} ∼ N(0,*ν*_{i}) and *κ*_{i} ∼ N(0,*ω*_{i}). Each of these hierarchical priors has a mean of zero. The precision parameters for these prior distributions describe the magnitude of variation in patterns of introgression among loci. When analysis is restricted to a single admixed population the nested population effects are dropped from the model, such that *α*_{i} = *γ*_{i} and *β*_{i} = *ζ*_{i}. To complete the Bayesian genomic cline model, uninformative priors are specified for the random-effect precision parameters (*τ*_{α}, *τ*_{β}, *ν* and *ω*) and hybrid index (**h**).

### (b) Intrinsic conditional autoregressive *ρ* prior for linkage

The choice of hierarchical normal priors for *γ*, *ζ*, *η* and *κ* assumes that each *γ*_{i}, *ζ*_{i}, *η*_{i(j)} and *κ*_{i(j)} is an independent sample from a genome-wide distribution (i.e. a normal distribution with mean zero and an estimated precision parameter). In other words, because these priors specify conditional independence for cline parameter random effects, values for *γ*_{i} and *γ*_{i}′ are assumed to be independent given the precision parameter *τ*_{α}. Linkage creates spatial genomic autocorrelation in the evolutionary history of loci [36]. Thus, the assumption of conditional independence is likely to be violated when high-resolution genetic data are used to study genome-wide divergence during speciation, and modelling autocorrelation will be highly relevant for detecting and interpreting outlier loci.

Therefore, we now model autocorrelation owing to linkage by specifying intrinsic conditional autoregressive *ρ* (ICAR*ρ*) priors for the genomic cline parameters. This form of statistical model is commonly used to account for spatial autocorrelation in geographical models [37,38] and is quite similar to the approach proposed by Guo *et al*. [39] to account for correlations owing to linkage when conducting *F*_{ST}-outlier genome scans. For the purpose of this paper, we describe ICAR*ρ* priors for the locus-specific random effects *γ* and *ζ*, but not the nested population effects. The extension of this prior to the nested population effects would be trivial. Accurate estimates of recombination rates between loci are required to implement the ICAR*ρ* model. The conditional forms of the ICAR*ρ* priors for *γ*_{i} and *ζ*_{i} are
2.5and
2.6In these equations *γ*_{[i]} is the collection of *γ*_{i}, for all *i*^{′} ≠ *i*, *ζ*_{[i]} is the collection of *ζ*_{i}, for all *i*^{′} ≠ *i*, *w*_{ii}′ is an entry from an *I* × *I* weight matrix (**W**), and *ρ* is a spatial dependence parameter that is included in part to ensure propriety of the posterior distribution [37,38]. The weight matrix (**W**) defines the expected correlations among genetic regions owing to linkage (genomic proximity). We impose a sum-to-zero constrain on *γ* and *ζ* to ensure identifiability of the parameters. Additional details of the ICAR*ρ* model are provided in the electronic supplementary material (see §S1, ICAR*ρ* model details).

### (c) Designating outlier loci

The Bayesian genomic cline model provides a framework to designate statistical outlier loci, which we define as loci with unlikely or extreme cline parameters given the appropriate hierarchical prior [23]. Outlier loci have aberrant patterns of introgression and might be associated with reproductive isolation [12,23]. We designate outlier loci with respect to genomic cline centre (*α*) or genomic cline rate (*β*). Following Guo *et al*. [39], these can be local (i.e. relative to nearby genetic regions) or global (i.e. relative to the entire genome) outlier loci. Locus *i* is a local outlier with respect to *α* if the posterior point estimate of *α*_{i} is not contained in the interval *q*_{N}, which is defined as the interval bounded by the *N*/2 and (1 – *N*)/2 quantiles of the ICAR*ρ* prior for *α*. Local outliers with respect to *β* are designated likewise. Designating global outliers requires a different reference distribution. Because we impose sum-to-zero constraints on the cline parameter random effects and because the mean genome-wide cline parameter should always be zero, we designate global outliers with respect to zero-centred distributions. Specifically, locus *i* is a global outlier with respect to *α* if the posterior estimate of *α* is not contained in the interval *q*_{N}^{*}, where *q*_{N}^{*} is the interval bounded by the *N*/2 and (1 – *N*)/2 quantiles of *N*(0,*τ*_{α} *w*_{i+}) (likewise for *β*).

## 3. Methods

### (a) Simulations

We simulated genetic data for admixed populations to determine whether different genetic architectures of reproductive isolation left distinct genomic signatures. We have used a related model for simulating admixed populations in other studies [12,23,40]. We were interested in three major contrasts regarding the genetic architecture of reproductive isolation: (i) the number and fitness effect of genetic loci contributing to isolation, (ii) whether isolation was the result of underdominance or epistatic incompatibilities, and (iii) the genomic location and distribution of loci associated with isolation.

We simulated underdominant selection and selection arising from pairwise epistatic interactions. For both forms of selection, we assumed that fitness was multiplicative and was the result of an individual's ancestry at *N*_{s} loci, and that each locus had an equal effect on fitness. Underdominant selection arises when the fitness of the heterozygous genotype is lower than that of either homozgyous genotype. We modelled underdominance by defining the relative fitness of an individual as *f* = 1 – (1 – *s*)^{xs}, where *s* is the reduction in fitness associated with having heterozygous ancestry at a single locus with an effect on fitness. *x*_{s} is the number of loci (out of the *N*_{s} loci) at which an individual had heterozygous ancestry. Epistatic interactions are fundamental for reproductive isolation owing to the accumulation of Bateson–Dobzhansky–Muller (BDM) incompatibilities [41–43], and are thought to play a significant role in the genetics of speciation [42,44,45]. We assumed a simple model of BDM incompatibilities, where interactions are between pairs of loci and the ancestral state at each locus is fitter [44,46]. For each pair of interacting loci, we assumed that parental population 0 is fixed for a derived allele at the first locus (*A*) and parental population 1 is fixed for a derived allele at the second locus (*B*; the ancestral genotype is *aabb*). Combinations of derived alleles at the interacting locus pair are incompatible and cause reduced fitness in hybrids. There are three different potential incompatibilities: homozygous derived × homozygous derived (*H*_{2}), homozygous derived × heterozygous (*H*_{1}) and heterozygous × heterozygous (*H*_{0}; [44]). The fitness effect of each incompatibility depends on dominance. To achieve the greatest effect possible with BDM incompatibilities, we assumed complete dominance of derived alleles *s* = *s*_{H2} = *s*_{H1} = *s*_{H0}, such that an individual's relative fitness was given by (where *x*_{H2} refers to the number of pairs of *H*_{2} interactions, etc.). The simulations modelled admixture between parental populations with fixed allele differences and no demic structure within the admixed population (these are not assumptions of the genomic cline model). We simulate a variety of demographic conditions (table 1; electronic supplementary material, S2 ‘Simulation details’).

### (b) Analyses

To assess the genome-level signature of reproductive isolation, we first summarized hybrid index and interspecific heterozygosity (i.e. the proportion of the genome with alleles at a locus inherited from different parental populations) across all individuals for each simulated dataset, directly from the simulation output. We also quantified mean locus-specific ancestry (i.e. the proportion of allele copies inherited from parental population 1) and locus-specific interspecific heterozygosity for each combination of simulation conditions. These parameters measure the effect of different genetic architectures of isolation directly from known simulation output, independent of the genomic clines model.

In addition to direct quantification of the simulation results, we estimated genomic cline parameters to quantify genome-wide variation in patterns of introgression for different genetic architectures of reproductive isolation. For each simulated dataset, we estimated cline parameters using the Bayesian genomic cline model with ICAR*ρ* priors described in §2. Posterior probability distributions for all model parameters were estimated using Markov chain Monte Carlo (MCMC). Computer software that implements MCMC estimation for the Bayesian genomic cline model parameters was written by the authors using C++ and the GNU Scientific Library [47]. For each simulated dataset, we obtained 22 500 MCMC samples from the posterior probability distribution, which were sampled every other MCMC iteration following a 5000 iteration burnin. We ran a single MCMC chain for each dataset, and visually assessed convergence to the stationary distribution and chain mixing by haphazard inspection of sample history plots. Parameter estimates were based on the median and 95 per cent equal tail probability interval of marginal posterior probability distributions. We designated local and global outlier loci as described in §2*c*, and using *q*_{N} = 0.05 for local outliers and *q*_{N}^{*} = 0.01 for global outliers (we used a more stringent cutoff for global outliers because it was much more common for loci to have extreme parameters relative to the global, zero-centred distribution than a distribution centred on the local mean).

The scale and form of genomic autocorrelations in genome-wide patterns of introgression could provide important information about the genetic architecture of reproductive isolation and effect our interpretation of loci classified as outliers. As a measure of autocorrelation for genomic cline parameters, we calculated Moran's I [48,49] at various recombination distances (measured in centimorgans) according to the following equation:
3.1In equation (3.1), the *r*_{ii}′ are binary variables that take on a value of one if the recombination distance between locus *i* and *i*′ is *k*1< *r*_{ii}′ ≤ *k*2, and otherwise take on a value of 0. *R*_{(k1,k2]} is the sum of all *r*_{ii}′ and *n*_{k} is the total number of loci. *I*_{(k1,k2]} was calculated for *β* in a similar matter. We calculated *I*_{(k1,k2]} for *α* and *β* with *k*1 = [0,50] with 2 cM increments, and *k*2 = *k*1 + 2 (or infinity for *k*1 = 50). Estimates of Moran's I were used to construct correlograms, which are plots of autocorrelation as a function of genetic distance.

## 4. Results

The distribution of hybrid indexes for each simulated admixed population was flat or unimodal (see electronic supplementary material, figure S1). Flat hybrid index distributions were more common when dispersal from the parental populations was high (*m* = 0.2), whereas unimodal distributions were observed more frequently when dispersal was low (*m* = 0.05; compare electronic supplementary material, figure S1*a*–*c* with *e*,*f*). Under most conditions, the mean hybrid index for each admixed population was close to 0.5 (50% ancestry from each parental population), but underdominance with one or several genes of moderately large or very large effect caused the distribution of hybrid indexes to shift towards zero or one (e.g. electronic supplementary material, figure S1*c*; table 1). Mean interspecific heterozygosity in the admixed populations tended to be close to 0.5, but a high dispersal rate from the parental populations increased the variance in interspecific heterozygosity among individuals (electronic supplementary material, figure S1; table 1).

Mean locus-specific ancestry and interspecific heterozygosity varied among genetic regions regardless of the simulated demographic conditions or genetic architectures of isolation (figure 2 and table 1). However, variation in both metrics was most pronounced with underdominance with one or several loci of moderately large or very large effect (e.g. figure 2*c,d*), or BDM incompatibilities with one or many interacting locus pairs of moderately large or very large effect (e.g. figure 2*f*). With these genetic architectures, genetic regions near loci that caused fitness variation had extreme values for mean ancestry (i.e. values with a greater absolute deviation from 0.5) and reduced mean interspecific heterozygosity. Consistent with the distribution of hybrid indexes, underdominance with one or several genes of moderately large or very large effect shifted mean ancestry genome wide.

Similar to the direct observations of ancestry and interspecific heterozygosity in the simulations, estimated genomic cline parameters varied among genetic regions (figure 3). We observed the most genome-wide variation in the genomic cline centre parameter (*α*) with underdominance involving one locus of very large effect (*s* = 0.65) or five loci of moderately large effect (*s* = 0.2) that were located near each other on a single chromosome, with underdominance and 25 generations of admixture, or with BDM incompatibilities involving one or many interacting locus pairs with *s* = 0.2 or *s* = 0.65 (figure 3 and table 1). We obtained similar results for the genomic cline rate parameter (*β*), but genome-wide variation was much greater with underdominance involving five loci of moderately large effect on a single chromosome than any other genetic architecture, and genome-wide variation was modest even after 25 generations of admixture when underdominance was associated with many genes of small effect (*s* = 0.01, *N*_{s} = 110). Genome-wide variation in *β* with BDM incompatibilities was also generally low, except with 20 interacting locus pairs of moderately large effect (*s* = 0.2). In general, genome-wide variation in genomic cline parameters was lower with higher dispersal rates from parental populations. The lowest levels of genome-wide variation for both *α* and *β* were associated with weak isolation overall or a diffuse genomic architecture of isolation (i.e. genes causing reduced fitness were spread evenly throughout the genome; figure 3 and table 1).

The total number of statistical outlier loci varied considerably for simulations conducted with different demographic conditions and genetic architectures of isolation (figure 3; electronic supplementary material, table S1). For example, high dispersal from parental populations (*m* = 0.2) led to very few outliers (a total of 10 (*s* = 0.01, *N*_{s} = 110) or 15 (*s* = 0.2, *N*_{s} = 5) across 10 replicate simulations), whereas over 449 outlier loci were detected for underdominance with *s* = 0.01, *N*_{s} = 110, *m* = 0.05 and *g* = 25. Outlier loci were more common with respect to *α* than *β*, and global outliers were much more common than local outliers (no local *β* outliers were detected; electronic supplementary material, table S1). The correspondence between outlier loci and genetic regions affecting fitness varied among simulation conditions and cline parameters. We define true outliers as loci within 10 cM of genetic regions contributing to isolation, however, this designation is arbitrary and is not meant as a general rule of thumb. Given this definition, global outlier loci were approximately 10 times more likely to be true outliers than false outliers with underdominance involving a single selected locus of very large effect (*s* = 0.65, *N*_{s} = 1, *N*_{ch} = 1 and *m* = 0.05, *g* = 10), five loci of moderately large effect and 25 generations of admixture (*s* = 0.2, *N*_{s} = 5, *N*_{ch} = 1, *m* = 0.05 and *g* = 25), five loci of moderately large effect that co-occurred on a single chromosome (*s* = 0.2, *N*_{s} = 5, *N*_{ch} = 5, *m* = 0.05 and *g* = 10), or BDM incompatibilities with two interacting loci of very large effect (*s* = 0.65, *N*_{s} = 2, *m* = 0.05 and *g* = 10; electronic supplementary material, table S1). However, for other simulated demographic conditions or genetic architectures global outlier loci were only about twice as likely to be true outliers rather than false outliers (e.g. underdominance with *s* = 0.2, *N*_{s} = 5, *N*_{ch} = 1, *m* = 0.05 and *g* = 10, or BDM incompatibilities with *s* = 0.2, *N*_{s} = 20, *N*_{ch} = 2, *m* = 0.05 and *g* = 10). Finally, when isolation was the product of genes with little individual effect, it was common for false and true outliers to be equally common (electronic supplementary material, table S1). Local outliers with respect to *α* were rare, but were nearly always true outliers (only three false outliers were recorded).

Genomic autocorrelation for cline parameters *α* and *β* was affected by demographic conditions and the genetic architecture of isolation (figure 4; electronic supplementary material, figure S2). Genomic autocorrelation (measured by Moran's I) was often greater than 0.9 and close to one for genetic regions separated by 2 cM (table 1). This result indicates that nearby genetic regions generally had similar patterns of introgression. We found considerable variation among model conditions for the rate at which genomic autocorrelation decreased with increasing recombination distance. For example, the mean genetic distance required for genomic autocorrelation to drop below half of *I*_{(0cM,2cM]} (denoted as *I*_{1/2}) was 32.8 (*α*) or 48.2 cM (*β*) for underdominance involving five linked loci with *s* = 0.2, but was only 16.2 (*α*) or 20.8 cM (*β*) for underdominance with five unlinked loci of weak effect (*s* = 0.01; table 1). Moreover, with some genetic architectures, genomic autocorrelation rapidly approached zero or took on negative values as the genetic distance between loci increased (e.g. or BDM incompatibilities with 20 loci of moderately large effect), whereas we observed values of *I* > 0.5 for all genetic distances less than 50 cM for other simulation conditions (e.g. *α* for underdominance involving one locus with *s* = 0.65; figure 4; electronic supplementary material, figure S2). Negative genomic autocorrelations arise because of epistatically interacting loci and because *α* and *β* parameters are constrained to sum to zero across loci.

## 5. Discussion

The results indicate that different genetic architectures of reproductive isolation often, but not always, leave distinct genomic signatures in admixed populations. We observed increased heterogeneity in patterns of introgression among genetic regions when reproductive isolation was caused by one or several genes with moderately large to very large fitness effects, rather than many genes with small fitness effects. Similarly, genome-wide heterogeneity in the rate and form of introgression was higher when genetic regions affecting fitness were clumped together, rather than spread out across a greater number of chromosomes. For comparable selection intensities, underdominance and BDM incompatibilities caused similar levels of genomic heterogeneity in patterns of introgression, although the former had a greater effect on genomic cline rate (*β*). Patterns of genomic autocorrelation in cline parameters were complicated, with different genetic architectures often leading to moderately to strikingly different rates of change in autocorrelation with genetic distance. Genetic architectures of isolation based on underdominance with the fewest loci contributing to isolation, particularly those with a single locus of very large effect, or BDM epistasis with interacting locus pairs on different chromosomes caused the greatest long-range genomic autocorrelation in cline parameters, whereas BDM epistasis with interacting locus pairs on the same chromosome resulted in some of the lowest long-range genomic autocorrelations (because of selection for different parental ancestry at each locus).

BDM incompatibilities caused greater genomic heterogeneity in *α* (genomic cline centre) than *β* (genomic cline rate) and the same was often true of underdominance. This observation highlights an important distinction between geographical and genomic introgression. The geographical extent of introgression is reduced for loci causing BDM incompatibilities or underdominance [9,50]. Under a few conditions, we obtained a similar pattern whereby alleles inherited from one population did not introgress into the alternative genomic background (positive *β*). However, the observed genomic heterogeneity in *α* indicates that, within the admixed population, introgession of loci causing BDM incompatibilities or underdominance was often extreme and ancestry blocks from one of the parental populations achieved a high frequency across the hybrid index gradient at the expense of ancestry blocks from the other parental population. With underdominant selection, *α* for selected loci tended to deviate in a single direction from expectations based on genome-wide admixture (i.e. all positive or all negative). Whether ancestry from parental population 0 or 1 was favoured varied stochastically among simulations. This finding indicates that drift during the first few generations affected the genomic composition of the admixed population, which influenced whether alleles inherited from parental population 0 or 1 had a higher marginal fitness. With BDM incompatibilities, positive values of *α* were often associated with the first locus of each pair and negative values of *α* were often associated with the second locus of each pair. This pattern suggests selection increased ancestry associated with the ancestral alleles (*a* and *b*) at each locus. This should not be surprising as the ancestral alleles had higher marginal fitness than the derived alleles (*A* and *B*), which causes a form of directional selection. When considering a single locus, underdominant selection and BDM incompatibilities can affect *α* in the same manner as directional selection [23], which suggests it might not be possible to discern the specific form of selection affecting a locus. Thus, previous attempts (including our own) to associate a specific form of selection (i.e. underdominance, epistasis or directional selection) with a locus based on the sign or magnitude of cline parameters were perhaps overly simplistic and naïve [12,26,27,29].

The simulation results indicate that the genomic signature of reproductive isolation is also profoundly affected by demographic conditions. For example, relative to *m* = 0.05, increased dispersal from parental populations (*m* = 0.2) caused reduced heterogeneity in patterns of introgression, but slightly increased long-range genomic autocorrelations, particularly for genomic cline rate parameter *β*. This is not surprising, as increased dispersal should retard shifts in ancestry owing to selection in admixed individuals, and inject parental chromosomal blocks and increase admixture linkage disequilibrium [15,40,51]. Conversely, relative to 10 generations of admixture, 25 generations of admixture caused increased heterogeneity in patterns of introgression across the genome and reduced long-range genomic autocorrelations. Again, this result could be expected, as an increased number of generations allows more time for recombination in admixed individuals to break up parental haplotype blocks. These results demonstrate that our ability to map the genetic architecture of isolation in a hybrid zone very clearly depends on its demographic history.

Demographic conditions also affected genome-wide admixture. Although the form and strength of isolation can affect the distribution of hybrid indexes in an admixed population, we found that this distribution was altered to a great extent by the dispersal rate from parental populations. With high dispersal, we generally observed a flat (uniform) distribution of hybrid indexes (presumably higher dispersal rates would have led to a bimodal distribution), whereas low-dispersal rates resulted in a unimodal distribution. Although these results are consistent with the hypothesis that bimodal hybrid zones are associated with nearly complete (prezygotic) reproductive isolation [7], they suggest that this might be an oversimplification. Rather, the results indicated that the modality of a hybrid zone is likely a function of the strength and form of isolation, as well as the dispersal rate from (or geographical overlap of) parental populations. High dispersal rates or geographical overlap of parental populations coupled with strong isolation should yield a bimodal hybrid zone, whereas even with strong isolation a geographically isolated admixed population will often have a unimodal distribution of hybrid indexes, which might be centred well away from 0.5. Thus, it is difficult to draw inferences about the nature of isolation based on the observation of a unimodal hybrid zone. Interpretation would require an understanding of the geographical context in which the unimodal hybrid zone occurs.

The results suggest that our ability to identify genetic regions associated with reproductive isolation in admixed populations based on variable introgression depends on population demography and the genetic architecture of isolation. Specifically, when reproductive isolation is caused by a modest number of genetic regions with moderate to large effects on fitness, global outliers are enriched for genetic regions associated with isolation. The same is true for local outliers, when they occur. However, when isolation is caused by genes with small fitness effects (whether few or many genes), outliers are at best marginally enriched for genetic regions associated with isolation. An optimistic assessment of these results is that variable introgression can be used most effectively to map the genetic architecture of isolation when the identity and location of genes contributing to isolation would be most interesting (i.e. when these genes have moderate to large effects on fitness). Even under these circumstances, it will be difficult to identify the true loci underlying reproductive isolation as their effect can extend over large portions of a chromosome. When patterns of introgression are relatively homogeneous, little confidence should be placed in individual outliers. But this pattern, if coupled with independent knowledge indicating moderate to strong overall isolation between hybridizing populations, would provide preliminary support that isolation is the result of many genes with small effects on fitness. Finally, the results show that a greater number of generations of admixture can radically increase the correspondence between outlier loci and the genetic architecture of isolation, whereas high rates of dispersal from parental populations can retard evolution in the admixed population to the extent that outliers are very rare, making the genetic architecture of isolation difficult to map.

We simulated complicated genetic architectures of isolation, but the genetic architecture of isolation in many natural hybrid zones is likely more complex. For example, documented genome-wide variation in genomic clines from natural populations certainly tends to be greater than that observed from these simulated data sets [23,27,29]. Moreover, each simulation involved a single form of selection and equal fitness effects for all selected loci. Neither of these assumptions is expected to hold in natural populations. Nonetheless, we have uncovered considerable complexity in patterns of introgression that result from even these relatively simple simulations. This suggests that the study of the genomics of isolation in hybrids will benefit from additional theoretical work to further develop our expectations for the genomic consequences of different isolating barriers. The complexity also suggests that specific evolutionary causes of heterogeneous introgression among loci should be inferred with caution.

Our findings point to both promise and potential limitations for discovering the genomics of speciation by studying admixed populations. Further progress will require genome-wide studies of introgression in hybrid zones, coupled with appropriate model-based analyses. Indeed, genome-wide sequence data have recently been published for several species known to hybridize in nature [52–54]. For example, Nadeau *et al.* [55] report genetic divergence between hybridizing lineages of *Heliconius* butterflies associated with colour pattern genes. These and other systems are well suited for genome-wide surveys of introgression in hybrid zones, which will complement studies of genome-wide divergence.

With increasing sequence coverage of genomes, model-based population genomic analyses of hybrid zones will need to account for linkage among genetic regions. We have proposed one way of doing this by specifying ICAR*ρ* priors for cline parameters. Hidden Markov models (HMM), which explicitly model correlated parameter states or evolutionary models along a chromosome, provide an alternative method [36,56,57]. It would be possible to introduce a HMM for the genomic cline parameters (*α* and *β*) with a finite number of model states. Falush *et al.* [58] proposed a Markov model for ancestry along a chromosome in admixed individuals. This model could conceivably be combined with ICAR*ρ* priors for cline parameters to model correlated patterns of introgression and admixture linkage disequilibrium in a complementary manner. We intend to explore this possibility in future work.

Finally, interpreting genetic patterns in the context of the complex genomic environment in which they are embedded (physical linkage relationships, density of genes and other functional regions, etc.) will advance our understanding of speciation. For example, a study of genomic divergence and local and long-distance linkage disequilibrium in threespine stickleback populations demonstrates the utility of analysing associations among loci, which leads the authors to suggest a new model for evolution in sticklebacks [59]. Similarly, our simulation results indicate that statistical measures of association across the genome (our genomic autocorrelations) capture aspects of genome introgression that have previously received little attention in the genomics of speciation, but that might support critical future insights.

## Acknowledgements

We thank Patrik Nosil and Jeff Feder for organizing this special issue and making this contribution possible. We thank two anonymous reviewers, Patrik Nosil and Jeff Feder for comments that improved this manuscript. We also thank Jeffrey Lang and the Department of Geology and Geophysics (University of Wyoming) for access to the Seismic computer cluster. This work was supported by NSF DDIG 1011173 to Z.G. and NSF DBI 0701757 to C.A.B.

## Footnotes

One contribution of 13 to a Theme Issue ‘Patterns and processes of genomic divergence during speciation’.

- This journal is © 2011 The Royal Society