Abstract

Genome scans have become a common approach to identify genomic signatures of natural selection and reproductive isolation, as well as the genomic bases of ecologically relevant phenotypes, based on patterns of polymorphism and differentiation among populations or species. Here, we review the results of studies taking genome scan approaches in plants, consider the patterns of genomic differentiation documented and their possible causes, discuss the results in light of recent models of genomic differentiation during divergent adaptation and speciation, and consider assumptions and caveats in their interpretation. We find that genomic regions of high divergence generally appear quite small in comparisons of both closely and more distantly related populations, and for the most part, these differentiated regions are spread throughout the genome rather than strongly clustered. Thus, the genome scan approach appears well-suited for identifying genomic regions or even candidate genes that underlie adaptive divergence and/or reproductive barriers. We consider other methodologies that may be used in conjunction with genome scan approaches, and suggest further developments that would be valuable. These include broader use of sequence-based markers of known genomic location, greater attention to sampling strategies to make use of parallel environmental or phenotypic transitions, more integration with approaches such as quantitative trait loci mapping and measures of gene flow across the genome, and additional theoretical and simulation work on processes related to divergent adaptation and speciation.

1. Introduction

As research on speciation genetics broadens to include speciation genomics, a number of models have been proposed to explain genomic patterns of differentiation between populations or species. These include the notion of ‘genomic islands’ of speciation [1,2], in which differentiation is maintained via divergent selection or reproductive isolating barriers in portions of the genome while the remainder is permeable to gene flow; and divergence hitchhiking [3,4], in which divergently selected loci can act as nuclei for increased differentiation at linked neutral sites. More generally, information about the proportion of the genome contributing to divergent adaptation and isolation, as well as the nature of the genes and genomic regions involved, is fundamental to our understanding of the process of speciation. It informs theories of the geography and temporal dynamics of speciation, the role of gene flow in preventing or enhancing adaptive divergence, and the relative contributions of selection and drift to divergence and reproductive isolation [5].

Empirical methods for examining patterns of genomic divergence have dramatically expanded in the last decade, on the heels of technological advances that have vastly increased the extent of genomic coverage possible in marker-based studies. Rapid declines in the costs of sequencing and genotyping technologies make it feasible to assay hundreds or thousands of loci for essentially any organism, and for better-characterized species, sequence-based markers can be aligned to high-density genetic maps or sequenced genomes. One increasingly common approach is to compare genomic patterns of diversity within populations and/or divergence between populations at these markers, a process commonly called a genome scan, in order to identify genomic regions that do not conform to expectations based on some neutral demographic model. These regions are then considered candidates to contain one or more loci under selection, and they can be investigated in further detail in order to identify the precise target, mode and genomic or phenotypic consequences of selection.

Here, we consider genome scan and related studies that examine genomic patterns of adaptation, divergence and reproductive isolation, drawing particularly on empirical examples in plant populations or species. The main goals of this review are to (i) discuss factors that may influence patterns of genomic divergence; (ii) consider how these factors may impact the efficacy of genome scan approaches, and how best to control them; and (iii) summarize genome scan and related studies that have been performed in plants, and discuss conclusions that can be drawn from them.

2. Patterns of genome divergence and their interpretation

The genome scan approach was first explicitly described by Lewontin & Krakauer [6], who noted that all loci should be similarly affected by demographic processes, whereas selection should only affect a subset of loci. Accordingly, given an expected neutral distribution of divergence values, outliers can be considered candidate markers for genomic regions under selection. However, theoretical and empirical work suggests that numerous factors can complicate this simple expectation, and the resulting inferences that can be drawn from genome scan studies [7–9]. These factors include the biogeographic history of contact between populations (especially whether and when gene flow occurred between them), the time frame and strength of selection acting on differentiating loci, the genetic architecture of traits under selection, and the population structure and demographic history of populations being analysed [10]. These factors can influence both the expected patterns of genomic divergence, and the ability of genome scan approaches to identify loci involved in adaptive differentiation or speciation. Although a complete review is beyond the scope of this paper, we briefly outline some of these considerations here.

(a) Expected patterns of genomic divergence

The interpretation of genome scans depends, in part, on understanding the patterns of genomic differentiation that are expected under realistic biological scenarios. In some cases there are well-developed expectations; however, in others the geographical, temporal and selective details can make clear predictions difficult. For example, patterns of genomic divergence are expected to be strongly influenced by levels of gene flow as well as the strength of divergent selection at individual loci. When speciation begins in sympatry or parapatry, ongoing gene flow can homogenize most of the genome, resulting in genomic divergence at only the one or few loci experiencing strong divergent selection. In contrast, allopatric speciation does not require strong divergent selection at a small number of loci, and a larger portion of the genome may diverge through a combination of divergent selection, differential response to similar selective pressures and genetic drift [2].

In addition, in the presence of gene flow, it has been suggested that additional divergence may build up around the loci subjected to strong divergent selection because of reduced effective recombination rates around these loci [11]. This could potentially result in clustering of loci contributing to reproductive isolation or adaptive divergence and an increase in the size of divergent regions through time, a phenomenon that has been referred to as divergence hitchhiking (hereafter DH; [4]). Using simulations, Feder & Nosil [3] found that DH could potentially play an important role under some circumstances, especially during intermediate stages of speciation when the number of loci under selection is fairly small. However, as they point out, their simulations dealt with equilibrium conditions in which accumulation of additional adaptive divergence near already selected sites—a key aspect of DH—was not considered. Feder et al. [12] expanded their simulations to include the possibility of new selected mutations accumulating, and again found that DH was restricted to a limited range of conditions. Instead, they suggested that additional adaptive divergence was more strongly influenced by interactions between selection and migration than by linkage to established adaptively divergent sites.

While these simulations are informative, it would be valuable to consider several additional factors that may influence patterns of genomic divergence, including the effects of a period of allopatry prior to secondary contact, population structure within species and asymmetric selective effects. We suspect that the latter two factors will reduce the efficacy of DH, especially with respect to divergence at linked neutral sites.

Empirical evidence for DH is currently limited. This is likely partly owing to the difficulty of distinguishing the effects of DH from other processes of genomic divergence. For example, recent admixture can yield large islands of divergence [13] that might be misinterpreted as evidence of DH. On the other hand, evidence for DH might be missed because genomic regions affected by DH may include low and high FST markers owing to variance in coalescence times [14]. Via [14] suggests the following two predictions of DH: (i) low-FST markers in DH regions will have lower gene flow than low-FST markers outside of DH regions and (ii) low-FST markers in DH regions will more closely reflect species relationships than those outside of DH regions. However, it is not clear to us that either prediction distinguishes between a single large region of genomic isolation and multiple smaller independent regions that individually encompass one or more low FST markers. An alternative and potentially more powerful approach for estimating the impact of gene flow on genomic differentiation would be to compare sympatric and allopatric patterns of divergence in closely related taxa.

Expected patterns of genomic divergence between species that initially diverged in allopatry and later come back into secondary contact are likely to be even more complex. If the period of allopatry is brief, patterns of divergence are likely to be similar to initial divergence with gene flow [4]. However, if the period of allopatry is longer, and especially if the populations face divergent ecological conditions, patterns of introgression upon secondary contact are harder to predict and will depend critically on the strength and genomic architecture of adaptive divergence and isolation, as well as the duration and spatial pattern of secondary contact. Long-term equilibrium patterns of genomic divergence following secondary contact may be indistinguishable from those expected with divergence in the presence of gene flow [15]. However, it may take many generations of contact to reach such equilibrium.

Finally, the degree to which adaptation occurs from standing variation versus newly arisen variants (sometimes referred to as soft selective sweeps versus hard selective sweeps) also affects patterns of genomic divergence. Adaptation from standing variation is likely to occur under a wide range of circumstances; in some cases, it may be the predominant source of adaptive variation, especially when population mutation rates are high [9]. Importantly, a newly arisen variant is initially found on a single genetic background, and if it is swept rapidly to fixation, there will be a strong signal of hitchhiking, with little variation at linked sites and high divergence between populations in different selective environments. In contrast, if a variant that has segregated long enough to recombine onto multiple genetic backgrounds is driven to fixation, the decrease in diversity at linked sites will be less pronounced. And if divergent adaptation occurs through moderate changes in allele frequencies at multiple sites, it is likely that none of these sites will exhibit substantial divergence between populations or species [16].

(b) The efficacy of genome scans and their interpretation

Clearly, patterns of genomic divergence are influenced by a range of geographical, temporal, demographic and selective factors. Because of these effects, analyses of genome scan data can potentially provide over-, under- or biased estimates of the number and type of loci involved in adaptation and speciation, and the kinds of selective processes or genetic architectures involved.

To date, the most attention has been paid to the effects of population structure on outlier detection, as it can potentially result in an increase in false-positives. Early simulations implied that genome scans should be robust to a number of equilibrium and non-equilibrium population structures and histories as well as among-locus mutation rate variation [17,18]. However, more recent simulation studies have presented a less-positive picture, suggesting that population structure should be considered in more detail. For example, Excoffier et al. [7] found that hierarchical population structure could cause a substantial increase in the rate of false-positives if not accounted for, and Foll & Gaggiotti [8] showed that the inclusion of isolated populations that have experienced strong bottlenecks can also result in increased false-positives.

In some cases, remedies for these demographic complications might be relatively straightforward. For example, sampling designs that minimize population structure and/or sample multiple habitat or trait transitions could reduce the likelihood of false-positives. Likewise, the development of analytical methods that are more robust to violations of demographic assumptions would be valuable; ideally, these methods would directly incorporate an estimated demographic history instead of having a simpler history imposed [8], and/or explicitly include demographic history in the estimation process [19]. Because of the difficulties involved when the demographic history of a population is unknown, in some instances, it may be preferable to take an empirical approach [20] and rank the genes within the genome on the basis of pair-wise genetic diversity or genetic distance. This empirical approach is probably most useful in high-density scans, where there are many candidates and where the primary goal is the identification of genes with a recent history of selection.
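The empirical ranking approach can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any cited study: the function name, the locus names, the FST values and the 20 per cent tail are all hypothetical, and any per-locus statistic (for example, pair-wise diversity) could be substituted for FST.

```python
from typing import Dict, List


def empirical_outliers(stat_by_locus: Dict[str, float], tail: float = 0.05) -> List[str]:
    """Return the loci in the top `tail` fraction of the ranking (highest values first)."""
    if not 0.0 < tail < 1.0:
        raise ValueError("tail must be strictly between 0 and 1")
    # Rank loci from most to least divergent; no neutral model is assumed.
    ranked = sorted(stat_by_locus, key=stat_by_locus.get, reverse=True)
    # Keep at least one candidate when the input is non-empty.
    n_keep = max(1, int(len(ranked) * tail))
    return ranked[:n_keep]


# Hypothetical per-locus FST values, for illustration only.
example = {
    "locA": 0.02, "locB": 0.05, "locC": 0.61, "locD": 0.03, "locE": 0.04,
    "locF": 0.01, "locG": 0.07, "locH": 0.02, "locI": 0.33, "locJ": 0.06,
}
candidates = empirical_outliers(example, tail=0.2)
```

Because the cut-off is a fixed fraction of the ranking, this approach always returns candidates, whether or not any locus is actually under selection; the candidates are best treated as a prioritized list for follow-up rather than as statistically significant outliers.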

In addition to demographic effects, the number of loci under selection can influence the interpretation of genome scan results. Most genome scan methods assume, at least implicitly, that relatively few loci are under selection. For example, the baseline FST upon which simulations are contingent is generally assumed to reflect neutral divergence. In many cases, this may be a reasonable assumption; but if selection is pervasive, it is clearly not. For example, Michel et al. [21] found that the majority of markers were linked to loci under divergent selection between host races of Rhagoletis pomonella, but very few were identified in an outlier analysis. Likewise, genomic admixture studies have shown that large proportions of the genome can be under selection (see below). The fact that ‘neutral’ FST estimates tend to be fairly low in plant systems studied so far (table 1 and electronic supplementary material, S1) may suggest that this phenomenon is rare; however, FST estimates between R. pomonella host races were also found to be quite low despite widespread divergent selection.

Table 1.

Summary of representative studies taking genome scan approaches in plants. In some cases, different sets of outlier loci were identified based on different population comparisons, methodologies or significance cut-offs; we have reported the outliers the authors considered most relevant or promising where possible. Outliers are of high divergence unless otherwise specified; DOA, divergence outlier analysis; SFS, site frequency spectrum; AFLP, amplified fragment length polymorphism; QTL, quantitative trait locus; BDM, Bateson–Dobzhansky–Muller; SNP, single nucleotide polymorphism; CAPS, cleaved amplified polymorphic sequence; RAPD, random amplified polymorphic DNA. Additional details are in electronic supplementary material, S1.

The prevalence of soft versus hard selective sweeps also impacts the efficacy of genome scans because soft sweeps leave a less pronounced mark on patterns of genetic diversity and divergence (discussed above), and are therefore less likely to be detected with genome scan methods [16,49]. Traits are also expected to differ in their likelihood of being affected by soft versus hard sweeps depending on their genetic architecture [49]. Soft sweeps are more likely to involve polygenic traits or mutations of smaller effect [16,50]. In addition, for polygenic traits significant phenotypic evolution can occur without large allele frequency changes at any locus if covariances among loci contribute substantially to phenotypic variance [51]. Thus, loci identified in genome scans for selection probably reflect an unrepresentative subset of traits whose genetic architecture lends itself more easily to such detection [49]. A number of methods have been suggested to distinguish adaptation from standing variation versus new mutations, but they often require restrictive assumptions about population demography and the strength and origin of selective sweeps and are difficult to apply in many circumstances [50].

Another important consideration is the metric employed for outlier detection. Most analytical methods, theoretical treatments and empirical studies have focused on FST, which is a widely used index of differentiation. However, FST depends on both within- and between-population variation and thus the precise cause of FST outliers can be difficult to infer. Other metrics such as reduced genetic diversity or absolute allele frequency differences will better diagnose recent selection. The former can also provide insights into the direction of selection [52]. On the other hand, if the interest is in the genetic architecture of reproductive isolation, then coalescent estimates of unidirectional migration rates will be superior to FST in detecting regions of the genome with low and/or asymmetric migration rates, as may be the case when hybridizing populations are isolated by Dobzhansky–Muller incompatibilities [24].
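The dependence of FST on within-population variation can be made concrete with a small Python sketch. It assumes the simplest possible case (one biallelic locus, two populations of equal size, known allele frequencies, no sampling correction) and uses the heterozygosity-based definition FST = (HT − HS)/HT; the function name and the allele frequencies are hypothetical.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """FST for one biallelic locus in two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    # Mean expected heterozygosity within populations.
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0.0:
        # Both populations fixed for the same allele: no variation to partition.
        return 0.0
    return (ht - hs) / ht


# Three hypothetical loci with the same absolute frequency difference (0.4).
intermediate = fst_two_pops(0.7, 0.3)   # high diversity in both populations
skewed = fst_two_pops(0.9, 0.5)         # reduced diversity in one population
near_fixed = fst_two_pops(1.0, 0.6)     # one population fixed
```

In this setting HT − HS equals (p1 − p2)²/2, so the numerator is identical for all three loci; FST nonetheless rises from 0.16 to about 0.19 to 0.25 as within-population diversity falls. The same absolute allele frequency difference therefore yields different FST values, which is why an FST outlier alone does not reveal its cause.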

In addition to these sampling and analytical issues, the efficacy of genome scans can be impacted by the kinds and density of molecular markers employed and knowledge of the genomic location of these markers. The importance of marker density is obvious—many selected loci will be missed in low-density scans and the ability to identify the actual genes or sites that are under selection is greatly diminished. In addition, we find it worrisome that estimates of the proportion of outlier loci and the size of outlier regions tend to be larger in low-density than in high-density scans, perhaps implying that the former are not sufficiently conservative (electronic supplementary material, S2). The value of information on the genomic location of markers is self-evident as well because genome scans are likely to reveal multiple outliers corresponding to a single selective event. However, shared causation would not be evident without a high-density map or sequenced genome.
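When map positions are available, one simple way to recognize outliers that may share a cause is to merge neighbouring outlier markers on the same linkage group into candidate regions. The Python sketch below illustrates this under stated assumptions: the marker names, linkage groups, positions and the 5 cM gap threshold are hypothetical, and proximity on the map is only a proxy for shared causation.

```python
from typing import List, Optional, Tuple

# (marker name, linkage group, map position in cM)
Marker = Tuple[str, str, float]


def group_outliers(markers: List[Marker], max_gap: float) -> List[List[str]]:
    """Merge outlier markers lying within `max_gap` of the previous marker
    on the same linkage group into a single candidate region."""
    regions: List[List[str]] = []
    prev_group: Optional[str] = None
    prev_pos = 0.0
    # Sort by linkage group, then by position, so neighbours are adjacent.
    for name, group, pos in sorted(markers, key=lambda m: (m[1], m[2])):
        if group == prev_group and pos - prev_pos <= max_gap:
            regions[-1].append(name)   # extend the current region
        else:
            regions.append([name])     # start a new region
        prev_group, prev_pos = group, pos
    return regions


# Hypothetical outlier markers, for illustration only.
outliers = [
    ("m3", "LG1", 60.0),
    ("m1", "LG1", 10.0),
    ("m4", "LG2", 11.0),
    ("m2", "LG1", 12.5),
]
regions = group_outliers(outliers, max_gap=5.0)
```

Here four outlier markers collapse into three candidate regions, so the count of independent selective events would be overstated if markers were tallied without reference to the map. The appropriate gap threshold depends on marker density and local recombination rate and would need to be justified for each system.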

Genome scans are a relatively new approach, so it is to be expected that more realistic and appropriate models and analytical frameworks will continue to be developed. However, even with such advances, it is clear that the best inferences will be from those systems in which there are additional, independent, data on factors such as the genetic architecture of differentiation, the history of population contact and gene flow, the relevant selective forces, and/or the structure and demographic history of populations being examined [53]. Such data can be used, for example, to arbitrate on results of genome scans that would otherwise be ambiguous or potentially biased. With these considerations in mind, below we discuss the current literature on genome scans in plant systems, and then suggest further integrated approaches that can aid in understanding and interpreting genome scan data.

3. Genome scans and related studies in plants

Based on our summary of a number of recent plant genome scan studies (table 1 and electronic supplementary material, S1), the proportion of loci identified as outliers ranges from 0.4 to 35.5 per cent, with an average of 8.9 per cent. These numbers are comparable with an earlier survey that focused more on animal systems [2] and found that an average of 8.5 per cent of loci were outliers (range 0.4–24.5%). However, as Nosil et al. [2] point out, comparisons across studies are difficult because of the variety of analytical approaches used and the range of significance cut-offs used by different researchers. For example, De Carvalho et al. [28] used a relatively liberal false discovery rate of 10 per cent for identifying outliers, but only considered loci that were outliers according to two different methods that model different aspects of divergence; in contrast, Richardson et al. [26] considered loci significant outliers if they were above the 0.995 quantile. Likewise, multiple comparisons are often made among replicate populations or at different geographical scales; in some cases, only loci found to be outliers in more than one comparison are considered significant [28], whereas others view unique loci as potentially reflecting local adaptation in specific population comparisons. In addition, researchers often remove some proportion of highest and lowest divergence loci before estimating the putatively neutral genome-wide FST value that is used in simulations [54]. However, the proportion trimmed appears somewhat arbitrary; different researchers trim different amounts (usually 20–30%, although as low as 0.5%) from each tail of the distribution, while some do not trim at all. Nosil et al. [55] found that varying the amount trimmed between 10 and 30 per cent had a modest impact on the estimated ‘neutral’ FST and little to no impact on the outlier loci identified; however, Caballero et al. [56] found in a simulation study that the use of a trimmed mean FST resulted in substantially more outlier loci compared with non-trimmed mean FST. In any given study, the impact will depend on the overall distribution of FST values, and as such is difficult to predict. Nonetheless, to the extent that it is possible to compare across studies, we note that results in plant systems are quite similar to those reported in animal systems in terms of the numbers of outlier loci identified.
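The effect of trimming on the baseline can be illustrated with a short Python sketch. It is a simplified stand-in for the procedures used in the studies cited above, not a reimplementation of any of them: the function name and the FST values are hypothetical, and an unweighted mean across loci is assumed.

```python
from typing import Sequence


def trimmed_mean_fst(fst_values: Sequence[float], trim: float) -> float:
    """Mean FST after dropping the fraction `trim` of loci from each tail."""
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in the interval [0, 0.5)")
    if not fst_values:
        raise ValueError("at least one FST value is required")
    ordered = sorted(fst_values)
    k = int(len(ordered) * trim)          # number of loci dropped per tail
    kept = ordered[k:len(ordered) - k]    # central portion of the distribution
    return sum(kept) / len(kept)


# Hypothetical per-locus FST values with a heavy upper tail.
fst = [0.01, 0.02, 0.03, 0.04, 0.05, 0.05, 0.06, 0.07, 0.40, 0.60]
untrimmed = trimmed_mean_fst(fst, trim=0.0)
trimmed = trimmed_mean_fst(fst, trim=0.2)
```

With these made-up values, the untrimmed mean is about 0.13, whereas trimming 20 per cent from each tail gives a baseline of about 0.05. A lower baseline makes more loci appear exceptional in subsequent simulations, which is one way trimming could inflate outlier counts when the FST distribution is strongly skewed; with a more symmetric distribution the two baselines would differ much less.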

In addition to identifying outlier loci, researchers in a number of studies have attempted to identify associations between these loci and climatic or phenotypic traits thought to potentially be adaptively relevant. For example, a number of studies have found that outliers among populations in several tree species are often significantly associated with aridity or temperature, especially minimum seasonal temperatures [26,41].

Comparisons between potentially adaptive phenotypic divergence among populations and divergence at putatively selected versus neutral loci may be a useful source of additional confirmation. For example, Herrera & Bazaga [33] identified nine loci as high-divergence outliers among populations of the violet species Viola cazorlensis, and found that divergence in floral traits potentially under pollinator-mediated selection was significantly associated with divergence at the outlier loci but not the remaining putatively neutral loci; they interpreted this as evidence of adaptive divergence for floral traits at these loci. Alternatively, these loci may reflect local adaptation at some other unknown trait. A similar approach has been taken in comparing associations with different habitats to which populations are potentially adapted at outlier versus neutral loci (reviewed in [2]).

The ability to associate differentiated loci with potentially functionally important variation also allows a more concrete set of conclusions to be drawn from genome scan data. In the most genomically comprehensive scans to date in plants, Turner et al. [46] used an Arabidopsis thaliana tiling array and high-throughput sequencing of pooled population samples [47] to investigate the genomic basis of adaptation to serpentine soils in Arabidopsis lyrata. Using the A. thaliana genome as a reference, they were able to show that outlier loci between populations on serpentine versus non-serpentine soils were enriched for genes involved in ion transport and metal detoxification spread throughout the genome. They also found that differentiation between serpentine and non-serpentine soils involves several large (10–80 kb) duplications or deletions on multiple chromosomes, containing numerous genes with known or suspected roles in protection from toxic compounds. Finally, using replicate serpentine and non-serpentine populations in Scotland and the USA, they found evidence for both parallel and convergent evolution in three highly differentiated genes involved in ion transport and metal detoxification [47].

Other genome scan studies have focused on genomic patterns of reproductive isolation and species differences. For example, Yatabe et al. [22] used 108 mapped microsatellites to examine patterns of divergence between broadly sympatric, hybridizing sunflower species, Helianthus annuus and Helianthus petiolaris. They identified five outlier loci (4.6%), but found no significant association between levels of genetic divergence and proximity to previously identified quantitative trait loci (QTLs) for either species differences or sterility. Based on this, and on the overall low levels of genomic differentiation between the two species, they concluded that the unit of isolation between the two species is probably quite small, with much of the genome permeable to gene flow; other studies using different datasets have confirmed this result [23,57]. However, a study of a parapatric species pair, H. annuus and Helianthus debilis, showed that markers near QTLs for phenotypic differences and hybrid sterility introgressed at lower than expected rates. The exceptions were two markers near QTLs for traits at which admixed populations had converged phenotypically; these markers introgressed at higher than expected rates. This result is consistent with experimental work indicating adaptive introgression of both biotic and abiotic traits in H. annuus ssp. texanus [58,59]. In a larger genome scan study involving these species, Scascitelli et al. [24] found that three of 88 (3.4%) loci were high-divergence outliers but found no evidence of clustering of outliers (all outliers were on different linkage groups).

Other studies of genomic patterns of reproductive isolation have considered genomic scan data in the context of introgression patterns in hybrid zones [60]. A recently described ‘genomic clines’ method [45] compares introgression patterns at individual loci relative to the genomic background pattern of introgression to detect several different forms of selection. Using this approach, Gompert & Buerkle [45] reanalysed a dataset from two independent H. annuus/H. petiolaris hybrid zones and found that 16 of 61 loci differed significantly between zones, compared with one in the original analysis [44]. These results suggest greater intraspecific variation for isolating factors than originally reported. Another study taking this genomic clines approach considered three hybrid zones between the European aspen species Populus alba and Populus tremula [43]. Circa one-third (33/93) of loci were high-divergence outliers, and two-locus epistasis was common. These results, combined with the fact that genomic patterns of isolation are similar across three hybrid zones, suggest strong intrinsic post-zygotic barriers between the two species.

Both sunflower and aspen hybrid zone analyses found that a large proportion of the genome displayed non-neutral patterns of introgression. However, as with many genome scan approaches, the method of Gompert & Buerkle [45] assumes that a small proportion of the markers analysed are under selection. As discussed above, if this assumption is violated, the number of loci actually under selection may be under-estimated. In addition, one implementation of the method requires that allele frequencies in the parental populations be similar across all loci. Gompert & Buerkle [45] suggest that this can be achieved by choosing markers with fixed or nearly fixed differences between the parental populations. While this seems reasonable for populations in recent secondary contact, for populations with a longer history of hybridization, selecting for the most differentiated loci will bias the estimated genomic background level of admixture, making it more difficult to detect low-admixture outliers and excluding loci that have experienced adaptive introgression from the analysis entirely.

Another promising approach that to our knowledge has very rarely been tried (but see [24]) is to combine genome scan and QTL mapping approaches with comparisons of demographic parameters such as gene flow rates, effective population sizes and divergence times [61,62] among classes of loci (for example, comparing outliers versus non-outliers or candidate genes versus non-candidate genes). Genes associated with reproductive isolation or species differences should exhibit lower levels of gene flow and potentially smaller local effective population sizes. Variation in divergence time estimates among loci might also be used to suggest specific modes of divergence. For example, when populations are initially strongly isolated at one or a few loci while gene flow throughout the rest of the genome is unimpeded, we might expect to see older divergence time estimates in analyses of datasets containing just these loci compared with datasets containing only neutral loci; however, if this initial isolation results in a substantial genome-wide decrease in gene flow between incipient species, variation in divergence time estimates among datasets would be very limited. Current methods for estimating these parameters [61,62] assume selective neutrality, and it is not clear how selection may affect parameter estimation. A recent simulation study in which a single locus that had undergone a divergent selective sweep was analysed together with neutral loci showed that current effective population size and gene flow estimates were biased downwards as expected, while divergence time estimates were largely unchanged [63]. However, more detailed studies, including studies analysing neutral and selected loci separately, are required to more thoroughly validate this approach. Moreover, estimates of gene flow timing among loci might help determine the order in which reproductive barriers or species differences emerged. Current methods for estimating long-term gene flow timing are not informative [64], but this would be a valuable avenue for future work.

4. Conclusions

Above, we have tried to highlight examples of plant genome scan studies that go beyond the simple identification of anonymous outliers. While these more enriched approaches are still relatively few, they might already provide some insights into probable modes and models of species divergence. For example, cases where loci identified as outliers have been mapped to genomic locations suggest two emerging patterns. First, initial results indicate that there is often a poor match between genomic regions containing high-divergence outliers and regions containing QTLs for phenotypic differences or reproductive isolation. There are a number of possible explanations for this lack of congruence across methodologies, both biological and technical. For example, the traits measured in QTL studies might be unimportant as barriers to gene flow in natural populations. Alternatively, if genomic regions of differentiation are small, genomic scan data may be too coarse to detect QTL effects on divergence. In addition, given the many potential sources of false-positives in genome scans, it is possible that some uncorroborated outliers may be artefacts.

Second, genome scans of hybridizing populations or species thus far provide little evidence for large genomic islands of divergence (see electronic supplementary material, S1) as predicted by DH [14]. However, much higher density scans, and comparisons of genomic patterns of divergence of both sympatric and allopatric populations, will be required to address the evidence for or against this and other such models of genomic divergence.

Realizing the promise of genome scan methods clearly requires more theoretical, methodological and empirical development. We have already highlighted the need for clearer expectations about patterns of genomic divergence under specific temporal, selective and genetic scenarios, for more careful consideration of alternative hypotheses in the interpretation of genome scan studies, and for more nuanced models and methods with which to analyse these data [3,12]. Finally, if (as we hope) genome scans are to contribute to our understanding of the genomic architecture of adaptation and speciation, and the nature of the genes involved in these processes, they must move beyond the identification of outlier regions to the dissection and characterization of their underlying genes. To do so will require integration with complementary approaches such as QTL mapping and sequence and functional analyses, which can reveal the phenotypes and physiological responses associated with these loci and uncover the exact genetic changes and functions of the genomic regions involved [5].

Acknowledgements

This work was supported by a National Institutes of Health Ruth L. Kirschstein Postdoctoral Fellowship (5F32GM072409-02) to J.L.S. and an NSERC grant (327475) to L.H.R.


References
