Recently, the study of ancient DNA (aDNA) has been greatly enhanced by the development of second-generation DNA sequencing technologies and targeted enrichment strategies. These developments have allowed the recovery of several complete ancient genomes, a result that would have been considered virtually impossible only a decade ago. Prior to these developments, aDNA research was largely focused on the recovery of short DNA sequences and their use in the study of phylogenetic relationships, molecular rates, species identification and population structure. However, it is now possible to sequence a large number of modern and ancient complete genomes from a single species and thereby study the genomic patterns of evolutionary change over time. Such a study would herald the beginnings of ancient population genomics and its use in the study of evolution. Species that are amenable to such large-scale studies warrant increased research effort. We report here progress on a population genomic study of the Adélie penguin (Pygoscelis adeliae). This species is ideally suited to ancient population genomic research because both modern and ancient samples are abundant in the permafrost conditions of Antarctica. This species will enable us to directly address many of the fundamental questions in ecology and evolution.
The field of ancient DNA (aDNA) research has witnessed remarkable growth and achievements in its relatively short history. In its first two decades, aDNA research focused primarily on the recovery of short fragments of DNA amplified through the polymerase chain reaction (PCR). Typically, results were limited to mitochondrial loci, and often sequences were collected from only one or a small number of samples . Nonetheless, informative DNA sequences were reported from a wide range of species, including ancient humans  and Late Pleistocene and Holocene megafauna [3,4]. Over the last decade, however, there has been a period of great expansion in aDNA research as the field has incorporated improved DNA extraction techniques, second-generation sequencing technologies, and more recently, whole-genome capture methods . The most prominent publications in aDNA research during this period have exploited second-generation sequencing . These publications have rapidly shifted the focus on the field to the recovery of full mitochondrial and nuclear genomes of targeted species, and have included draft nuclear genomes of human and woolly mammoth samples from the Holocene [7–9], Neanderthal and Denisovan specimens from the Late Pleistocene [10,11], and, astoundingly, a 700 000 year old horse genome . Simultaneously, such efforts are exploring the limits of DNA preservation, recovery and sequencing . Conversely, numerous smaller studies have used improved aDNA extraction and amplification methods to recover informative loci from relatively large numbers of preserved samples within a targeted species [14,15]. It is clear that both strategies have merit. For example, studies of ancient humans will probably always be limited to a small number of available samples . Yet the value of these data in understanding the history of our own species is important, and consequently, substantial resources have been dedicated to the production of individual draft hominin genomes [9–11]. Alternatively, in taxa with larger numbers of preserved samples, more extensive sampling focusing on single or several loci can allow direct testing of hypotheses about a species' history . At the same time, neither strategy can be considered ideal, and it is a difficult challenge to balance the number of individuals sampled and the number of loci sequenced in order to produce the most robust and informative datasets within economic constraints.
To date, there have been no published examples that incorporate strategies featuring the application of both improved DNA recovery methods and high-throughput sequencing technologies to large numbers of ancient samples. Such examples can provide insights into questions regarding genomic responses to environmental pressures and provide useful data for directly testing evolutionary hypotheses, for example, whether the molecular signatures observed in the data could be explained by selection or by neutral mutational process.
In this paper, we present the initial results of genomic sequencing of large numbers of ancient samples of the Adélie penguin (Pygoscelis adeliae Hombron and Jacquinot). Modern and ancient breeding colonies of Adélie penguins are abundant in the Ross Sea region of Antarctica, allowing large numbers of both modern and ancient samples to be collected. In addition, the cold and dry Antarctic environment provides ideal conditions for the preservation of bone and soft tissue, and at the same time, decreases the rate of degradation of endogenous DNA and contamination of sample tissues by fungi and bacteria. Last but not least, Adélie penguin colonies feature typically high rates of natal return , such that single colonies can provide stratigraphic records spanning thousands of years.
2. The benefits of a population genomics approach
The concept of population genomics has been foreshadowed in a number of reviews [18,19] and has been discussed in relation to conservation biology , adaptation , landscape-evolution effects  speciation [23,24] and molecular ecology , among others. In general, the strengths of a population genomics versus a population genetics approach are twofold. First, as a direct extension of population genetics, population genomics offers increased accuracy and an improved ability to detect outlier loci for traditional population genetic estimations, such as effective population size and relatedness of populations [19,26]. Second, population genomics allows for more effective exploration of questions of ‘how?’ and ‘why?’ rather than simply ‘what?’ and ‘when?’. For example, since the entire or large parts of the genome can be interrogated at once, it is possible to test broadly for regions of the genome under selection, and from these tests form hypotheses relating adaptation to life-history traits and historic environmental and climate trends (e.g. ). Alternatively, broad patterns of heterozygosity at variant sites across the genome may be used to interrogate long-term demographic trends, which in turn may be used to better understand historic genetic admixture patterns . Ultimately, greater exploration and application of a species' nuclear genome should result in finer resolution of population demographic parameters, more robust interpretations of species' or population histories, and a wider range of questions that can be addressed in research efforts . A population genomics approach should greatly benefit aDNA research in particular, as to date the great majority of population genetic studies using historic samples have been limited to very small regions of the (primarily) mitochondrial and nuclear genomes [14,15]. Inclusion of a broader sampling of the nuclear genome in these systems will alleviate dependence on single to several loci and should increase the statistical robustness of results.
Aside from financial considerations, perhaps the greatest challenge in the broad application of a population genomics approach to both modern and ancient taxa is the lack of reference genomes in non-model taxa. This challenge has been addressed with numerous strategies , and certainly will be alleviated by degrees as an increasing number and diversity of genomes is sequenced. Still, the inherent characteristics specifically of ancient nucleic acids limit the generation of genome-scale data in taxa without a reference genome. For example, consistent de novo assembly of shotgun-sequenced degraded DNA fragments is unlikely at the genome scale . The highly fragmented nature of aDNA combined with typically low endogenous contents would also pose great challenges to many reduced representation library approaches, such as restriction-site associated DNA sequencing . Furthermore, poor conservation of RNA molecules, generally precludes generation of transcriptome sequence (although see ). Alternatively, more effective starting points are either using a genomic assembly available from one or more closely related taxa (i.e. use of the human genome for assembly of the Neanderthal genome ) or de novo genomic sequencing and assembly in a contemporary representative of a targeted taxon. In either case, significant portions of the genome can be identified and targeted for sequencing and/or assembly through targeted enrichment or shotgun sequencing strategies in ancient samples.
3. Developing strategies to counter low endogenous DNA in ancient samples
From the earliest reports of aDNA sequencing, it has been clear that most ancient samples contain low amounts of both total and endogenous DNA [32,33]. This is due both to spontaneous damage (e.g. oxidation, hydrolysis) resulting in fragmentation and difficulty in enrichment of endogenous sequences, and also to colonization of tissues by degrading fungi and bacteria [34,35]. Prior to second-generation sequencing technologies, it was difficult to determine the endogenous DNA contents of ancient samples. Rather, efforts were made at quantifying whole-DNA extractions typically through fluorescence measurements (e.g. ). Second-generation sequencing platforms, which rapidly produce millions to billions of base pairs (bp) of sequence data in typically short (tens to hundreds of bp) fragments, are well suited both for aDNA sequencing in general and for quantifying endogenous DNA contents of ancient samples specifically. Application of these platforms to aDNA studies has greatly facilitated direct measurement of endogenous DNA contents from a variety of samples.
In a broad sense, expected trends have been supported, for example, specimens preserved under relatively optimal conditions (e.g. permafrost) tend to have higher endogenous DNA contents. Notably, Middle Holocene woolly mammoth and Palaeo-Eskimo samples have yielded very high (45–94%) levels of endogenous DNA [7,8,37]. On the other hand, selected maize samples aged to ca 700 years and from temperate environments have also yielded high (greater than 90%) endogenous DNA contents , and Denisovan samples from a cave in southern Siberia aged 30 000–50 000 years contained up to approximately 70% endogenous DNA contents [11,39]. More typically, however, levels of endogenous DNA in ancient samples range from ca less than 1 to 10%. For example, efforts with Neanderthal specimens have reported from ca 0.2 to 6% endogenous contents [10,40,41], while levels from Pleistocene cave bear samples have ranged from 1.1 to 5.8% . It is also important to note that many of these values result from ‘best case scenarios’. For example, the levels of endogenous Neanderthal aDNA reported by Green et al. (ca 6%)  represented the most promising sample among over 70 bone and tooth samples tested.
The direct quantitation of endogenous DNA contents in ancient samples has helped spur the recent development and application of enrichment and targeting strategies associated with second-generation sequencing. Perhaps the simplest approach involved application of several restriction enzymes to Neanderthal genomic libraries to selectively cleave contaminating bacterial sequences , effectively resulting in a four- to sixfold increase in the proportion of endogenous DNA. Targeted enrichment strategies using multiplex PCR and both array- and solution-based hybridization have also been applied to selected nuclear loci and organelle genomes [38,42–44], the nuclear exome  and, most recently, full chromosomes and nuclear genomes [5,45]. Finally, an innovative single-stranded library preparation technique has recently been developed , which increases representation of intact strands of highly degraded DNA. To some degree, all of these advances have been in response to limitations in sample quality or quantity, or to economize data generation. For taxa that consistently yield high levels of endogenous DNA from ancient samples, it is possible that these strategies will not be necessary. Ultimately, any given project will require a balance between the amount and quality of samples available, financial constraints and the amount of data appropriate to pursue individual research questions.
4. The Adélie penguin system
Adélie penguins represent the dominant feature of the Antarctic terrestrial fauna during the austral summer. With a total population size estimated around 2.5–3 million breeding pairs , Adélie penguins comprise approximately 90% of the avian biomass of Antarctica . Breeding colonies are irregularly distributed around the Antarctic continent, and are almost exclusively restricted to ice-free coastlines. For example, around 30% of the total breeding population is found along the coastline of the Ross Sea, while single large colonies (greater than 100 000 breeding pairs) are present in similar habitat on both the Antarctic Peninsula and Rauer Island . There is direct evidence that individuals of the species exhibit strong natal return over significant periods of time , although this pattern may be disrupted by environmental alterations, such as the presence of mega-icebergs . Trophic interactions involving the Adélie penguin are relatively simple, as adults are preyed upon almost exclusively by leopard seals . In turn, the great proportion of the Adélie penguin diet consists of krill (Euphausia spp.) and silverfish (Pleuragramma antarcticum) . Adélie penguin eggs and young are also preyed upon by large seabirds, most prominently by the South Polar skua (Catharacta maccormicki) .
In the Ross Sea area, there are a large number of modern colonies of Adélie penguins (figure 1a). In addition, abandoned colonies in this region were occupied in the Holocene period (figure 1b). Finally, there are a smaller number of locations that were occupied in the Late Pleistocene period, which are found south of the Drygalski Ice Tongue (figure 1c). The ages and occupation periods of many of these colonies have been determined by radiocarbon methods [53–55]. In general, these records indicate that the Adélie is both a resilient species and capable of colonizing new areas. Several colonies, for example those found at Inexpressible Island, Franklin Island and Adélie Cove, have supported Adélie penguin occupation for periods up to approximately 7500 years, while most modern colonies are less than 2000 years old  and some abandoned colonies appear to have been used over time intervals of only several hundred years . Historic demographic patterns have been largely correlated with changing glacial and sea ice conditions, which are in turn linked to climatic trends, and reflect the dynamic nature and environmental sensitivity of the Adélie penguin species documented in studies of contemporary colonies [56–58].
Supported by this intricate understanding of Adélie and the environmental history of the Ross Sea area, aDNA sequences obtained from ancient Adélie penguin samples have been applied to questions of broad evolutionary significance. The mitochondrial locus hypervariable region 1 as well as complete mitochondrial genomes has been used to directly measure rates of evolution in comparison to phylogenetic-based estimates [59–61], and to determine coalescence times between two distinct and highly variable mitochondrial lineages . Nuclear microsatellite loci, on the other hand, have been used to demonstrate and quantify microevolutionary change  over a period of several thousand years. Although studies such as these reveal important insight into evolutionary patterns, thus far results have been limited by PCR-based methodology to a relatively small amount of sequence. Application of second-generation sequencing to well-preserved Adélie samples could not only strengthen results such as these, but also broaden the scope of testable evolutionary hypotheses through generation of increased numbers and types of sequenced loci.
(a) Adélie library characteristics and genomic coverage
We prepared 56 indexed ancient Adélie penguin libraries from samples collected from the Ross Sea area of Antarctica and C14 aged from several hundred to ca 7000 years, using methods optimized for aDNA extraction and sequencing [63,64]. Preliminary sequencing of these samples in a full flow cell of the Illumina HiSeq2000 (eight samples per lane) was performed in order to estimate genomic coverage and endogenous DNA contents and potentially recover high-copy loci, and resulted in an average of 15.5 million total sequence reads per sample (3.2–24.7 million reads range).
In the case of the Adélie penguin, limited, but high quality, genomic resources are available, as this species' genome has been sequenced through a larger consortium project (http://phybirds.genomics.org.cn/Phylogenomics analysis of Birds). The genome consists of 1.203 billion bp of sequenced DNA assembled in 752 scaffolds and 165 fully contiguous sequences. Approximately, 15 300 mRNA sequences have been annotated, consisting of over 140 000 coding sequences, although less than two-thirds of these have been assigned putative functions. In addition, several fully sequenced mitochondrial genomes for the Adélie penguin are available in Genbank. From mapping sequence reads in our prepared libraries to the draft nuclear genome and a representative mitochondrial genome (Genbank ID KC875855.1), endogenous contents ranged from 0.01 to 60.55% (16.83% average), and were highly variable within sampling locations (figure 2). As expected, the great majority of endogenous reads (greater than 99.6% average) mapped to the nuclear genome.
The average length of reads aligning to the reference genomes was calculated at just over 56 bp, with reads aligning to the mitochondrial genome slightly longer than reads aligning to the nuclear genome (58.23 bp compared with 56.34 bp, respectively; p = 0.0141; table 1). On average, approximately 10% of the nuclear genome was covered by one or more reads (123.7 million positions on average), while 16.1 thousand positions of the 17.5 kbp mitochondrial genome were covered. The resulting average coverage depths were 0.136 and 33.39 for the nuclear and mitochondrial reference genomes, respectively (table 1). We found that neither endogenous DNA content nor read length were strongly correlated to the age of the sample, nor was average read length correlated to the endogenous DNA content (r2 < 0.05 in all cases). Conversely, the total number of aligned bases (i.e. the total number of aligned reads multiplied by read length) was positively correlated to the number of variant sites identified when mapping reads to either the nuclear or mitochondrial reference genome (figure 3). This relationship was relatively weak for the mitochondrial genome due to saturation effect owing to its small size. Saturation was also probably responsible for higher rates of duplication seen in reads mapping to the mitochondrial versus nuclear genome (4.74% compared to 20.13% average, respectively; p < 0.001), as there was a stronger correlation between coverage depth and duplication levels for reads aligned to the mitochondrial genome than the nuclear genome (y = 0.4059x + 6.5749, r2 = 0.84906 versus y = 14.02x + 2.8374, r2 = 0.48116, respectively).
(b) Full mitochondrial genomes add insight into rates of evolution
We also sequenced complete mitochondrial genomes of 22 modern Adélie penguin individuals to an average coverage depth greater than 20×. Combined with the assemblies from ancient samples discussed above and previously published mitochondrial genomes, we analysed a total of 29 selected ancient and 46 modern Adélie penguin mitogenomes to estimate the rate of mitochondrial sequence evolution. The Bayesian statistics based Markov chain Monte Carlo method employed in the software BEAST was used for this purpose . The best model of sequence evolution was determined by the program Modeltest . We used an uncorrelated lognormal molecular clock to estimate the rate. We first used synonymous codon positions to estimate the neutral rate of evolution. Our analyses produced a rate of 0.07 (highest posterior density, HPD 0.042–0.100) substitutions per site per million years (s s−1 Myr−1) (table 2; figure 4a) under either a constant or exponential population growth model. We also examined rates of evolution in constrained sites such as non-synonymous sites and RNAs using methods described previously . The rate at non-synonymous sites was nearly an order of magnitude slower than that at neutral sites (0.007 s s−1 Myr−1; HPD 0.002–0.012). In turn, the rate for rRNA loci (0.01 s s−1 Myr−1) was close to that of non-synonymous sites, while the rate for tRNA loci (0.02 s s−1 Myr−1) was approximately twice that of the former. Our rate analysis using neutral synonymous sites revealed that the coalescence age of Adélie penguin populations used in this study is 101 000 years (HPD 55–167 kya; figure 4b). The time of divergence between the two major Adélie penguin mitochondrial haplotypes, the Antarctic and Ross Sea lineages, was estimated to be 53 kya (HPD 44–68 kya).
Our understanding of the state of preservation of many ancient Adélie penguin remains preserved over serial time points, together with the availability of modern samples suggests this species is ideal for a comprehensive ancient population genomic study. Most importantly, the presence of large numbers of well-preserved samples contributes to sequencing success and to the ability to perform robust tests of evolutionary hypotheses. In this study, we used modifications to DNA extraction procedures and an improved genomic library building method [63,64] to efficiently build a large number of genomic libraries from ancient Adélie penguin samples for sequencing on the Illumina platform. These samples contained relatively high levels of endogenous DNA (ca 17% on average) and are amenable to increased sequencing efforts. Even with the preliminary sequencing effort reported here (eight samples per lane of an Illumina HiSeq), these libraries produced 56 mitochondrial genomes, at an average coverage depth of over 20× and over 90% completion. Our efforts also produced on average over 120 million bp of nuclear data per sample, with consistently low rates of sequence read duplication. Because of the relative high diversity of sequence reads in these genomic libraries, it is likely that increased sequencing effort will allow the entire nuclear genome of most of these samples to be sequenced. Alternatively, it is likely that a targeted enrichment strategy could be effectively applied in Adélie penguins to efficiently standardize the portion of the nuclear genome sequenced. This may be particularly important as older samples are incorporated into our sequencing efforts, for example from preserved remains aged prior to the last glacial maximum .
The evolutionary rates estimated using 75 complete mitochondrial genomes were very similar to our previous estimates using a much smaller dataset of 20 mitogenomes (table 2) . For instance, the rate at synonymous sites estimated in this study (0.07 s s−1 Myr−1—HPD 0.042–0.1) was nearly identical to our previous estimate (0.073 s s−1 Myr−1—HPD 0.025–0.123). However, the increased dataset used in this study resulted in a narrower HPD interval. These estimates offer further support for our claim of higher mitogenomic rates in birds and penguins, in particular . Similarly, these results are consistent with our previous estimates of coalescence times within and between the two distinct mitochondrial lineages of the Adélie penguin (the Antarctic and Ross Sea lineages) based on sampling of a smaller portion of the mitochondrial genome [61,62]. The divergence time between Antarctic and Ross Sea mitochondrial lineages estimated in this study (53 kya; HPD 44–68 kya) overlaps with that estimated previously (ca 63 kya). However the ages of the Antarctic (44 kya) and Ross Sea (50 kya) lineages estimated here were much higher than those estimated in our earlier work (18 and 19 kya, respectively). As previous estimates were based only on modern samples, it is probable that the observed older ages of these lineages were due to inclusion of ancient penguin mitochondrial genomes. This suggests the extinction of a number of ancient penguin populations belonging to these two lineages.
We have reported here only the mitochondrial genomes generated in this study. We intend to increase sequencing efforts to produce complete nuclear genomes of ancient and modern Adélie genomes. The large number of genome-wide population polymorphisms identified by this work will provide new insights on the genetics and evolution of Adélie penguins over time. For example, estimates of rates and patterns of genetic variation throughout the genome will help to identify regions showing significant deviations with respect to mutation and recombination, including non-coding regions under non-neutral evolution. This will further help to quantify and localize episodes of purifying selection and adaptive evolution in Adélie penguin genomes, and identify conserved regulatory elements. Furthermore, single nucleotide polymorphism in coding genes could be used to identify protein functional pathways influenced by positive selection. This might elucidate adaptive functional responses to increased temperatures in the Antarctic. Although previous studies have examined some of the patterns mentioned above they were generally based on one or several genes [59,60]. A major drawback in such studies is that they fail to distinguish the gene-specific from genome-wide evolutionary patterns. As we will examine all or most of the genes in the Adélie genome we will be able to quantify the mean genome-level mutation or selection patterns and hence could measure how individual genes deviate from the genomic mean.
Although results from ancient Adélie penguin samples are promising, there are probably many other ancient systems that will also prove highly suitable for the application of high-throughput sequencing for large numbers of samples. For example, previous work on large Arctic and subarctic mammals, such as the Beringian steppe bison, brown and cave bears, and woolly mammoth, reported the recovery of DNA from nearly 30 to over 400 samples [67–70]. At the same time, it is also likely that taxa from more temperate environments will provide useful data. Previous suggestions to pursue aDNA sequencing in brine shrimp from hypersaline lakes or extinct avian taxa from Polynesian islands (, see also [1,72,73]) seem prescient in the light of second-generation sequencing advancements, as large numbers of chronologically dated samples are potentially available. Alternatively, sample-rich temperate systems might be revisited with updated technologies, such as the New Zealand brown kiwi  or various rodent species ([75,76], see also ). Finally, while the majority of aDNA research has traditionally targeted humans and other animals , certainly other lineages, including plants and bacteria, should not be overlooked. The successful sequencing of nuclear loci from a preserved conifer species  suggests that ancient woody plant samples may be directly sampled and could be excellent targets for high-throughput aDNA studies.
In concert with the development of increasingly high-throughput sequencing technologies and improved aDNA extraction techniques, the field of aDNA is poised to make great and robust empirical contributions to evolutionary biology as it moves into its fourth decade. In particular, if focus is shifted toward ideal taxa (i.e. those with large numbers of high-quality samples), we will rapidly gain a better understanding of genomic evolutionary changes from the Pleistocene and Holocene periods to the present.
This research was funded by the Australian Research Council in the form of a linkage grant LP110200229: ‘How will animals respond to climate change? A genomic approach’. Logistic support was provided by Antarctica New Zealand.
The authors would like to thank the following people and groups: Indy Siva from Griffith University, Edan Scriven from the University of Queensland, and the Queensland Cyber Infrastructure Foundation for computing resources and assistance, Yvette Wharton for field assistance, Vivian Ward for graphics, and Tim Heupink and Leon Huynen for valuable comments on the manuscript.
One contribution of 19 to a discussion meeting issue ‘Ancient DNA: the first three decades’.
- © 2014 The Author(s) Published by the Royal Society. All rights reserved.