Prologue
‘As the study of natural science advances, the language of scientific description may be greatly simplified and abridged. This has already been done by Linneaus and may be carried still further by other invention. The descriptions of natural orders and genera may be reduced to short definitions, and employment of signs, somewhat in the manner of algebra, instead of long descriptions. It is more easy to conceive this, than it is to conceive with what facility, and in how short a time, a knowledge of all the objects of natural history may ultimately be acquired; and that which is now considered learning and science, and confined to a few specially devoted to it, may at length be universally possessed in every civilized country and in every rank of life’. J. C. Louden 1829. Magazine of natural history, vol. 1.
This article is part of the themed issue ‘From DNA barcodes to biomes’.
For more than two centuries, biodiversity science has focused on the inventory of species, on probing their relationships and on clarifying the factors responsible for their diversification. The sheer diversity of life, the fact that millions of species of multi-cellular organisms await description, is a serious barrier to scientific progress. Moreover, morphological approaches cannot enable the census of these species in a timely or affordable fashion; describing five million animal species has been estimated to cost $250 billion and to require six centuries. Eleven years ago, Savolainen et al. considered the possibility that DNA barcoding might allow the encyclopedia of life to be written in decades rather than a millennium. The present issue considers progress towards this goal and provides a glimpse of the ways in which DNA barcoding is transforming biodiversity science. The 16 articles included in this issue derive from plenary presentations at the 6th International Barcode of Life Conference held in August 2015. When coupled with the conference abstracts, they make clear that DNA barcoding is contributing to rapid scientific progress on diverse fronts. This outcome might not have been predicted just a decade ago.
When the Natural History Museum in London hosted the 1st International Barcode of Life Conference in 2005, it anticipated a lively discussion with an uncertain outcome. Some researchers involved in large-scale biodiversity inventories viewed DNA barcoding as a breakthrough [4–6], but endorsements from other segments of the community were restrained. Because prior genetic approaches [7–9] had modest impact on their workflows, some taxonomists anticipated that DNA barcoding would also have limited influence. Others highlighted the risks in basing taxonomic decisions on sequence variation in a single gene, noting the potential complexities introduced by paraphyly and polyphyly and by the possible discordances between gene trees and species trees. These concerns could only be addressed by examining the efficacy of DNA barcoding in varied taxonomic assemblages in diverse environments. About five million specimens have now been analysed, and these results indicate that DNA barcodes can discriminate most species.
The balance of this introductory article considers the factors that were important in mobilizing a DNA barcode research community, and the issues that needed consideration in construction of the reference library. It also addresses the effectiveness of DNA barcoding as a tool for specimen identification and species discovery before examining the scientific impacts of work in this field and future prospects.
2. Community mobilization
More than 1000 publications on DNA barcoding appeared in 2015, a count higher than that for many other major scientific programmes (figure 1). The growth in interest and global involvement in this field is further signalled by the increasing participation in the International Barcode of Life Conferences (figure 1); 600 researchers from 55 nations joined the latest meeting.
DNA barcoding has rapidly become the largest research collaboration in biodiversity science. What provoked this? The establishment of the Consortium for the Barcode of Life (CBOL) in 2004 was a key development. It galvanized the community and quickly organized meetings to advance understanding of DNA barcoding, including the International Conferences in London (2005), Taipei (2007), Mexico City (2009) and Adelaide (2011). CBOL also worked with researchers to achieve consensus on the best DNA barcode marker(s) for each eukaryote kingdom [15–18]. While these activities were critical, there was also a great need to clarify the efficacy of DNA barcoding. The Gordon and Betty Moore Foundation sponsored the first large-scale evaluations [19,20], and the flow of data was reinforced with activation of the Canadian Barcode of Life Network in 2005. By 2007, it was recognized that the development of a global DNA barcode reference library required a broad alliance, stimulating plans for iBOL, the International Barcode of Life project (www.iBOL.org), which aimed to deliver barcode records for 500 000 species within 5 years of activation. Because substantial resources (more than $100 million) were required, and plans called for research nodes in 25 nations, it took 3 years before fundraising and network development were sufficiently advanced for its activation. National barcode networks were ultimately established in 11 countries (Argentina, Austria, Brazil, Canada, China, Finland, Germany, Mexico, Norway, South Africa and Switzerland), but they emerged asynchronously; those in Argentina and Mexico launched in 2008 and 2009, while the Austrian and Norwegian networks began work in 2014. Researchers in other countries (e.g. Costa Rica, France, Kenya, The Netherlands, UK and USA) made major contributions without a formal network. Despite this organizational fluidity and varied activation dates, iBOL met its target for species coverage in August 2015 (figure 2).
Although CBOL, iBOL and the national networks stimulated the rapid rise of DNA barcoding, these grant-funded entities had finite lifespans. As a result, the research community needed to assume certain activities initiated by CBOL, such as the International Barcode of Life Conferences, and responsibility for their organization transitioned to countries with lead roles in iBOL (China, 2013; Canada, 2015; South Africa, 2017). Looking to the future, there will be an ongoing need for a research consortium to ensure that barcode coverage is extended efficiently and to aid the acquisition of the funds required for this purpose.
3. Constructing the DNA barcode reference library
Although DNA barcoding is conceptually simple, the assembly and curation of sequence information from one or more standard gene regions across millions of species is challenging. Over the past decade, improved laboratory protocols have simplified barcode acquisition [22–24]. As a result, five million specimens were analysed by July 2016, providing coverage for some 60 000 plant and 450 000 animal species, although many of the latter taxa were undescribed. As these totals likely represent no more than 20% and 5% of the species in these kingdoms, respectively, much work remains. However, achieving the level of barcode coverage required for an effective identification system is a realistic goal for the biotas of Europe and North America by 2025. Completion of the global library might require the analysis of 100 million specimens, presuming a target of 10× coverage per species, but it could be completed in a few decades with adequate resources. Because achieving a well-parametrized global library will require specimens, sequence analysis and data management, the rest of this section considers these matters in more detail.
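The coverage figures quoted above imply rough lower bounds on the size of the task. The short calculation below is only an illustrative sketch of that arithmetic; the variable names are ours, and the percentages are treated as upper limits on coverage, as the text states.

```python
# Figures quoted in the text (status as of July 2016).
plant_species_barcoded = 60_000
animal_species_barcoded = 450_000
plant_coverage_max_pct = 20   # "no more than 20%" of plant species
animal_coverage_max_pct = 5   # "no more than 5%" of animal species

# If coverage is at most these percentages, each kingdom holds at least
# this many species (integer arithmetic keeps the results exact).
min_plant_species = plant_species_barcoded * 100 // plant_coverage_max_pct
min_animal_species = animal_species_barcoded * 100 // animal_coverage_max_pct

# A global library of 100 million specimens at 10 specimens per species
# corresponds to this many species.
global_specimens = 100_000_000
coverage_per_species = 10
species_supported = global_specimens // coverage_per_species

print(f"plants:  at least {min_plant_species:,} species")
print(f"animals: at least {min_animal_species:,} species")
print(f"10x library covers {species_supported:,} species")
```

On these assumptions, the 100 million specimen target corresponds to about 10 million species, which is consistent with the implied minimum of roughly 9.3 million plant and animal species combined.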
(a) Sourcing specimens
The most expensive component in DNA barcode analysis is specimen acquisition. Obtaining sets of many thousands of voucher specimens with expert annotation requires enormous effort. Viewed from this perspective, natural history collections are a valuable legacy, especially herbaria, as barcode recovery is high, even from specimens a century old. Some animal groups are challenging, particularly those preserved in formaldehyde, but others are more tractable. Because of their greater sensitivity, high-throughput sequencers (HTS) allow barcode recovery from specimens recalcitrant to Sanger analysis. Their use to obtain barcodes from type specimens is particularly important as the resulting data serve to create a searchable index of specimens linked to binomials, facilitating the correct application of names and the resolution of synonymies [30–32]. While the analysis of museum specimens will extend barcode coverage for named species, new collections will be required for groups that have seen little taxonomic attention and for under-collected regions of the planet. However, as evidenced over the past decade, the biodiversity science community has a strong capacity to make collections. In considering the task ahead, it is important to emphasize that a highly effective identification system is achieved long before the last species is analysed because most surveys encounter common, widely distributed taxa rather than those that are either very rare or narrow endemics. Moreover, when one of the latter taxa is encountered, its presence is ordinarily signalled by its assignment to a new barcode cluster, provoking referral of the specimens to a taxonomist working on the group, allowing its subsequent inclusion in the barcode reference library.
(b) Acquiring sequences
Presuming access to specimens, their barcode sequences must be recovered. As polymerase chain reaction (PCR) is employed to amplify the barcode region from genomic DNA, analysis is disrupted if amplicons are generated from pseudogenes or from bacterial and fungal endosymbionts. Pseudogenes have proven an infrequent problem because they are typically shorter and are present in lower copy numbers than the barcode regions targeted for analysis. Sequences from bacterial endosymbionts are encountered more commonly, but they are easily excluded during data validation. Sequences from fungal endosymbionts can fail to be recognized, especially when the barcode is a DNA region, such as the internal transcribed spacers (ITS) of nuclear ribosomal DNA, which cannot be aligned across phyla, but spurious records will be excised as valid entries are acquired for each species.
Aside from problems introduced by the recovery of non-target DNA, library construction is also impeded if PCR fails to generate an amplicon, a situation that arises because no primer set is truly universal. This is particularly pertinent to maturase K (matK), one of the two core barcode markers for vascular plants, as existing primer sets have high failure rates. Recovery of cytochrome c oxidase subunit I (COI) from animals is more reliable, although each primer set targets a particular constellation of species (e.g. fishes, insects). While these primer sets are effective for their designated group, they occasionally fail, especially in fast-evolving lineages. Although amplification success could be raised by adopting a more conserved gene region, this would reduce taxonomic resolution. Moreover, the difficulty in recovery of COI amplicons has been exaggerated by in silico predictions of primer binding. In practice, primer sets employed for animals have strong performance with, for example, a single primer set generating sequences from 88% of specimens in 579 insect families. This result and those from similar studies on other groups of animals indicate that amplification failure is too infrequent to justify the shift to a more conserved gene region. However, there is a need for further work on primer design to conquer problems in certain groups, especially some marine taxa.
Presuming amplicon recovery, sequence characterization is the next step in the analytical chain. The barcode standard currently requires bidirectional Sanger analysis of each amplicon, an approach that generates a high-fidelity, full-length read. In practice, unidirectional analysis delivers reads that meet key elements of the standard, suggesting the possibility of relaxing the requirement for bidirectional coverage. Aside from considering this adjustment, the barcode standard needs to be revisited in light of the very different attributes of the sequence records generated by HTS. It is certain that the volume of data generated by these platforms will rapidly expand because they enable the barcode characterization of bulk DNA extracts, a key advance for environmental monitoring [42–49]. A shift to HTS for barcode recovery from single specimens might also reduce costs for library construction [50,51], but substantial work will be needed to optimize data quality and bioinformatics protocols. Certainly, for the foreseeable future, the barcode standard needs enough flexibility to recognize the validity of records generated by different sequencing platforms so long as they satisfy the requirements for sequence quality, length and verifiability.
(c) Data management
The early development of BOLD, the Barcode of Life Data System, has been critical for the storage, validation and analysis of DNA barcode records generated via Sanger sequencing. Because it couples specimen and sequence information, this platform plays an increasingly important role as data volumes expand. Moreover, BOLD is gaining the capabilities needed to support large-scale biodiversity analyses. For example, its Barcode Index Number (BIN) system automates the delineation of molecular operational taxonomic units as proxies for animal species and embeds each new BIN in a persistent registry. Work is also underway to allow BOLD to automatically position new BINs in the Linnaean hierarchy by exploiting taxonomic information linked to barcode records from known species. There will be a need for sustained vigilance to ensure that specimens providing barcode records have reliable taxonomic assignments. While major errors are easily recognized, misidentifications of closely allied species require careful examination to recognize and correct [56,57]. The development of a barcode library for known species is aided by ongoing efforts to create a registry of valid species names, but ‘dark taxa’, those only known from their DNA barcode sequences, will represent an increasingly important challenge. Although it is ultimately desirable to have all specimens with a sequence, a name and associated data, the fact that BINs provide a stable framework for subsequent annotation and data enrichment is a major breakthrough for tackling poorly known mega-diverse groups. Aside from the well-recognized challenges in the storage and analysis of the data generated by HTS, studies enabled by this technology will undoubtedly illuminate massive numbers of dark taxa.
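The general idea of delineating molecular operational taxonomic units from barcode sequences can be illustrated with a toy example. The sketch below is not the algorithm used by BOLD for BINs, which the text does not describe; it is a plain single-linkage clustering of aligned sequences on uncorrected pairwise distance, and the 2% threshold, the function names and the sequences are all assumptions made for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites in two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_single_linkage(seqs: dict[str, str], threshold: float = 0.02) -> list[set[str]]:
    """Group records so that any pair within `threshold` shares a cluster."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical 50 bp aligned records (real barcodes are far longer).
base = "ACGTACGTAC" * 5
divergent = "T" * 10 + base[10:]
records = {
    "spec1": base,
    "spec2": "T" + base[1:],         # 1 substitution relative to spec1
    "spec3": divergent,
    "spec4": divergent[:-1] + "A",   # 1 substitution relative to spec3
}

clusters = cluster_single_linkage(records)
print(clusters)
```

In this toy data set the four records fall into two clusters. A record that joined neither cluster would form a cluster of its own, which is the kind of signal that prompts referral of a specimen for taxonomic study.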
(d) Data sharing and release
The traditional model has been to ‘publish and then release data’. However, wider cultural changes in science, focused on building infrastructure and on access to big data, have driven a shift to rapid data release and sharing. DNA barcoding has tracked this change, transforming from a series of large, connected research projects into a community movement using BOLD as a project management system and as a central repository of searchable barcode sequences.
4. DNA barcodes for specimen identification and species discovery
DNA barcoding is advancing biodiversity science by enabling the automated identification of specimens belonging to known species and by facilitating the recognition of new species. Its capacity to deliver these insights depends upon the reliability with which sequence variation in each barcode region discriminates species. Within the animal kingdom, there is generally a gap between intraspecific and interspecific variation in COI sequences, so barcoding is highly effective. The situation in plants is more challenging; barcode divergences at ribulose-bisphosphate carboxylase (rbcL) and matK are often so low that closely allied species cannot be discriminated [52,60]. Studies on fungi and the many lineages of Protista also indicate cases of variable discriminatory power.
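The gap between intraspecific and interspecific variation mentioned above can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: the sequences and species labels are invented, the function names are ours, and uncorrected p-distance is used for simplicity, although other distance models could be substituted.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites in two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (maximum intraspecific, minimum interspecific) distance.

    `records` holds (species label, aligned sequence) pairs.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need both within-species and between-species pairs")
    return max(intra), min(inter)


# Hypothetical 50 bp aligned records for two species.
base = "ACGTACGTAC" * 5
divergent = "T" * 10 + base[10:]
records = [
    ("Species A", base),
    ("Species A", "T" + base[1:]),
    ("Species B", divergent),
    ("Species B", divergent[:-1] + "A"),
]

max_intra, min_inter = barcode_gap(records)
has_gap = min_inter > max_intra
print(f"max intraspecific: {max_intra:.2f}")
print(f"min interspecific: {min_inter:.2f}")
print(f"barcode gap present: {has_gap}")
```

When the smallest distance between species exceeds the largest distance within species, as in this toy example, a distance threshold separates the species cleanly. Where the two ranges overlap, as the text notes for many closely allied plant species, no such threshold exists.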
DNA barcodes typically discriminate about 95% of known species; cases of compromised resolution involve sister taxa, often species that hybridize [19,20,62,63]. In the many taxa where geographical variation in barcode sequences is small, a few records per species are sufficient to create an effective identification system. However, the analysis of more specimens is advantageous because it often reveals discordances that indicate misidentifications or cryptic taxa, and it also provides insights into the extent of geographical variation in barcode sequences [66,67]. There are two animal phyla in which COI often fails to deliver species-level resolution, sponges [68,69] and some benthic cnidarians, apparently because of their slowed rates of mitochondrial evolution. Barcoding also fails to distinguish a small fraction of species in other groups, typically sister taxa or those whose status is uncertain [71,72]. Conversely, barcode analysis frequently exposes deep ‘intraspecific’ variation, situations that often represent overlooked species as evidenced by covariation with ecological or morphological traits [73–75]. However, some cases have other explanations; they seem linked to the merger of phylogeographic isolates, to rate acceleration or to doubly uniparental inheritance. There remains a need to clarify patterns of DNA barcode sequence variation by examining selected nuclear loci or through genome-wide approaches such as RAD sequencing to extend understanding of factors explaining the origins and maintenance of these cases of deep mitochondrial divergence.
DNA barcoding confronts the challenge that many plant species are exposed to hybridization and introgression, while others have arisen via polyploidy in a near-instantaneous fashion. Moreover, the evolutionary rates of their mitochondrial and plastid genomes are far slower than those in animals, creating a further barrier to species resolution. Given these factors, it is unsurprising that the designation of barcode markers for plants has been challenging. Although it was recognized that they would often not deliver species-level resolution, two plastid markers (matK and rbcL) were selected as the core barcodes for plants, supplemented with ancillary markers such as trnH–psbA, a plastid inter-genic spacer, and the ITS of nuclear ribosomal DNA [80,81]. Researchers focusing on highly degraded DNAs have also used a small plastid region from the trnL intron. Collectively, DNA barcoding has been deployed widely for discriminating plant species or species groups [83–85]. The quest for improved barcode resolution in plants is ongoing. The benefits of complete plastid genome sequencing have been noted by several authors [86–88], although this will not solve identification failures arising from plastid introgression, such as those presumed in Salix. Ultimately, further substantial gains in plant species discrimination will depend on cost-effective, standardized and scalable approaches for accessing data from multiple unlinked nuclear markers [38,52,88].
ITS is the standard DNA barcode marker for fungi and has been widely adopted and used by mycologists [18,61]. The use of sequence data for species discovery and identification is particularly important in this kingdom, because so many fungal species are both undescribed and unculturable. The recovery of ITS barcode sequences is sometimes compromised by intra-individual heterogeneity, reflecting its multi-copy nature, and alignment ambiguities can make it difficult to establish if the recovered sequence derives from the target species or a symbiont. As a consequence, there has been a search for secondary markers. COI has shown strong resolution in some groups, but its utility is constrained because the introns prevalent in fungal mitochondrial genomes often disrupt its PCR amplification from genomic DNA. This fact has provoked studies on diverse nuclear gene regions, such as large and small subunit ribosomal DNA, but no secondary marker has gained broad adoption. As with plants, efforts are shifting towards the incorporation of wider genomic coverage into barcoding workflows, creating a challenge in balancing the need for increased resolution against the requirement for a cost-effective, highly scalable assay. Another key issue for fungi is the growing divide between identified taxa and sequences, driven by the rapid growth of ‘sequences without names’ produced from metabarcoding studies, and also the need to increase the proportion of newly described species that have barcode sequences generated from type material. This parallels the dark taxa challenges for other highly diverse groups [32,39], and further sequencing of fungal types coupled with community agreement on linking sequence-only records to a naming system is a high priority.
Work on protists is in the early stages, but the 18S rRNA gene has been adopted as the core barcode marker with full recognition that this gene region evolves too slowly to provide species resolution in most cases. However, because primers for 18S are effective across diverse phyla, they can provide the sequence information needed for a ‘rough’ taxonomic placement that can be followed by the analysis of secondary barcodes to obtain species-level resolution. The selection of secondary markers for varied protistan lineages is underway, and some core markers, such as COI and rbcL, have demonstrated utility [95,96]. However, it is certain that both the selection and testing of the efficacy of barcode regions will be challenging given the extreme diversity of protistan lineages.
5. Impacts of DNA barcoding
Although motivated by the goal of accelerating the inventory of biodiversity and making taxonomic information more accessible, DNA barcoding is providing opportunities for important investigations in other fields of enquiry [98,99]. The balance of this section briefly considers some of the research areas aided by its advance.
(a) Probing species
DNA barcoding is shifting taxonomic workflows in two ways. Firstly, it is providing an increasingly effective identification ‘service’ for groups with a well-parametrized barcode reference library. Secondly, it is accelerating taxonomic progress by aiding the recognition of species and by facilitating the connection of their life stages and sexes, associations that are often challenging without barcode data. For example, more than half of all genera of phorid flies are only known from one sex, creating high risk for synonymy. DNA barcoding also has a strong role in species discovery, especially in little-studied groups, because it can rapidly screen collections for presumptive species, which can then be targeted for taxonomic study. DNA barcodes are additionally being employed to streamline and expedite species descriptions. In fact, in hyperdiverse groups, the BIN registry may be the terminal taxonomic system, one allowing the assembly of morphological, ecological and distributional data for the members of each barcode cluster.
(b) Probing species assemblages
DNA barcoding is a powerful tool for advancing knowledge of species interactions and distributions [98,99]. It is often the sole way to clarify dietary preferences in taxa where direct observation of feeding behaviour is impossible [105,106]. It can also provide new details on host–parasitoid interactions, on pollination syndromes [108,109] and on symbiotic associations [110,111]. Aside from revealing interactions, DNA barcoding allows the assessment of biodiversity on scales and in settings where this would otherwise be impossible. By exploiting its capacity to improve species recognition and to reveal their interactions, DNA barcoding is also providing new details on food web structure [114–117]. Finally, DNA barcodes have been retrieved from ancient DNA, delivering insights into the evolution and ecology of extinct organisms.
(c) Probing evolution
Although DNA barcoding was initiated to empower taxonomy, the assembly of sequence information for a particular gene region across diverse taxa creates a resource useful in evolutionary contexts. For example, patterns of sequence variation in the barcode region are an effective sentinel for shifts in the nucleotide composition of mitochondrial genomes and provide a rich source of data for investigating molecular evolutionary rates [121,122]. Because species coverage is so comprehensive, DNA barcoding can also make useful contributions to phylogenetic studies. Other potential applications await exploration. For example, expansion of each barcode record to include the entire sequence for COI or rbcL would deliver an unrivalled database for studying the evolutionary trajectories of these key proteins. It is important to emphasize the mutualism between DNA barcoding and studies which aim for deeper genomic characterization. For example, barcode analysis played an important role in verifying identifications for specimens analysed in the 1KITE initiative because transcriptomic analysis required specimens to be processed while alive, often making morphological identification impossible. Aside from this role in validating identifications, the DNA extracts resulting from barcode analysis represent a resource for a future when sequencing costs have declined enough to allow the assembly of a whole-genome sequence for every species.
(d) Applying DNA barcodes
Because it facilitates specimen identifications, DNA barcoding has gained adoption in diverse applied contexts. It is, for example, now widely used to identify agricultural and forestry pests and pathogens [61,125,126], to detect invasive species and for environmental impact assessments. It has also become the standard method for suppressing marketplace fraud and for deterring trade in endangered wildlife. In addition, it is gaining use in forensic contexts and in preventing illegal timber harvest. Finally, DNA barcoding has proven a superb vehicle for exposing students to the practice of science.
6. What next?
Given past progress, what goals might the DNA barcode community set for the next quarter century? The assembly of a DNA barcode reference library for all multi-cellular species will effectively write the encyclopedia of life. Moreover, by coupling the automation of specimen identifications with the power of HTS to screen massive numbers of individuals, barcoding will enable a future in which reading life is routine. A global network of stations provisioned with sequencers, computational hardware and autonomous samplers could track the shifting spectra of species in space and time, an Internet of living things, a world in which organisms act as transducers of biosphere change.
By completing the registry of all species by 2040, biodiversity science would deliver the foundation needed to track and forecast biotic change. Although this advance is within reach, it will require biodiversity science to join those disciplines that regard mega-science as everyday business. New structures, new alliances, and new leaders will be required to propel this transition. There is a critical need for action. More than half of all biodiversity hotspots have lost 90% of their vegetation, and the remnant patches are impacted by climate change. In fact, the least disturbed hotspot, the California Floristic Province, recently experienced its most severe drought in 1500 years. Habitat fragmentation is also increasing; 70% of global forests lie within 1 km of a road. These changes have lowered species abundances and have increased extinction rates. Because a sixth of all multi-cellular species may be extinct by the end of this century, there is an urgent need to complete the inventory of life and to use this information to track shifts in species abundances and distributions. Without interventions enabled by such knowledge, it is certain that endless forms most beautiful and most wonderful will be lost. This prospect is surely a call to arms.
P.D.N.H., P.M.H. and M.H. wrote the manuscript.
The authors declare no conflict of interests.
P.D.N.H. gratefully acknowledges the support of the Canada Research Chairs Program, the Canada Foundation for Innovation, NSERC and the Ontario Ministry of Research and Innovation. P.M.H. acknowledges funding from the Scottish Government's Rural and Environment Science and Analytical Services Division. M.H. acknowledges funding from Environment and Climate Change Canada.
We are grateful to Sarah Adamowicz, Karl Kjer, John La Salle, Scott Miller and Dirk Steinke for helpful comments on earlier versions of this article. We also thank Suzanne Bateson, Sujeevan Ratnasingham and Dirk Steinke for their aid in generating the figures. We are very thankful to Ann McCain Evans and Chris Evans for their generosity in defraying the Open Access charges for this special issue.
One contribution of 16 to a theme issue ‘From DNA barcodes to biomes’.
- Accepted May 24, 2016.
- © 2016 The Author(s)
Published by the Royal Society. All rights reserved.