In addition to their regular classes, most US high school students take and re-take a multitude of national standardized tests (and tests to practice for the tests) starting with the PSAT in 10th grade, then various SAT subject tests, AP tests, and the SAT or ACT achievement test (sometimes taken two or more times). Fortunately for students and their parents, this process usually comes to an end once they actually apply to college or university. As for DNA barcoding, after six years and 500,000 sequences from 50,000 species, it seems it is still in the midst of exams!
In Nov 2008 Mol Phylogenet Evol, in “a test of the DNA barcoding approach,” researchers from University of New Orleans, USA; University of Antioquia, Colombia; and Natural History Museum, London, analyze barcode-region COI sequences in a “hyperdiverse” genus (about 70 species) of butterflies endemic to Madagascar. They collected 109 specimens of 6 Heteropsis species, including 2 “undescribed species” and 1 species from a related genus. To confuse biological databases, Heteropsis is also a genus of flowering plants in the family Araceae.
As an aside, and I know this is a commonplace observation, there needs to be a way of mapping biodiversity that gets around having “described” and “undescribed” species. For one, many of the “undescribed” species that are the focus of biological study, including perhaps those in this paper, will never be formally described. As an analogy of an alternate approach, in astronomical science, a first step is creating detailed sky maps based on particular wavelengths of the electromagnetic spectrum. Such sky maps are “just data,” in this case recordings of radiation-emitting stellar objects. Then, based on study, astronomers label certain objects as quasars, for example. Of course, this “annotation” does not change the underlying data, and astronomers may later change the labels on some objects based on new information or new understanding.
Following the suggestions of others, I believe some sort of sequence-based map of species-level biodiversity is a necessary way forward. Like the sky map, sequences are “just data” (the “just data” also include collection location, date, voucher specimen, and photographs). Taxonomists would then annotate the “data map” with taxonomic interpretations, assigning species names to particular clusters for example. Species-level taxonomic revisions or conflicting taxonomies are easily accommodated: this simply involves re-labeling a cluster in the former case, or adding alternate names in the latter case. This sequence data map approach explicitly recognizes that species names are hypotheses.
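To make the “data map plus annotation” idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class names, field names, and placeholder values are hypothetical and do not come from any existing database or from the paper. The point it tries to show is that records are stored once, while names sit on top as labels that can be changed or added to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SpecimenRecord:
    """One observation in the 'just data' layer (all field names are hypothetical)."""
    voucher_id: str        # museum voucher identifier
    coi_sequence: str      # barcode-region COI sequence
    location: str          # collection locality
    collection_date: str   # e.g. an ISO date string


@dataclass
class SequenceCluster:
    """A group of similar sequences; names are annotations layered on top."""
    cluster_id: str
    records: List[SpecimenRecord] = field(default_factory=list)
    names: List[str] = field(default_factory=list)  # competing name hypotheses

    def add_name(self, name: str) -> None:
        """Conflicting taxonomies: keep alternate names side by side."""
        if name not in self.names:
            self.names.append(name)

    def relabel(self, old: str, new: str) -> None:
        """Taxonomic revision: swap a name; the underlying records are untouched."""
        self.names = [new if n == old else n for n in self.names]


# Placeholder data for illustration only, not taken from any real study.
record = SpecimenRecord("VOUCHER-001", "ACGTACGT", "Madagascar", "2008-01-01")
cluster = SequenceCluster("cluster-1", records=[record])
cluster.add_name("Heteropsis sp. A")                      # provisional label
cluster.relabel("Heteropsis sp. A", "Heteropsis sp. B")   # later revision
cluster.add_name("Genus2 sp. C")                          # an alternate opinion
```

Note that neither a revision nor a competing name touches `cluster.records`; only the list of name hypotheses changes.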
The present system is the inverse of the above: a taxonomic map (ie species names) is “annotated” with sequences. Under this system, there is no easy way to register biological information about organisms unless they have already been formally described as a species. Without a name or description of diagnostic characters, how does the next researcher know if they are studying the same “undescribed” species unless they examine the original specimens (in this case, stored in the Natural History Museum, London)? Under a sequence-map system, on the other hand, one could easily report biological findings (eg coloration, larval morphology, food plants) associated with a specimen and its barcode sequence.
Going back to the astronomical analogy, barcode-region COI is the appropriate “wavelength” for the species-level map of animals. This map will not be perfect. Just as gravitational lensing distorts the positions of some stellar objects, and others are obscured by intergalactic dust, the COI wavelength map will mislead in some areas and be obscured in others, not enabling one to “see” the existence of certain species–e.g. corals with slow mtDNA sequence evolution.
Would this be “DNA taxonomy?” No. First, community standards would ensure that the sequence map is not the arbiter of species status. Just as there are morphologically cryptic species and others that are phenotypically diverse, what we recognize as distinct species might be “hidden” within a single sequence cluster, and on the other hand, some named species might comprise a set of more distantly related sequences. Thus there would not be a fixed numerical determinant (eg distance, characters) of what constitutes a species cluster. Second, the COI-wavelength map would not establish higher-level relationships. Of course the shape and distinctness (or lack thereof) of clusters will change as new sequence data becomes available, as well as the taxonomic annotation, but that is the nature of biological diversity–we just don’t know everything yet!
Why bother? Taking a sequence-mapping approach, I believe one can accelerate exploration of biodiversity and harness efforts of those outside the taxonomic priesthood. For example, one can predict much of the next ten years of species- and genus-level revisions in avian taxonomy simply on the basis of currently available COI and other mitochondrial DNA data (sequence plus specimen data). I suggest collating and disseminating the available data in a publicly accessible form. This might even help harness “citizen science” by encouraging submission of birds that died of natural causes, or feathers naturally shed or collected in banding/ringing operations (or barcodes of feathers for those with resources and access to sequencing facilities), along with date, GPS coordinates, and digital photo. If so, then legions of devoted birders could help with creating the genetic map, as they are already doing with observational records (see eBird). Because collecting the sequences (and specimen-associated data) that establish the map is separate from the taxonomic process of “naming,” this would not devolve into taxonomic chaos. Rather, like amateur astronomers, citizens could contribute to the observational database on which the sequence map is built. One utility that is needed is an easy graphical interface that collates available mtDNA data (on birds, for example) and highlights areas where information is missing either taxonomically or geographically; this sort of display would likely be of interest both to scientists and scientifically-minded amateurs.
Finally, going further out on this limb, depending on community standards, there might be agreement to consider a sufficiently divergent cluster a new species, until proven otherwise by more biological data. So a specimen plus a sequence could potentially be a “described species.” I prefer keeping the everyday designation of “species” rather than, for example, molecular operational taxonomic units (MOTU) or evolutionarily significant units (ESU), but that is a discussion for another time!
Back to the paper. Linares and colleagues found that all 6 Heteropsis spp (including the 2 “undescribed” species) were evolutionarily distinct (ie formed reciprocally monophyletic lineages in Maximum Likelihood and Bayesian analysis of barcode-region COI), and that the mtDNA phylogeny was corroborated by nuclear DNA sequences. Given the large distances among species and small distances within species, a neighbor-joining tree would likely have shown the same species clusters (although not necessarily the same branching pattern; the COI sequences do not appear to be public on GenBank yet, so I could not try NJ analysis). Unsurprisingly, one species pair showed less than “10X distance” (ie interspecies distance less than 10 times the average intra-species distance). Most barcode studies that include multiple congeneric species have sister species pairs that fall below this threshold. The results were initially confounded by amplification of DNA from Wolbachia (an intracellular parasitic bacterium of insects), leading the authors to design alternate primers. Wolbachia is unevenly distributed in tissues and often concentrated in the reproductive tract, so perhaps the use of abdominal segments for DNA extraction is part of the reason this was a problem.
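For readers who want to see the “10X distance” check spelled out, here is a small Python sketch. It assumes simple uncorrected p-distances (the proportion of differing sites between two aligned sequences); barcode studies often use a corrected distance model instead, and the paper's own method may differ. The function names and the numbers are made up for illustration and are not results from the paper.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of sites that differ between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def passes_10x(intra_distances, inter_distance, factor=10.0):
    """True if the between-species distance is at least `factor` times
    the average within-species distance."""
    return inter_distance >= factor * mean(intra_distances)


# Made-up distances for illustration only.
intra = [0.002, 0.004, 0.003]            # within-species distances, average ~0.003
close_pair = passes_10x(intra, 0.025)    # 0.025 is below 10 x 0.003
distant_pair = passes_10x(intra, 0.06)   # 0.06 is above 10 x 0.003
```

With these made-up numbers, `close_pair` is `False` and `distant_pair` is `True`: a young sister-species pair can form a perfectly distinct cluster and still fall short of the 10X rule of thumb.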