Our close colleague Paul Waggoner’s paper “Using the Forest Identity to grasp and comprehend the swelling mass of forest statistics” appeared in the December 2008 issue of the International Forestry Review. The paper compactly explains some of the strengths and limits of the Forest Identity.
News
DNA barcoding re-tested in Madagascar butterflies
In addition to their regular classes, most US high school students take and re-take a multitude of national standardized tests (and tests to practice for the tests) starting with the PSAT in 10th grade, then various SAT subject tests, AP tests, and the SAT or ACT achievement test (sometimes taken two or more times). Fortunately for students and their parents, this process usually comes to an end once they actually apply to college or university. For DNA barcoding, after six years and 500,000 sequences from 50,000 species, it seems it is still in the midst of exams!
In Nov 2008 Mol Phylogenet Evol, in “a test of the DNA barcoding approach,” researchers from University of New Orleans, USA; University of Antioquia, Colombia; and Natural History Museum, London, analyze barcode region COI sequences in a “hyperdiverse” genus (about 70 species) of butterflies endemic to Madagascar. They collected 109 specimens of 6 Heteropsis species, including 2 “undescribed species” and 1 species from a related genus. To confuse biological databases, Heteropsis is also a genus of flowering plant in family Araceae.
As an aside, and I know this is a commonplace observation, there needs to be a way of mapping biodiversity that gets around having “described” and “undescribed” species. For one, many of the “undescribed” species that are the focus of biological study, including perhaps those in this paper, will never be formally described. As an analogy of an alternate approach, in astronomical science, a first step is creating detailed sky maps based on particular wavelengths of the electromagnetic spectrum. Such sky maps are “just data,” in this case recordings of radiation-emitting stellar objects. Then, based on study, astronomers label certain objects as quasars, for example. Of course, this “annotation” does not change the underlying data, and astronomers may later change the labels on some objects based on new information or new understanding.
Following the suggestions of others, I believe some sort of sequence-based map of species-level biodiversity is a necessary way forward. Like the sky map, sequences are “just data” (the “just data” also include collection location, date, voucher specimen, and photographs). Taxonomists would then annotate the “data map” with taxonomic interpretations, assigning species names to particular clusters for example. Species-level taxonomic revisions or conflicting taxonomies are easily accommodated: this simply involves re-labeling a cluster in the former case, or adding alternate names in the latter case. This sequence data map approach explicitly recognizes that species names are hypotheses.
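To make the "data first, names as annotations" idea concrete, here is a minimal Python sketch. Everything in it (the class names SpecimenRecord and SequenceMap, the field names, the cluster identifiers) is a hypothetical illustration, not an existing database design. The point it tries to show is that re-labeling a cluster or adding an alternate name never touches the underlying records.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpecimenRecord:
    """'Just data': a sequence plus its collection details (hypothetical fields)."""
    voucher: str
    sequence: str
    locality: str
    date: str


@dataclass
class SequenceMap:
    """Clusters of records, with names stored separately as annotations."""
    clusters: dict = field(default_factory=dict)  # cluster id -> list of records
    names: dict = field(default_factory=dict)     # cluster id -> list of proposed names

    def add_record(self, cluster_id, record):
        self.clusters.setdefault(cluster_id, []).append(record)

    def annotate(self, cluster_id, name):
        """Add a name (a taxonomic hypothesis) alongside any existing ones."""
        labels = self.names.setdefault(cluster_id, [])
        if name not in labels:
            labels.append(name)

    def relabel(self, cluster_id, name):
        """Taxonomic revision: replace the names, leave the data alone."""
        self.names[cluster_id] = [name]


# Toy usage with made-up values
seq_map = SequenceMap()
seq_map.add_record("cluster-1", SpecimenRecord("V001", "ACGTAC", "site A", "2008-11-01"))
seq_map.annotate("cluster-1", "Genus sp. 1")
seq_map.annotate("cluster-1", "Genus alternata")  # conflicting taxonomy: both names kept
seq_map.relabel("cluster-1", "Genus nova")        # revision: records are unchanged
```

Keeping names in a separate mapping is one simple way to express that species names are hypotheses layered on top of the data.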
The present system is the inverse of the above: a taxonomic map (ie species names) is “annotated” with sequences. Under this system, there is no easy way to register biological information about organisms unless they have been already formally described as a species. Without a name or description of diagnostic characters, how does the next researcher know if they are studying the same “undescribed” species unless they examine the original specimens (in this case, stored in Natural History Museum, London)? On the other hand, one could easily report biological findings (eg coloration, larval morphology, food plants) associated with a specimen and its barcode sequence.
Going back to the astronomical analogy, barcode-region COI is the appropriate “wavelength” for the species-level map of animals. This map will not be perfect. Just as gravitational lensing distorts the positions of some stellar objects, and others are obscured by intergalactic dust, the COI wavelength map will mislead in some areas and be obscured in others, not enabling one to “see” the existence of certain species–e.g. corals with slow mtDNA sequence evolution.
Would this be “DNA taxonomy?” No. First, community standards would ensure that the sequence map is not the arbiter of species status. Just as there are morphologically cryptic species and others that are phenotypically diverse, what we recognize as distinct species might be “hidden” within a single sequence cluster, and on the other hand, some named species might comprise a set of more distantly related sequences. Thus there would not be a fixed numerical determinant (eg distance, characters) of what constitutes a species cluster. Second, the COI-wavelength map would not establish higher-level relationships. Of course the shape and distinctness (or lack thereof) of clusters will change as new sequence data becomes available, as well as the taxonomic annotation, but that is the nature of biological diversity–we just don’t know everything yet!
Why bother? Taking a sequence-mapping approach, I believe one can accelerate exploration of biodiversity and harness efforts of those outside the taxonomic priesthood. For example, one can predict much of the next ten years of species- and genus-level revisions in avian taxonomy simply on the basis of currently available COI and other mitochondrial DNA data (sequence plus specimen data). I suggest collating and disseminating the available data in a publicly accessible form. This might even help harness “citizen science” by encouraging submission of birds that died of natural causes, or feathers naturally shed or collected in banding/ringing operations (or barcodes of feathers for those with resources and access to sequencing facilities), along with date, GPS coordinates, and digital photo. If so, then legions of devoted birders could help with creating the genetic map, as they are already doing with observational records (see eBird). Because collecting the sequences (and specimen-associated data) that establish the map is separate from the taxonomic process of “naming,” this would not devolve into taxonomic chaos; rather, like amateur astronomers, citizens could contribute to the observational database on which the sequence map is built. One utility that is needed is an easy graphical interface that collates available mtDNA on birds, for example, and highlights areas where information is missing either taxonomically or geographically; this sort of display would likely be of interest both to scientists and scientifically-minded amateurs.
Finally, going further out on this limb, depending on community standards, there might be agreement to consider a sufficiently divergent cluster a new species, until proven otherwise by more biological data. So a specimen plus a sequence could potentially be a “described species.” I prefer keeping the everyday designation of “species” rather than, for example, molecular operational taxonomic units (MOTU) or evolutionarily significant units (ESU), but that is a discussion for another time!
Back to the paper. Linares and colleagues found that all 6 Heteropsis spp (including the 2 “undescribed” species) were evolutionarily distinct (ie formed reciprocally monophyletic lineages in Maximum Likelihood and Bayesian analysis of barcode-region COI), and that the mtDNA phylogeny was corroborated by nuclear DNA sequences. Given the large distances among and small distances within species, a neighbor-joining tree would likely have shown the same species clusters (although not necessarily the same branching pattern; the COI sequences do not appear to be public on GenBank yet, so I could not try NJ analysis). Unsurprisingly, one species pair showed less than “10X distance” (ie interspecies distance less than 10 times the average intra-species distance). Most barcode studies that include multiple congeneric species have sister species pairs that fall below this threshold. The results were initially confounded by amplification of DNA from Wolbachia (an intracellular parasitic bacterium of insects), leading the authors to design alternate primers. Wolbachia is unevenly distributed in tissues and often concentrated in the reproductive tract, so perhaps the use of abdominal segments for DNA extraction is part of the reason this was a problem.
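For readers who want to try the "10X" screen on their own data, here is a minimal Python sketch. It assumes pairwise distances have already been computed, and that "interspecies distance" means one summary value per species pair. The function name below_10x and the input layout are illustrative assumptions, not anything taken from Linares and colleagues, and the numbers are toy values.

```python
def below_10x(intra, inter, factor=10.0):
    """Return species pairs whose distance falls under factor x mean intraspecific distance.

    intra: dict mapping species -> list of within-species pairwise distances
    inter: dict mapping (species_a, species_b) -> between-species distance
    """
    all_intra = [d for dists in intra.values() for d in dists]
    if not all_intra:
        raise ValueError("need at least one within-species distance")
    threshold = factor * (sum(all_intra) / len(all_intra))
    return sorted(pair for pair, d in inter.items() if d < threshold)


# Toy values (proportions of differing sites), invented for illustration
intra = {"sp A": [0.002, 0.004], "sp B": [0.003], "sp C": [0.003]}
inter = {("sp A", "sp B"): 0.05, ("sp A", "sp C"): 0.06, ("sp B", "sp C"): 0.02}

print(below_10x(intra, inter))  # pairs that fall under the 10X screen
```

A pair flagged this way is not necessarily a single species; as noted above, many sister species pairs fall below the threshold.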
Jesse visits Niceville High School in FL
The students of Niceville High School in the Florida Panhandle regularly contribute valuable samples from the sandy bottom of their beautiful shoreline to the near-shore (NaGISA) field project of the Census of Marine Life. On 23 January, Jesse had the privilege of visiting with the students and their exceptional instructor Rick Hernandez. The Okaloosa County School district reported the visit.
Distances and characters
Almost 4 years ago, in October 2005 Philos Trans R Soc Lond B Biol Sci, researchers from American Museum of Natural History examined the then nascent DNA barcoding effort, looking at what methods were best for integrating the growing pool of DNA barcode data into systematics, the science of classifying organisms based on evolutionary history. Using real-world examples, authors DeSalle, Egan, and Siddall argued strongly for “characters” and against “distances” when using DNA barcode data to identify species, ie assigning specimens to known species and discovering new species. Of course, sequence data was already the backbone of modern systematics but it had primarily been applied to reconstructing evolutionary branching patterns (eg what pattern of divergences led to the various orders of birds) and less so to the definition of species. For example, most phylogenetic work included single exemplars of each species. Analyzing sequence differences among and within closely-related species was more the domain of phylogeography, which generally did not explicitly aim to define new species.
Here a brief aside. In analyzing sequences, “characters” refer to specific nucleotides (eg guanine (G) at position 138 in COI gene) and “distances” refer to per cent differences between sequences. So right away you can see that “characters” are intrinsic to the specimen’s DNA, whereas distances are defined only in relation to sequences from other specimens. Systematists like characters; for one, this enables integrating sequence and morphologic data. Characters are the grist for the computational workhorses of systematics, Parsimony and Maximum Likelihood. Meanwhile, beginning with the first paper published in 2003, distances displayed in neighbor-joining trees have been the usual heuristic approach for analyzing DNA barcode differences among and within species. A crucial advantage of neighbor-joining distance analysis is speed. Creating a NJ distance tree from 1000 648 bp barcode sequences might take a minute on a desktop computer whereas Maximum Likelihood reconstruction might take several weeks. Unlike reconstructing the Tree of Life, DNA barcoding is a recurrent exercise that repeatedly involves submitting new data from multiple known and unknown specimens, so a fast analytic method is essential.
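The distinction between the two can be sketched in a few lines of Python. This is only an illustration on short made-up sequences; the function names are hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
def character_at(seq, position):
    """A 'character': the nucleotide at a 1-based position in one specimen's sequence."""
    return seq[position - 1]


def percent_difference(a, b):
    """A 'distance': per cent of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(x != y for x, y in zip(a, b))
    return 100.0 * diffs / len(a)


specimen = "ACGTACGTAC"   # toy sequence, not real COI
reference = "ACGTACGAAC"  # differs from the specimen at position 8

print(character_at(specimen, 8))                # a property of the specimen alone
print(percent_difference(specimen, reference))  # defined only relative to another sequence
```

Note that character_at needs only the one sequence, whereas percent_difference cannot be computed without a second sequence to compare against.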
Four years later, where are we? Most DNA barcoding analyses continue to rely on NJ distance trees, and this approach has proven to be a durable heuristic, enabling one to distinguish among most species analyzed so far. Regarding species discovery, NJ distance trees demonstrate continued value as a first step in flagging divergent lineages that may represent new species. Here there is something of a roadblock, in that defining new species is a human judgement, sort of like a medical diagnosis, while sequence differences are like medical laboratory results. Community standards do not accept divergent mtDNA sequences as sufficient evidence to define a new species, although at the same time it is generally acknowledged that such sequences do indicate a new species, albeit one that hasn’t been officially defined yet. For example, in a Nov 2008 news item researchers confidently assert “DNA tests identify new dolphin species,” (based on a published article in Nov 2008 Mol Phylogenet Evol), yet include the statement “it is awaiting a scientific name after a formal description.” I expect the researchers knew they had a new species with the first mtDNA sequence from a single individual! For the DNA barcoding effort it should not be necessary to wait for final taxonomic decisions; we can proceed with publicly disseminating a broad-range, fine-scale map of biodiversity, which can then be annotated with taxonomic information as it arrives. Like sky surveys and the human genome project, we should aim to make the “barcode biodiversity map” public as quickly as possible.
On the other side, it is now a commonplace observation that a 10X threshold (10 times the average intraspecific variation) is NOT a universal dividing line between intra- and inter-specific variation. To get technical, this was originally proposed as a screen for new species, but it has been taken as a dividing line between intra- and inter-specific distances, which it certainly is not; in the original 2004 paper (I am co-author) there are many sister species separated by distances less than the threshold. It has been a useful rhetorical target so maybe this issue won’t disappear just yet.
On the character front, there are more publications defining discriminatory DNA barcode characters (eg Tavares and Baker 9 March 2008 BMC Evol Biol). It seems obvious to me that if, as is usually the case, sister species show large differences among and small differences within, then there must be diagnostic characters that distinguish them. The process of “translating” distances into characters should perhaps be a standard practice for nearest neighbor taxa in NJ trees; this would certainly give confidence (or not) as to whether one can reliably distinguish those species with less than 1% sequence difference. There is exciting development in character-based software tools (eg Ahrens et al 2007, Rosenberg 2007, Abdo and Golding 2007, Munch et al 2008) aimed at distinguishing the leaves (ie species) in addition to those already available for reconstructing the branches on the Tree of Life. I look forward to one that is friendly for non-specialists and works speedily on desktops!
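As a rough idea of what "translating" distances into characters could look like for a pair of nearest-neighbor species, here is a minimal Python sketch. It is not the method of any of the tools cited above; the function name and the toy sequences are assumptions for illustration. It simply lists aligned positions where each species is fixed for one nucleotide and the two species differ.

```python
def diagnostic_positions(group_a, group_b):
    """List (position, base_in_a, base_in_b) for sites fixed within each group but different between them.

    group_a, group_b: lists of aligned, equal-length sequences, one list per species.
    Positions are 1-based.
    """
    sequences = list(group_a) + list(group_b)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must be aligned to the same length")

    found = []
    for i in range(length):
        bases_a = {s[i] for s in group_a}
        bases_b = {s[i] for s in group_b}
        # Fixed within each species, and the fixed bases differ between species
        if len(bases_a) == 1 and len(bases_b) == 1 and bases_a != bases_b:
            found.append((i + 1, next(iter(bases_a)), next(iter(bases_b))))
    return found


# Toy sequences, invented for illustration
species_1 = ["ACGTAC", "ACGTAT"]
species_2 = ["ACCTAC", "ACCTAG"]
print(diagnostic_positions(species_1, species_2))
```

With few individuals per species, a site can look diagnostic by chance, so the list is only as trustworthy as the sampling behind it.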
Soy el arbol!!
In Chile in early January we visited some of the world’s largest tree plantations as well as the wondrous intact indigenous forest of Isla Mocha. Thanks to Savithri Narayanan for her photo of this sign “Soy el arbol!!” on a tree in a Chilean park with its true and poetic message (translated into English by Jesse).
“A reliable, consistent, and democratic tool for species discrimination”
Human filariasis, caused by various species of insect-transmitted parasitic nematodes, affects more than 120 million persons in Africa, South America, and Southeast Asia, and includes elephantiasis and river blindness. In 7 January 2009 Frontiers Zool, 10 researchers from 5 institutions in Italy, France, Japan, and Venezuela apply DNA barcoding and traditional morphologic taxonomy to the identification of parasitic filarioid worms. According to the authors, a molecular tool for identification of filariae is a “desirable goal for many reasons” including “parasites conferred to diagnostic laboratories are often of poor quality due to the difficult[y] of sampling adults and undamaged organisms,” as a “method for the identification of filarioid nematodes in vectors,” and “nematode biodiversity is still highly underestimated both at the morphological and molecular level.”
Ferri and colleagues analyze diagnostic utility of 12S and barcode-region COI sequences and morphologic examination by experts to an assemblage of data from 165 individual specimens (73 newly analyzed for this study) representing about 60 species. Their data set encompasses most of the important human and animal filarioid parasites, including Wuchereria bancrofti and Brugia malayi, agents of human tropical elephantiasis, Loa loa (human ocular filariasis), Onchocerca volvulus (human river blindness), and Dirofilaria immitis (dog and cat heartworm), plus specimens recovered from wild animals ranging from bats to toads.
The authors applied a medical test approach to the sequence data, looking at which distance cutoffs produced “minimum cumulative error,” in which they include type I false positive (failure to assign to correct species, analogous to oversplitting) and type II false negative (failure to distinguish between species; analogous to lumping). I find their approach refreshing in that it recognizes the uncertainty inherent in any identification method. Even “gold standard” tests have error rates. Just as a medical laboratory considers a range of factors when adopting a new test method (cost, speed, sensitivity, accuracy, replicability, and training requirements, for example), we might usefully look at methods for species identification, including traditional morphologic techniques, in a similar way. In taking such an approach, we can recognize there are often marked differences between the methods we use to detect something and the methods used to define it.
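Here is a minimal Python sketch of how a "minimum cumulative error" cutoff could be found. It is my reading of the idea, not the authors' actual procedure: each pairwise comparison is assumed to be labeled as within-species or between-species, a within-species pair above the cutoff counts as a type I error (oversplitting), and a between-species pair at or below the cutoff counts as a type II error (lumping). Function names and numbers are hypothetical.

```python
def cumulative_error(pairs, threshold):
    """Fraction of comparisons misclassified at a given distance cutoff.

    pairs: list of (distance, same_species) tuples, same_species being True or False.
    """
    type_1 = sum(1 for d, same in pairs if same and d > threshold)       # oversplitting
    type_2 = sum(1 for d, same in pairs if not same and d <= threshold)  # lumping
    return (type_1 + type_2) / len(pairs)


def best_threshold(pairs, candidates):
    """Candidate cutoff with the lowest cumulative error (first one wins on ties)."""
    return min(candidates, key=lambda t: cumulative_error(pairs, t))


# Toy distances in per cent, invented for illustration
pairs = [
    (0.5, True), (1.0, True), (2.0, True), (4.0, True),
    (3.0, False), (8.0, False), (9.0, False), (12.0, False),
]
candidates = [1.5, 4.8, 10.0]

best = best_threshold(pairs, candidates)
print(best, cumulative_error(pairs, best))
```

Lowering the cutoff trades lumping errors for splitting errors and vice versa, which is why the two are summed rather than minimized separately.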
As a medical testing example, automated systems for rapid detection of bacteria in blood cultures rely on monitoring pressure changes in headspace gas in liquid culture bottles, as growing bacteria consume or produce gases. At the same time we do not define bacteria as “organisms that produce pressure changes in laboratory culture bottles,” for example. Similarly, percent differences between nucleotide sequences of the test specimen and those in a reference library might be a rapid way to “detect” a species, but this does not mean these are a defining characteristic of a species. We recognize species conceptually as independent evolutionary lineages, and practically on the basis of discriminatory characters (eg morphologic, behavioral, or nucleotide substitutions at specific sites). In the day-to-day work of specimen identification and detection of new species however, sequence distances may work just fine as diagnostic signatures.
Back to the article. Ferri and colleagues report COI worked better than 12S as a diagnostic, primarily due to difficulty in finding a consistent algorithm for aligning 12S sequences. With COI, the minimum cumulative error was 0.62% at a K2P distance threshold of 4.8%. The errors were due to low interspecific distances between 2 congeneric pairs [Onchocerca volvulus (human host) and O. ochengi (cattle); Cercopithifilaria longa (Japanese serow, a goat-antelope) and C. bulboidea (Sika deer); might some of the morphologic differences between these species pairs represent phenotypic changes induced by the different hosts?]. More sampling within species will help determine if it is possible to molecularly discriminate among these species using a character- rather than distance-based method.
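For those unfamiliar with K2P (Kimura 2-parameter) distance: it corrects the raw proportion of differing sites for multiple substitutions at the same site, treating transitions (A<->G, C<->T) and transversions separately. With P the proportion of sites showing a transition and Q the proportion showing a transversion, the distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The Python sketch below is a bare-bones illustration, assuming aligned sequences of equal length containing only A, C, G, and T (no gaps or ambiguity codes); the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences (as a proportion)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    p = transitions / len(a)
    q = transversions / len(a)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy 100 bp sequences: 4 transitions and 1 transversion, so 5% raw difference
seq_1 = "A" * 100
seq_2 = "G" * 4 + "C" + "A" * 95
print(100 * k2p_distance(seq_1, seq_2))  # per cent; slightly above the raw 5%
```

For closely related sequences the correction is small, so K2P and raw per cent difference are nearly the same at the distances typical within species.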
The authors call for an integrated taxonomic approach to solve discrepancies between morphologic and molecular methods, and conclude “we propose DNA barcoding as a reliable, consistent, and democratic tool for species discrimination in routine identification of parasitic nematodes.”
DNA speeds discovery of overlooked species
Just as new telescopes reveal previously hidden details of the universe, genetic surveys regularly reveal previously hidden (aka cryptic) species. Of course these species are cryptic only in the sense that morphological analysis is not the right tool to “see” them with. To my ear the word “cryptic” suggests camouflaged organisms that blend in with the environment, such as the Dead leaf butterfly Kallima inachus. Unlike camouflage, which is presumably a protective adaptation, it is my impression there is nothing biologically special about morphologic crypsis except for the difficulty we have in recognizing it; that is, what we call cryptic species exhibit the same sorts of distinct ecological and behavioral adaptations found in those whose differences are more visible to the human eye.
To restate the above, when multiple individuals are examined for gene(s) that reflect species-level differences (this is the essence of DNA barcoding), many animal and at least some plant species are discovered to be comprised of two or more genetic clusters, each carrying diagnostic nucleotide substitutions. When appropriate analytic tools are applied, these within-species clusters are often found to be reciprocally monophyletic lineages that have been reproductively isolated for hundreds of thousands to millions of years. In studies where the painstaking work of natural history observation has been carried out, these genetic clusters usually show ecological and behavioral differences and sometimes previously overlooked morphological distinctions, consistent with species-level status. In short, DNA analysis speeds discovery of new species. In many cases, it reveals species that would otherwise probably remain unrecognized indefinitely.
The premise of DNA barcoding is that a very short segment (ie for animals 648 bp COI barcode region) is usually sufficient to screen for new species and to assign specimens to known species. Of course, more sequencing is always of interest, but the added discriminatory value for detecting species-level differences is small compared to the added cost. Moving backwards in evolutionary time, a neighbor joining tree constructed with 648 bp barcode sequences often groups genera and families correctly; however it generally does not contain enough information to establish branching order or uncover deeper-level associations that are the heart of phylogenetic study, so there is plenty for systematists to do.
Now for some data. In 25 December 2008 Mol Phylogenet Evol researchers from University of Gothenburg and University of Florida report on Lumbriculus variegatus Müller, 1774, a segmented freshwater worm widely distributed in Europe and North America, commonly used as a model laboratory organism and in environmental toxicology, and sold as pet food for fish and amphibians under the name “blackworm.” Part of the laboratory interest in L. variegatus lies in its remarkable ability to re-generate after fragmentation; any of the approximately 200 segments can re-form a complete adult worm; most populations reproduce through auto-fragmentation. Given that L. variegatus is a common, widely-distributed organism described over 200 years ago and is regularly used in scientific study, one might not expect any taxonomic surprises.
Gustafsson and colleagues were initially studying a neuropeptide gene FMRFamide using L. variegatus purchased from a commercial supplier in California, with puzzling results suggesting polyploidy with multiple gene copies. This led them to further characterize approximately 50 individuals collected at multiple sites in Europe and North America. Sequencing of COI, 16S, and ITS sorted the specimens into 2 phylogenetically distinct (maximum parsimony and Bayesian analysis) clades with 17% mean difference in COI, with the same genetic structure in mitochondrial COI/16S as nuclear ITS. Both clades were found in North America and Europe, sometimes at the same site. The authors conclude “it thus seems reasonable to regard these two main lineages within the L. variegatus complex as different species, regardless of which species concept one adheres to.” Of course, it may be they have rediscovered a named species; they caution that more study needs to be done including sampling the other named species in genus Lumbriculus (see EOL page).
DNA barcoding is an efficient instrument for revealing species-level differences. Routine application of DNA barcoding can enhance quality control in work with model organisms, cell lines, and collected specimens, and the long-term value of species descriptions.
WSJ Tucker
An opinion piece in the 29 December 2008 Wall Street Journal by William Tucker quotes Jesse’s low view of investments in so-called renewable energy sources from this interview in Weltwoche magazine.
Best wishes for 2009
The past year with the Barcode Blog has been exciting and challenging. Looking forward to 2009!
Mark Stoeckle
Program for the Human Environment
The Rockefeller University
Plant specialists work towards standardization
In 26 November 2008 Mol Ecol researchers from University of British Columbia report on a meeting of 1200 plant specialists, entitled “Botany without Borders”, held on the campus in July 2008, which brought together the annual meetings of Botanical Society of America, the Canadian Botanical Association/L’Association Botanique du Canada, American Fern Society, and American Society of Plant Taxonomists. According to authors Kane and Cronk, DNA barcoding was a recurring theme of presentations and posters.
Plants continue to challenge a standardized approach to species identification using short DNA sequences from a uniform location on the genome, aka DNA barcoding. Genetic divergences among lineages make it difficult to design broad-range primers that amplify a desired target region across the diversity of plants and, at the same time, sequence differences among closely-related plant species are generally an order of magnitude fewer than those among animals, with the result that short sequences are often inadequate to assign specimens to species. Looking beyond these difficulties, the potential societal and scientific value of a standardized genetic identification method for plants is enormous. For one example cited in the meeting report, wild nutmeg trees of the genus Compsoneura can be identified by examining the tiny flowers on male trees, but trees are usually not in flower and female trees always lack these distinguishing characters. (It is remarkable that something as large as a tree can sometimes not be identified even by specialists!) In one study (Newmaster, Mol Ecol Notes 2007), a DNA barcoding approach using 2 short plastid sequences enabled identification of 94.7% of samples to species, compared to 40% using field characters. A standardized DNA-based approach should be a big boost to soil science by enabling the underground parts of plants, ie roots, to be readily named (Ridgway, BMC Ecol 2003).
The authors conclude “DNA barcoding in plants is clearly here to stay and there is consequently an urgent need to rise to the scientific challenges it presents.” Some of those scientific challenges are explored in November 2008 Taxon by researchers from National Museum of Natural History, Washington, D.C., and National Center for Biotechnology Information, Bethesda, Maryland. Erickson and colleagues lay out a set of standard approaches to quantifying DNA barcoding success in plants.
The authors state “PCR amplification must be the primary criterion for selecting a DNA barcode,” i.e. the chosen region should have the best rate of successful amplification across the diversity of plants. They suggest 90% or greater rate of recovery as a guideline. Second, they suggest each or any additional markers should improve PCR success by reducing the number of non-recovered PCRs by 50% and improve identification by at least 10%, using a parameter they call “probability of correct identification (PCI),” which is defined pretty much as it sounds. Applying this statistic to existing plant studies indicates the best results are with 2 plastid barcodes in which case PCI approaches an average of 90%, which of course includes much lower rates among some groups. Nonetheless, in local flora successful identification to species level may often approach 100%, because closely-related congeneric species are not present. The effort to establish a standardized genetic library of DNA barcodes for world’s plants is moving ahead.
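The guidelines for adding a second marker can be written down as a simple check. The Python sketch below is only my reading of the criteria as summarized above, with two loudly stated assumptions: "reducing the number of non-recovered PCRs by 50%" is taken to mean failures drop to half or fewer, and "improve identification by at least 10%" is taken to mean a gain of 10 percentage points in PCI. The function name, parameter names, and numbers are hypothetical.

```python
def marker_worth_adding(failed_before, failed_after, pci_before, pci_after):
    """Check an additional marker against the two guidelines (assumed interpretation).

    failed_before, failed_after: counts of samples with no PCR product recovered
    pci_before, pci_after: probability of correct identification, in per cent
    """
    halves_failures = failed_after <= 0.5 * failed_before
    # ASSUMPTION: the 10% improvement is read as 10 percentage points
    improves_pci = (pci_after - pci_before) >= 10.0
    return halves_failures and improves_pci


# Toy numbers, invented for illustration
print(marker_worth_adding(failed_before=20, failed_after=8, pci_before=78.0, pci_after=90.0))
print(marker_worth_adding(failed_before=20, failed_after=15, pci_before=85.0, pci_after=90.0))
```

If the 10% is instead meant as a relative improvement, the second comparison would change to a ratio test.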