The Barcode Blog

A mostly scientific blog about short DNA sequences for species identification and discovery. I encourage your commentary. -- Mark Stoeckle

Distances and characters

Almost 4 years ago, in the October 2005 Philos Trans R Soc Lond B Biol Sci, researchers from the American Museum of Natural History examined the then-nascent DNA barcoding effort, asking which methods were best for integrating the growing pool of DNA barcode data into systematics, the science of classifying organisms based on evolutionary history. Using real-world examples, authors DeSalle, Egan, and Siddall argued strongly for “characters” and against “distances” when using DNA barcode data to identify species, ie assigning specimens to known species and discovering new species. Of course, sequence data was already the backbone of modern systematics, but it had primarily been applied to reconstructing evolutionary branching patterns (eg what pattern of divergences led to the various orders of birds) and less so to the definition of species. For example, most phylogenetic work included single exemplars of each species. Analyzing sequence differences among and within closely related species was more the domain of phylogeography, which generally did not explicitly aim to define new species.

Here, a brief aside. In analyzing sequences, “characters” refer to specific nucleotides (eg guanine (G) at position 138 in the COI gene) and “distances” refer to percent differences between sequences. So right away you can see that characters are intrinsic to the specimen’s DNA, whereas distances are defined only in relation to sequences from other specimens. Systematists like characters; for one, this enables integrating sequence and morphologic data. Characters are the grist for the computational workhorses of systematics, Parsimony and Maximum Likelihood. Meanwhile, beginning with the first paper published in 2003, distances displayed in neighbor-joining (NJ) trees have been the usual heuristic approach for analyzing DNA barcode differences among and within species. A crucial advantage of neighbor-joining distance analysis is speed. Creating an NJ distance tree from 1000 barcode sequences of 648 bp each might take a minute on a desktop computer, whereas Maximum Likelihood reconstruction might take several weeks. Unlike reconstructing the Tree of Life, DNA barcoding is a recurrent exercise that repeatedly involves submitting new data from multiple known and unknown specimens, so a fast analytic method is essential.
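To make the aside concrete, here is a minimal Python sketch of the two ideas. It assumes the sequences are already aligned to the same length, and it uses the simplest possible distance (uncorrected percent difference), which is not necessarily the distance model used in any published barcode analysis. The function names and the 10-letter toy sequences are invented for illustration.

```python
def character_at(seq, position):
    """Return the nucleotide at a 1-based alignment position.

    A character is a property of one specimen's sequence by itself.
    """
    return seq[position - 1]


def percent_difference(seq_a, seq_b):
    """Uncorrected percent difference between two aligned sequences.

    A distance only exists for a pair of specimens. Positions where
    either sequence has something other than A, C, G or T (gaps,
    ambiguity codes) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Toy aligned sequences (hypothetical, far shorter than a 648 bp barcode).
specimen_1 = "ACGTACGTAC"
specimen_2 = "ACGTACGTAT"
specimen_3 = "ACGAACGTGT"

print(character_at(specimen_1, 4))                 # T
print(percent_difference(specimen_1, specimen_2))  # 10.0
print(percent_difference(specimen_1, specimen_3))  # 30.0
```

A neighbor-joining tree is built from the full table of such pairwise distances, which is part of why it is fast: each pair of sequences is reduced to a single number.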

Four years later, where are we? Most DNA barcoding analyses continue to rely on NJ distance trees, and this approach has proven to be a durable heuristic, enabling one to distinguish among most species analyzed so far. Regarding species discovery, NJ distance trees demonstrate continued value as a first step in flagging divergent lineages that may represent new species. Here there is something of a roadblock, in that defining new species is a human judgement, sort of like a medical diagnosis, while sequence differences are like medical laboratory results. Community standards do not accept divergent mtDNA sequences as sufficient evidence to define a new species, although at the same time it is generally acknowledged that such sequences do indicate a new species, albeit one that hasn’t been officially defined yet. For example, in a Nov 2008 news item, researchers confidently assert “DNA tests identify new dolphin species” (based on a published article in Nov 2008 Mol Phylogenet Evol), yet include the statement “it is awaiting a scientific name after a formal description.” I expect the researchers knew they had a new species with the first mtDNA sequence from a single individual! For the DNA barcoding effort it should not be necessary to wait for final taxonomic decisions; we can proceed with publicly disseminating a broad-range, fine-scale map of biodiversity, which can then be annotated with taxonomic information as it arrives. Like sky surveys and the human genome project, we should aim to make the “barcode biodiversity map” public as quickly as possible.

On the other side, it is now a commonplace observation that a 10X threshold (10 times the average intraspecific variation) is NOT a universal dividing line between intra- and inter-specific variation. To get technical, this was originally proposed as a screen for new species, but it has been taken as a dividing line between intra- and inter-specific distances, which it certainly is not; in the original 2004 paper (I am a co-author) there are many sister species separated by distances less than the threshold. It has been a useful rhetorical target, so maybe this issue won’t disappear just yet.
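As a rough illustration of the screen-versus-dividing-line point, the sketch below computes a 10X threshold from a list of intraspecific distances and flags specimens whose nearest match is farther away than that. The function names, specimen labels and percent values are all made up, and this is not the procedure from the 2004 paper, only the arithmetic of the idea.

```python
def screening_threshold(intraspecific_distances, multiplier=10.0):
    """Return multiplier x the average intraspecific distance (in percent)."""
    if not intraspecific_distances:
        raise ValueError("need at least one intraspecific distance")
    mean = sum(intraspecific_distances) / len(intraspecific_distances)
    return multiplier * mean


def flag_for_review(nearest_match_distances, threshold):
    """Return specimens whose nearest-match distance exceeds the threshold.

    This is only a screen: a flagged specimen is a candidate for closer
    study, and an unflagged one is NOT thereby shown to be conspecific.
    """
    return [name for name, distance in nearest_match_distances.items()
            if distance > threshold]


# Hypothetical within-species percent differences for some group.
within_species = [0.25, 0.5, 0.125, 0.125]
threshold = screening_threshold(within_species)
print(threshold)  # 2.5

# Hypothetical distance from each specimen to its closest reference sequence.
nearest = {"specimen_A": 0.3, "specimen_B": 4.1, "specimen_C": 1.8}
print(flag_for_review(nearest, threshold))  # ['specimen_B']
```

In this toy case specimen_C sits at 1.8%, under the threshold, yet it could still belong to a distinct sister species; that is the situation the screen misses.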

On the character front, there are more publications defining discriminatory DNA barcode characters (eg Tavares and Baker 9 March 2008 BMC Evol Biol). It seems obvious to me that if, as is usually the case, sister species show large differences among and small differences within, then there must be diagnostic characters that distinguish them. The process of “translating” distances into characters should perhaps be a standard practice for nearest neighbor taxa in NJ trees; this would certainly give confidence (or not) as to whether one can reliably distinguish those species with less than 1% sequence difference. There is exciting development in character-based software tools (eg Ahrens et al 2007, Rosenberg 2007, Abdo and Golding 2007, Munch et al 2008) aimed at distinguishing the leaves (ie species) in addition to those already available for reconstructing the branches on the Tree of Life. I look forward to one that is friendly for non-specialists and works speedily on desktops!
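One simple way to picture “translating” distances into characters for a pair of nearest-neighbor species is sketched below in Python. It assumes aligned sequences without gaps or ambiguity codes, and it uses the strictest definition of a diagnostic character: a position where every specimen of one species shares one nucleotide and every specimen of the other shares a different one. This is an illustrative sketch with invented names and toy data, not the method of any of the tools cited above.

```python
def diagnostic_characters(species_a_seqs, species_b_seqs):
    """Return positions that are fixed within each species but differ between them.

    Each result is a tuple (1-based position, nucleotide in A, nucleotide in B).
    """
    if not species_a_seqs or not species_b_seqs:
        raise ValueError("each species needs at least one sequence")
    length = len(species_a_seqs[0])
    if any(len(s) != length for s in species_a_seqs + species_b_seqs):
        raise ValueError("sequences must be aligned to the same length")

    found = []
    for i in range(length):
        states_a = {seq[i] for seq in species_a_seqs}
        states_b = {seq[i] for seq in species_b_seqs}
        # Fixed within each species (one state each) and different between them.
        if len(states_a) == 1 and len(states_b) == 1 and states_a != states_b:
            found.append((i + 1, next(iter(states_a)), next(iter(states_b))))
    return found


# Toy aligned sequences for two hypothetical sister species.
species_x = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
species_y = ["ACGTGCGTAC", "ACGTGCGTAT", "ACGTGCGTAC"]

print(diagnostic_characters(species_x, species_y))  # [(5, 'A', 'G')]
```

Here positions 8 and 10 vary within one of the species and so are not diagnostic, while position 5 cleanly separates the two. With few specimens per species, a position can look diagnostic only because variation has not been sampled yet, so results like this are provisional.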

This entry was posted on Wednesday, January 28th, 2009 at 1:45 pm and is filed under General.

2 Responses to “Distances and characters”

  1. AJ Says:

    As someone who is new to barcoding and coming at it from the application standpoint in an ecological study similar to one mentioned below (identifying plant material from unknown individuals), I was surprised that there was no consensus as far as the best method to “identify” specimens that have been barcoded. It must be that most attention has been given to generating the barcodes, and much less to applying them after a reference database of barcodes has been generated. After looking at the above articles, I’m also surprised that some of the tree-based phylogenetic Bayesian methods are so incredibly slow.

    How is it that I can enter a search of a few words (or entire sentences) into Google and it gives me an incredibly close match to what I’m looking for (usually), but identifying species from a limited set of 600 characters or so is so slow? Also, a variety of plagiarism websites and programs are available that seem fairly fast as well. What am I missing?

  2. Mark Stoeckle Says:

    There are very rapid tools for searching existing nucleotide databases for “best matches” to a query sequence, including for example DNA barcodes. BLAST (basic local alignment search tool) is available on GenBank http://blast.ncbi.nlm.nih.gov/Blast.cgi and is the algorithm that underlies the search engine on BOLD http://www.barcodinglife.org. In either database, searching with a DNA barcode sequence will return a set of best matches, usually within a few seconds. On the other hand tree-building can be very slow, as mentioned in the post.

    Database identifications require having reference sequences from the relevant species and their close relatives in the library. There are an estimated 1.7 million or so named animal and plant species, while Barcode of Life Database (BOLD) has sequences from about 50,000 species so far, so the biggest challenge is building up the databases, and this requires concerted effort, which is what the Barcode of Life Initiative is about. In groups with incomplete database coverage, partial identifications (eg to family or genus level) are usually possible. Sometimes the answers are unclear, as there may be shared or overlapping barcodes among very young species, or species that hybridize regularly, but if the relevant sequences are in the database, the matching process is usually rapid and straightforward.

Contact: mark.stoeckle@rockefeller.edu

About this site

This web site is an outgrowth of the Taxonomy, DNA, and Barcode of Life meeting held at Banbury Center, Cold Spring Harbor Laboratory, September 9-12, 2003. It is designed and managed by Mark Stoeckle, Perrin Meyer, and Jason Yung at the Program for the Human Environment (PHE) at The Rockefeller University.

About the Program for the Human Environment

The involvement of the Program for the Human Environment in DNA barcoding dates to Jesse Ausubel's attendance in February 2002 at a conference in Nova Scotia organized by the Canadian Center for Marine Biodiversity. At the conference, Paul Hebert presented for the first time his concept of large-scale DNA barcoding for species identification. Impressed by the potential for this technology to address difficult challenges in the Census of Marine Life, Jesse agreed with Paul on encouraging a conference to explore the contribution taxonomy and DNA could make to the Census as well as other large-scale terrestrial efforts. In his capacity as a Program Director of the Sloan Foundation, Jesse turned to the Banbury Conference Center of Cold Spring Harbor Laboratory, whose leader Jan Witkowski prepared a strong proposal to explore both the scientific reliability of barcoding and the processes that might bring it to broad application. Concurrently, PHE researcher Mark Stoeckle began to work with the Hebert lab on analytic studies of barcoding in birds. Our involvement in barcoding now takes 3 forms: assisting the organizational development of the Consortium for the Barcode of Life and the Barcode of Life Initiative; contributing to the scientific development of the field, especially by studies in birds; and contributing to public understanding of the science and technology of barcoding and its applications through improved visualization techniques and preparation of brochures and other broadly accessible means, including this website. While the Sloan Foundation continues to support CBOL through a grant to the Smithsonian Institution, it does not provide financial support for barcoding research itself or support to the PHE for its research in this field.