Distances and characters

Almost 4 years ago, in the October 2005 Philos Trans R Soc Lond B Biol Sci, researchers from the American Museum of Natural History examined the then-nascent DNA barcoding effort, looking at which methods were best for integrating the growing pool of DNA barcode data into systematics, the science of classifying organisms based on evolutionary history. Using real-world examples, authors DeSalle, Egan, and Siddall argued strongly for “characters” and against “distances” when using DNA barcode data to identify species, ie assigning specimens to known species and discovering new species. Of course, sequence data was already the backbone of modern systematics, but it had primarily been applied to reconstructing evolutionary branching patterns (eg what pattern of divergences led to the various orders of birds) and less so to the definition of species. For example, most phylogenetic work included single exemplars of each species. Analyzing sequence differences among and within closely related species was more the domain of phylogeography, which generally did not explicitly aim to define new species.

Here is a brief aside. In analyzing sequences, “characters” refer to specific nucleotides (eg guanine (G) at position 138 in the COI gene) and “distances” refer to per cent differences between sequences. So right away you can see that characters are intrinsic to the specimen’s DNA, whereas distances are defined only in relation to sequences from other specimens. Systematists like characters; for one thing, characters allow sequence and morphologic data to be integrated. Characters are the grist for the computational workhorses of systematics, Parsimony and Maximum Likelihood. Meanwhile, beginning with the first paper published in 2003, distances displayed in neighbor-joining (NJ) trees have been the usual heuristic approach for analyzing DNA barcode differences among and within species. A crucial advantage of neighbor-joining distance analysis is speed. Creating an NJ distance tree from 1,000 barcode sequences of 648 bp each might take a minute on a desktop computer, whereas Maximum Likelihood reconstruction might take several weeks. Unlike reconstructing the Tree of Life, DNA barcoding is a recurrent exercise that repeatedly involves submitting new data from multiple known and unknown specimens, so a fast analytic method is essential.
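To make the aside concrete, here is a minimal Python sketch of the two ideas. It assumes the sequences are already aligned and of equal length; the function names and the 10-base toy sequences are hypothetical and stand in for real 648 bp barcodes.

```python
def character_at(seq, position):
    """A 'character': the nucleotide at a 1-based alignment position.
    It can be read from one specimen's sequence alone."""
    return seq[position - 1]


def percent_distance(seq_a, seq_b):
    """A 'distance': per cent of aligned positions at which two sequences differ.
    It only exists relative to a second sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * differences / len(seq_a)


# Hypothetical toy sequences (not real barcode data).
specimen_1 = "ACGTACGTAC"
specimen_2 = "ACGTACGTAT"
specimen_3 = "ACGAACGTTT"

print(character_at(specimen_1, 4))               # nucleotide at position 4
print(percent_distance(specimen_1, specimen_2))  # one difference in ten sites
print(percent_distance(specimen_1, specimen_3))  # three differences in ten sites
```

Note that this simple per cent difference is only one of several distance measures; published barcode analyses often apply a substitution model before building the NJ tree.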

Four years later, where are we? Most DNA barcoding analyses continue to rely on NJ distance trees, and this approach has proven to be a durable heuristic, enabling one to distinguish among most species analyzed so far. Regarding species discovery, NJ distance trees demonstrate continued value as a first step in flagging divergent lineages that may represent new species. Here there is something of a roadblock, in that defining a new species is a human judgement, sort of like a medical diagnosis, while sequence differences are like medical laboratory results. Community standards do not accept divergent mtDNA sequences as sufficient evidence to define a new species, although at the same time it is generally acknowledged that such sequences do indicate a new species, albeit one that hasn’t been officially defined yet. For example, in a Nov 2008 news item, researchers confidently assert “DNA tests identify new dolphin species” (based on an article published in the Nov 2008 Mol Phylogenet Evol), yet include the statement “it is awaiting a scientific name after a formal description.” I expect the researchers knew they had a new species with the first mtDNA sequence from a single individual! For the DNA barcoding effort it should not be necessary to wait for final taxonomic decisions; we can proceed with publicly disseminating a broad-range, fine-scale map of biodiversity, which can then be annotated with taxonomic information as it arrives. Like sky surveys and the human genome project, we should aim to make the “barcode biodiversity map” public as quickly as possible.

On the other side, it is now a commonplace observation that a 10X threshold (10 times the average intraspecific variation) is NOT a universal dividing line between intra- and inter-specific variation. To get technical, this was originally proposed as a screen for new species, but it has been taken as a dividing line between intra- and inter-specific distances, which it certainly is not; in the original 2004 paper (I am a co-author) there are many sister species separated by distances less than the threshold. It has been a useful rhetorical target, so maybe this issue won’t disappear just yet.
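As a rough sketch of how the 10X rule works when used as a screen rather than as a dividing line, the Python below computes the threshold from a set of within-species distances and flags lineages that exceed it. The function names, lineage labels, and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def tenx_threshold(intraspecific_distances):
    """10 times the average within-species distance (in per cent)."""
    return 10 * mean(intraspecific_distances)


def flag_for_review(nearest_neighbor_distances, threshold):
    """Return lineages whose distance to their nearest neighbor exceeds the threshold.
    A flag means 'worth a closer look', not 'new species'."""
    return [name for name, distance in nearest_neighbor_distances.items()
            if distance > threshold]


# Hypothetical within-species distances for a group (per cent).
within_species = [0.25, 0.5, 0.75]
threshold = tenx_threshold(within_species)

# Hypothetical distances from each lineage to its nearest neighbor (per cent).
lineages = {"lineage_A": 7.5, "lineage_B": 1.0, "lineage_C": 5.0}
print(threshold)
print(flag_for_review(lineages, threshold))
```

In this toy data, lineage_B falls well under the threshold and is not flagged, yet nothing in the calculation says it is the same species as its neighbor. That is the sense in which the rule screens for candidates without separating intra- from inter-specific distances.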

On the character front, there are more publications defining discriminatory DNA barcode characters (eg Tavares and Baker 9 March 2008 BMC Evol Biol). It seems obvious to me that if, as is usually the case, sister species show large differences between species and small differences within them, then there must be diagnostic characters that distinguish them. The process of “translating” distances into characters should perhaps be a standard practice for nearest neighbor taxa in NJ trees; this would certainly give confidence (or not) as to whether one can reliably distinguish those species with less than 1% sequence difference. There is exciting development in character-based software tools (eg Ahrens et al 2007, Rosenberg 2007, Abdo and Golding 2007, Munch et al 2008) aimed at distinguishing the leaves (ie species), in addition to those already available for reconstructing the branches on the Tree of Life. I look forward to one that is friendly for non-specialists and works speedily on desktops!
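One simple way to “translate” distances into characters for a pair of nearest-neighbor species is sketched below in Python. It uses the strictest definition I can think of, which is my assumption and not taken from any of the cited tools: a position is diagnostic when every specimen of one species shares one nucleotide and every specimen of the other shares a different one. The function name and toy sequences are hypothetical, and the sequences are assumed to be aligned.

```python
def diagnostic_characters(species_a, species_b):
    """Return {position: (nucleotide_in_a, nucleotide_in_b)} for 1-based
    alignment positions that are fixed within each species and differ between them."""
    length = len(species_a[0])
    if any(len(seq) != length for seq in species_a + species_b):
        raise ValueError("sequences must be aligned to the same length")

    diagnostic = {}
    for i in range(length):
        states_a = {seq[i] for seq in species_a}
        states_b = {seq[i] for seq in species_b}
        # Fixed in both species (one state each) and the states differ.
        if len(states_a) == 1 and len(states_b) == 1 and states_a != states_b:
            diagnostic[i + 1] = (next(iter(states_a)), next(iter(states_b)))
    return diagnostic


# Hypothetical toy alignments, two specimens per species.
species_a = ["ACGTACGT", "ACGTACGA"]
species_b = ["ACCTACGT", "ACCTACGT"]

print(diagnostic_characters(species_a, species_b))
```

Here position 3 separates the two species cleanly, while position 8 varies within the first species and so is not diagnostic. With only a few specimens per species, an apparently fixed difference may disappear as sampling grows, so any such list is provisional.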

2 thoughts on “Distances and characters”

  1. As someone who is new to barcoding and coming at it from the application standpoint in an ecological study similar to one mentioned below (identifying plant material from unknown individuals), I was surprised that there was no consensus as far as the best method to “identify” specimens that have been barcoded. It must be that most attention has been given to generating the barcodes, and much less to applying them after a reference database of barcodes has been generated. After looking at the above articles, I’m also surprised that some of the tree-based phylogenetic Bayesian methods are so incredibly slow.

    How is it that I can enter a search of a few words (or entire sentences) into Google and it gives me an incredibly close match to what I’m looking for (usually), but identifying species from a limited set of 600 characters or so is so slow? Also, a variety of plagiarism websites and programs are available that seem fairly fast as well. What am I missing?

  2. There are very rapid tools for searching existing nucleotide databases for “best matches” to a query sequence, including for example DNA barcodes. BLAST (basic local alignment search tool) is available on GenBank http://blast.ncbi.nlm.nih.gov/Blast.cgi and is the algorithm that underlies the search engine on BOLD http://www.barcodinglife.org. In either database, searching with a DNA barcode sequence will return a set of best matches, usually within a few seconds. On the other hand, tree-building can be very slow, as mentioned in the post.

    Database identifications require having reference sequences from the relevant species and their close relatives in the library. There are an estimated 1.7 million or so named animal and plant species, while the Barcode of Life Database (BOLD) has sequences from about 50,000 species so far, so the biggest challenge is building up the databases, and this requires concerted effort, which is what the Barcode of Life Initiative is about. In groups with incomplete database coverage, partial identifications (eg to family or genus level) are usually possible. Sometimes the answers are unclear, as there may be shared or overlapping barcodes among very young species, or species that hybridize regularly, but if the relevant sequences are in the database, the matching process is usually rapid and straightforward.
