The Barcode Blog

A mostly scientific blog about short DNA sequences for species identification and discovery. I encourage your commentary. -- Mark Stoeckle


Visualizing large data sets

Growing barcode libraries challenge understanding. There are already about 200,000 mtCOI barcodes from about 25,000 species in BOLD, the Barcode of Life Data Systems database. The burgeoning data sets hint at insights into biological diversity that might be revealed by looking at many species at once. A map of counties of the United States shows both large- and small-scale patterns, shaped by history, geography, and politics. Viewed through the lens of mitochondrial variation, what would a map of species show? Are differences among and within species similar in birds and butterflies? Do species boundaries differ in marine vs terrestrial species, or in tropical vs temperate zones?



Here I offer one possible way of visualizing differences in barcode data sets, using as an example the BOLD “Hesperiidae of the ACG 1” Project, which contains 2,185 COI sequences from 355 species of skipper butterfly in the Area de Conservación Guanacaste (ACG) in Costa Rica (Hajibabaei et al Proc Natl Acad Sci USA 2006 103:968). The BOLD-generated neighbor-joining tree of ACG Hesperiid COI sequences shown at left offers a traditional way of comparing sequences and is an essential step in looking at individual species and their close genetic neighbors. However, the NJ tree contains only about 100 sequences from 20 species per page, and so runs to 22 pages. In the future it will likely be desirable to compare much larger data sets from, say, all 3,700 known species of world skipper butterflies.

For DNA barcoding, the essential information is differences among and within species. The higher-level groupings of species which are inevitably generated by a tree are of less interest. (In the following analysis distances are used simply to examine patterns of variation, NOT to determine whether they are sufficient for diagnosing species.) 

One useful approach is to generate histograms of differences within and between species. BOLD has a “Nearest Neighbor” analytic function which generates a table of mean and maximum variation within each species, “nearest neighbor” distance to the next closest species, and histogram summaries of the results.  
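As a rough illustration of the kind of table described above, here is a minimal Python sketch. It is not BOLD's implementation: the function name `species_summary`, the input layout (a list of species labels plus a symmetric matrix of pairwise distances in percent), and the choice to define "nearest neighbor" distance as the smallest pairwise distance to any sequence of another species are all assumptions made for this example, and the toy numbers are invented.

```python
from itertools import combinations


def species_summary(labels, dist):
    """Summarize within- and between-species distances.

    labels: species name for each sequence (hypothetical input format)
    dist:   symmetric matrix (list of lists) of pairwise distances, in percent
    """
    # Group sequence indices by species name.
    members = {}
    for i, sp in enumerate(labels):
        members.setdefault(sp, []).append(i)

    summary = {}
    for sp, idx in members.items():
        # All pairwise distances among sequences of the same species.
        within = [dist[i][j] for i, j in combinations(idx, 2)]
        # Distances from this species' sequences to every other species' sequences.
        between = [
            (dist[i][j], labels[j])
            for i in idx
            for j in range(len(labels))
            if labels[j] != sp
        ]
        # Assumption: nearest neighbor = minimum distance to any other species.
        nn_dist, nn_species = min(between) if between else (None, None)
        summary[sp] = {
            "n": len(idx),
            # Species with a single sequence have no intraspecific distance.
            "mean_intra": sum(within) / len(within) if within else None,
            "max_intra": max(within) if within else None,
            "nn_dist": nn_dist,
            "nn_species": nn_species,
        }
    return summary


# Toy data: three made-up species, five sequences.
labels = ["A", "A", "B", "B", "C"]
dist = [
    [0.0, 0.2, 3.0, 3.5, 6.0],
    [0.2, 0.0, 3.2, 3.4, 6.1],
    [3.0, 3.2, 0.0, 0.6, 5.0],
    [3.5, 3.4, 0.6, 0.0, 5.2],
    [6.0, 6.1, 5.0, 5.2, 0.0],
]
table = species_summary(labels, dist)
```

Species represented by a single sequence get `None` for the intraspecific columns, which is why such species drop out of any analysis that needs a mean intraspecific distance.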

The histograms quickly show that distances within most species are small and minimum distances between species are generally larger. Histograms are summaries with unlimited capacity. However, one might want to know more about individual species. For example, do species with higher intraspecific distances also show greater interspecific distances? One also wonders about the variation below 1% in both panels. In Beautiful Evidence, Edward Tufte points out that histograms display relatively small amounts of data, usually 1 value per column. How to generate something with more information, more like the US Counties map, but not 22 pages long?

The graph at left uses the same 2 essential parameters: distance within each species and distance to nearest genetic neighbor. Because the usual distances within and between species are very different, plotting on a logarithmic scale allows one to inspect the variation in each set simultaneously. The results with 278 of the ACG skipper species (all those for which more than one individual was sampled, thereby generating a mean intraspecific distance) are shown. For each species, there is a black dot showing intraspecific distance and a red dot directly above or below showing distance to nearest neighbor. Sorting by intra- and interspecific distance allows the relative distances for each species to be seen. This graph highlights the relatively few species with nearest neighbor distances less than the mean intraspecific distance for that species. A line drawn at 1% appears to separate most of the intraspecific from interspecific values.
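The data preparation behind a graph of this kind might look like the sketch below. It is only a guess at the steps, not the procedure used for the figure: the function name `prepare_log_plot`, the sort order (mean intraspecific distance first, nearest neighbor distance as tie-breaker), and the 0.01% floor used so that zero distances can sit on a logarithmic axis are all assumptions, and the three example rows are invented.

```python
import math


def prepare_log_plot(rows, floor=0.01, line=1.0):
    """Turn per-species distances into points for a log-scale dot plot.

    rows:  list of (species, mean_intra, nn_dist), distances in percent
    floor: assumed stand-in for zero distances, which have no logarithm
    line:  reference threshold in percent (the 1% line in the text)
    """
    # Assumed ordering: by intraspecific distance, then nearest neighbor distance.
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    points = []
    for x, (species, intra, nn) in enumerate(ordered):
        points.append({
            "x": x,                      # shared column for the black and red dot
            "species": species,
            "log_intra": math.log10(max(intra, floor)),  # black dot
            "log_nn": math.log10(max(nn, floor)),        # red dot
            "overlap": nn < intra,       # nearest neighbor closer than own mean
        })
    # Fraction of species the reference line separates cleanly.
    separated = sum(1 for _, intra, nn in rows if intra < line <= nn)
    return points, separated / len(rows)


# Invented example values, in percent.
rows = [("sp1", 0.2, 3.0), ("sp2", 0.0, 1.5), ("sp3", 1.2, 0.4)]
points, fraction_separated = prepare_log_plot(rows)
flagged = [p["species"] for p in points if p["overlap"]]
```

Each point carries one x position and two y values, so any plotting library can draw the black and red dots in the same column; the `overlap` flag picks out the species whose nearest neighbor distance falls below their own mean intraspecific distance.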

This graph is remarkably compressible, as shown by the small inset in the US county map above and in the figure at right. Here this is used to compare variation in Costa Rican skippers (278 species in 1 family) to that in Australian fish (172 species in 1 class) (Ward et al 2006 Phil Trans Royal Soc B 360:1471). The distribution of intraspecific variation seems quite similar, while the nearest neighbor distances in fish are higher, presumably reflecting less dense sampling of a larger taxon. In the fish paper, the red dots at bottom were thought to be ID errors, so perhaps some of those in the skipper data set showing zero distance between species are taxonomic errors as well. This graphic approach could be useful in comparing patterns of intra- and interspecific variation in marine vs terrestrial, tropical vs temperate, and allopatric vs sympatric species.


This entry was posted on Thursday, November 30th, 2006 at 11:23 pm and is filed under General.



About this site

This web site is an outgrowth of the Taxonomy, DNA, and Barcode of Life meeting held at Banbury Center, Cold Spring Harbor Laboratory, September 9-12, 2003. It is designed and managed by Mark Stoeckle, Perrin Meyer, and Jason Yung at the Program for the Human Environment (PHE) at The Rockefeller University.

About the Program for the Human Environment

The involvement of the Program for the Human Environment in DNA barcoding dates to Jesse Ausubel's attendance in February 2002 at a conference in Nova Scotia organized by the Canadian Center for Marine Biodiversity. At the conference, Paul Hebert presented for the first time his concept of large-scale DNA barcoding for species identification. Impressed by the potential for this technology to address difficult challenges in the Census of Marine Life, Jesse agreed with Paul on encouraging a conference to explore the contribution taxonomy and DNA could make to the Census as well as other large-scale terrestrial efforts. In his capacity as a Program Director of the Sloan Foundation, Jesse turned to the Banbury Conference Center of Cold Spring Harbor Laboratory, whose leader Jan Witkowski prepared a strong proposal to explore both the scientific reliability of barcoding and the processes that might bring it to broad application. Concurrently, PHE researcher Mark Stoeckle began to work with the Hebert lab on analytic studies of barcoding in birds. Our involvement in barcoding now takes 3 forms: assisting the organizational development of the Consortium for the Barcode of Life and the Barcode of Life Initiative; contributing to the scientific development of the field, especially by studies in birds; and contributing to public understanding of the science and technology of barcoding and its applications through improved visualization techniques and preparation of brochures and other broadly accessible means, including this website. While the Sloan Foundation continues to support CBOL through a grant to the Smithsonian Institution, it does not provide financial support for barcoding research itself or support to the PHE for its research in this field.