Blog

Goldilocks finds mtDNA COI barcode length “just right” for distinguishing most animal species, asks why

The standard animal barcode, 648 bp of the mitochondrial gene COI, seems “just right” for delimiting most animal species. If it were “too short”, closely-related species would not be resolved. If it were “too long”, sequencing effort would be wasted. Here I examine what might underlie the Goldilocks effect.

The following figure looks at how often closely-related species (differing by 0.5%, 1%, or 2%) are predicted to have overlapping sequences. With the assumptions examined below, above 600 base pairs all but the most closely-related species will be distinguished, and above 800 base pairs there is little gain in sensitivity.
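A rough way to reproduce this kind of curve is to assume that substitutions fall independently and uniformly along the sequence, so that two species differing at a fraction d of sites show no differences over n base pairs with probability (1 - d)^n. The sketch below tabulates that probability. The function name and the chosen lengths are illustrative assumptions, and this is not necessarily the calculation used for the figure.

```python
def prob_identical(divergence: float, length_bp: int) -> float:
    """Chance that two sequences show no differences over length_bp sites,
    assuming each site differs independently with probability `divergence`."""
    return (1.0 - divergence) ** length_bp


if __name__ == "__main__":
    lengths = [100, 200, 400, 600, 648, 800, 1000]  # illustrative lengths
    divergences = [0.005, 0.01, 0.02]               # 0.5%, 1%, 2% as in the text
    print("length  " + "  ".join(f"{d:.1%}".rjust(8) for d in divergences))
    for n in lengths:
        row = "  ".join(f"{prob_identical(d, n):8.4f}" for d in divergences)
        print(f"{n:6d}  {row}")
```

Under this simple model the probability falls off geometrically with length, which is one way to see why gains flatten out once the barcode is several hundred base pairs long.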

 

The assumptions underlying this table-napkin analysis appear supported by data so far:  

First, mitochondrial DNA sequence differences between closely-related species are widely and relatively evenly distributed throughout the protein coding and ribosomal genes. For example, see an earlier post with percent identity plots comparing whole mitochondrial genomes for congeneric salamanders. Further support is provided by a plot of parallel sequence differences in the 2 most commonly utilized mitochondrial genes, COI and cytB. 

Second, most closely-related animal species have COI sequences that differ by at least 1%. For example, more than 98% of 13,320 congeneric pairs from a wide array of invertebrate and vertebrate species showed greater than 2% sequence difference (Hebert et al 2003 Proc Biol Sci 270:S96).

Third, intraspecific sequence variation in mtDNA is generally very low, less than 1% in most animal species.

If most closely-related species can be distinguished by short mtDNA sequences, then recognizing the sets of mtDNA sequences that make up species, i.e. species delimitation, should at least sometimes be simple. Using the neighbor joining tree of mtDNA barcodes below, an untrained person might pick out the groups of sequences that correspond to species. The top 5 groups represent previously unrecognized cryptic species of scorched mussel Brachidontes exustus (Lee and Ó Foighil 2004 Mol Ecol 13:3527).
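To make "picking out the groups" concrete, here is a minimal sketch that groups aligned sequences whenever their uncorrected pairwise difference falls below a threshold. It uses simple single-linkage clustering as a stand-in, not the neighbor joining method used for the tree, and the 1% default threshold and function names are assumptions chosen to match the figures quoted above.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_threshold(seqs: dict[str, str], threshold: float = 0.01) -> list[set[str]]:
    """Single-linkage grouping: sequences closer than `threshold` share a cluster."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())
```

When within-species variation is well below the threshold and between-species differences are well above it, the clusters returned should coincide with species. Where the two ranges overlap, a threshold alone will lump or split species.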

Goldilocks leaves us with the scientific questions: why are differences within most species so small, and why are the distances between most nearest neighbor species so large?

 

“Tag sequencing” reveals vast microbial diversity

A “tag sequencing” approach analogous to mtCOI sequencing for barcoding multicellular organisms reveals vast numbers of very rare, highly divergent, deep sea microbes. In August 2006 PNAS (Sogin et al Proc Natl Acad Sci USA 103:12115), researchers from the Marine Biological Laboratory at Woods Hole and the Royal Netherlands Institute for Sea Research report on pooled bacterial samples collected at 550-1,710 meters in the Atlantic Ocean. To enable detection of rare populations, they focused on a short hypervariable region of 16S rRNA (only 79 bases) and analyzed a large number of PCR amplicons (118,000!) using 454 Life Sciences technology. This approach makes it economical to analyze enormous numbers of sequences from pooled environmental samples and avoids possible selection artifacts due to biases in amplifying longer PCR products and in cloning. Remarkably, the very short 79 base pair “tag” captured about 90% of the sequence differences in full-length 1500 base pair 16S rRNA sequences.

 

Sogin et al 2006 PNAS

The results were compared to a V6 hypervariable region database, which contains about 40,000 unique V6 sequences extracted from the nearly 120,000 published bacterial rRNA gene sequences. A small number of sequence tags similar to known bacteria made up most of the samples, including 25% that were identical to sequences in the database and 40% that were no more than 3% different. Overall 75% of “total tags” were less than 10% different from previously sequenced bacteria. The remaining 25% consisted of thousands of low abundance, extraordinarily diverse populations. The authors conclude the “rare biosphere” is “an ancient and..nearly inexhaustible source of genomic innovation..[that] at different times in earth’s history..may have had a profound impact on shaping planetary processes.” There is a lot more we will learn through standardized genetic analysis using short sequences, including mtCOI barcodes and V6 rRNA tags, applied to vast numbers of organisms.

Oggi

The Italian weekly magazine Oggi published a colorful spread about the Census of Marine Life in its 26 June 2006 edition entitled “ET swims in the abyss.” The opening page includes a photo of Jesse about to go to work on an arctic ice floe.

Scientific American SuperGrid

The Continental Supergrid is featured in the July 2006 issue of Scientific American:

ENERGY
A Power Grid for the Hydrogen Economy
By Paul M. Grant, Chauncey Starr and Thomas J. Overbye
Cryogenic, superconducting conduits could be connected into a
“supergrid” that would simultaneously deliver electrical power
and hydrogen fuel.

We are delighted that the Supergrid concept has advanced strongly from the November 2002 and October 2004 feasibility workshops.

Elektron Splicer

Our 1996 paper Elektron introduced the notion of an appliance, “The Splicer”, a multipurpose minigenerator at the level of the individual consumer that could produce heat, electricity, and cold on demand and support the electric power system during peak and standby periods, thereby improving reliability. We are delighted to find that engineer Andreas Gratzer of the Austrian company OMV has developed a working splicer rated at 70,000 rpm and 115 kW. A German report details his thinking.

Barcode libraries grow on the web

The All Birds Barcoding Initiative (ABBI) website barcodingbirds.org provides a continuously updated progress report on barcoding world birds. A live feed matches barcodes deposited in the Barcode of Life Data Systems (BOLD) to a checklist of world birds. Visitors to barcodingbirds.org can view world and regional progress reports, progress by orders and families, and detailed results for individual species, including zoomable Google world maps showing where barcodes were collected. A link out to species pages in the Integrated Taxonomic Information System (ITIS) is provided.

A sibling website fishbol.org provides live updates for the Fish Barcode of Life initiative (FishBOL) which aims to collect barcodes from all fishes, approximately 30,000 species.

In addition to helping researchers scattered across the globe track progress and coordinate efforts, these sites will interest many other persons. They link an enormous amount of taxonomic information with growing genetic databases derived from museum collections. The instant Google maps provide an early glimpse of what these sites can do. Future tools will overlay genetic differences in mitochondrial DNA barcodes on top of the geographic map. These “mashups” of traditional taxonomy, widely-accessible species identification through genetic barcode analysis, and user-friendly visualization will have many viewers.

Minimalist DNA barcodes to help with museum specimens

Analyzing shorter barcode sequences is an inexpensive way to link museum specimens with degraded DNA to the barcode database. In Molecular Ecology Notes July 2006, Hajibabaei et al first demonstrate in silico that COI sequences as short as 109 base pairs contain enough information to assign most specimens to known species, using simulated “minibarcodes” taken from two full-length barcode datasets. The researchers then analyzed the recovery and performance of various lengths of “minibarcodes” amplified from 33 dried and 91 ethanol-preserved insect specimens ranging in age from 1 to 21 years. As shown in the figure below, although full-length barcodes were recovered from only 24-39% of specimens, there was encouragingly high success amplifying shorter segments.

 

As expected from the in silico analysis, in most cases species could be distinguished as well as with full-length barcodes, i.e. sequences formed distinct non-overlapping clusters in a NJ tree. Hajibabaei et al’s results indicate that analyzing shorter minibarcode sequences can link museum specimens with degraded DNA to the gold standard full-length barcode database. Rather than spending time and money optimizing primers and amplification conditions for individual specimens, researchers can apply a general method that recovers a 100-400 bp fragment. They point out this approach will be useful “when barcoding reveals several cryptic species within what had been viewed as one species, and it is not morphologically evident which of them matches the holotype” and as “a cost-effective way of building barcode libraries with broad geographical coverage”. They caution that “very short barcode sequences are..valuable for the identification of old specimens from SELECTED NARROW taxonomic arrays” (emphasis added).

I agree a mini-barcode approach can be useful in certain situations, and emphasize their caution that it is not a substitute for a standardized full-length barcode database. First, if widely used, a minimalist approach could easily devolve into a Tower of Babel, with a hodgepodge of non-overlapping minibarcodes that cannot be compared to each other. Second, even if the minibarcodes were standardized so they all overlapped, a simple calculation implies that they would lump together most species with less than 1% sequence difference (in birds, this is about 15% of species). Less than 1% sequence difference means fewer than 6.5 diagnostic differences in a full-length barcode (1% of 648 bp), and assuming randomly distributed substitutions, a shorter barcode could easily fail to capture any diagnostic differences.
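The simple calculation can be sketched as follows. If k diagnostic sites are scattered at random among the 648 positions of a full-length barcode, the chance that a minibarcode covering m of those positions contains none of them is hypergeometric: C(648 - k, m) / C(648, m). The function name and the example lengths and site counts are illustrative assumptions. Real substitutions are not perfectly uniform, so the numbers are rough guides only.

```python
from math import comb

FULL_LENGTH = 648  # standard COI barcode length in base pairs


def prob_no_diagnostic_sites(mini_length: int, n_diagnostic: int,
                             full_length: int = FULL_LENGTH) -> float:
    """Probability that a minibarcode of mini_length positions misses all
    n_diagnostic sites, if those sites are placed uniformly at random."""
    return (comb(full_length - n_diagnostic, mini_length)
            / comb(full_length, mini_length))


if __name__ == "__main__":
    # Illustrative values: minibarcode lengths from the 100-400 bp range,
    # and a few diagnostic-site counts consistent with <1% divergence.
    for mini_length in (109, 200, 400):
        for n_diagnostic in (1, 3, 6):
            p = prob_no_diagnostic_sites(mini_length, n_diagnostic)
            print(f"{mini_length:4d} bp, {n_diagnostic} sites: P(miss all) = {p:.3f}")
```

The shorter the fragment and the fewer the diagnostic sites, the more likely it is that two distinct species yield identical minibarcodes.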

Some taxonomists begin to worry less

In 21 June 2006 Heredity News and Commentary “DNA barcodes: recent successes and future prospects” Dasmahapatra and Mallet describe the DNA barcoding initiative as “plausible and worthwhile” and conclude that “recent studies convincingly demonstrate the efficacy of DNA barcoding to recover biologically significant groupings or species”. Their generally positive review stumbles near the end with a call “to supplement the mtDNA-based barcode with nuclear barcodes.” This is an impractical proposal of uncertain benefit. First of all, routinely adding a “nuclear barcode”, if one were to be found, would be solving a problem that does not exist, as there are few cases so far in which an mtDNA COI barcode does not distinguish closely-related species. Of course these exceptional cases need further taxonomic study “integrating DNA sequencing, morphology, and ecologic studies”. Secondly, although over 30 years of research demonstrate the broad utility of mtDNA in delimiting animal species, no one has yet identified a nuclear locus that can regularly distinguish closely-related species, as Dasmahapatra and Mallet acknowledge. 

Most of the topics in their review are analyzed in our 2005 brochure “Barcoding Life, Illustrated”, which outlines the benefits and limitations of DNA barcoding, including a section on “Why barcode animals with mitochondrial DNA?”.

Establishing a DNA standard barcode for land plants

Kew Scientist April 2005

Plant researchers from 11 world herbaria are investigating DNA regions for their potential as barcodes for land plant species. From the project rationale: “although the mitochondrial gene region, CO1 (cox1), has already been used with considerable success across a range of animal groups and shows promise in at least some algal groups, it is characterized by relatively low rates of sequence divergence in land plants. Mitochondrial DNA in land plants also undergoes rearrangements, exhibits incorporation of foreign genes and frequent transfer of some genes to the nuclear genome. It is therefore desirable to find an alternative region or, if necessary, regions from one of the other genomes that would be suitable as a barcode.”

This project aims to establish a standard DNA barcode for land plants. Phase 1, completed in December 2005, was a survey of regions that have potential as land plant barcodes. Phase 2, to be completed by January 2007, is to “ground test” the most promising regions in a series of parallel case studies that incorporates representatives of all major land plant lineages.

If this competition is successful, it should be relatively straightforward (i.e. fast and inexpensive) to compile a comprehensive library of plant DNA barcodes, as there are only about 500,000 known plant species, the world catalog is thought to be essentially complete, and there are several herbaria with large specimen and DNA collections.