How accurate are BARCODE databases?

DNA barcode databases are a kind of Wikipedia of DNA identifiers, with contributions by thousands of researchers. How accurate are they? How do records that meet the BARCODE standard compare to routine GenBank records? How many BARCODE records represent pseudogenes masquerading as their functional counterparts?

In case you missed this, Kevin Kerr and I recently analyzed sequencing error among 11,000 avian BARCODEs representing 2,700 bird species (PLoS ONE e43992 2012), using a frequency matrix approach to look at patterns of variation. As illustrated below, we found that very low frequency nucleotide variants (VLFs) found in single individuals of a species (labeled “singletons” in figure) are strongly concentrated at the ends of the barcode segment, consistent with sequencing error.

In contrast, very low frequency variants found in two or more individuals of a species (labeled “shared” in figure) provided a nice control: these were relatively evenly distributed, consistent with biological origin. Not surprisingly, given that most of the very rare nucleotide variants were associated with amino acid substitutions, very rare amino acid variants showed the same distribution patterns.
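For readers who want to try the idea on their own alignments, here is a minimal sketch of a frequency matrix and the singleton/shared split described above. It is an illustration, not the code used in the paper: the function names, the (species, sequence) input layout, and the toy data are assumptions, and it expects sequences that are already aligned and of equal length. Only the 0.1% cutoff comes from the analysis itself.

```python
from collections import Counter, defaultdict


def frequency_matrix(seqs):
    """Return, for each alignment position, the frequency of each base.

    Assumes all sequences are aligned and have the same length.
    """
    n = len(seqs)
    matrix = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        matrix.append({base: c / n for base, c in counts.items()})
    return matrix


def classify_vlfs(records, threshold=0.001):
    """Label very low frequency variants (VLFs) as 'singleton' or 'shared'.

    records: list of (species, aligned_sequence) pairs (hypothetical layout).
    A variant is a VLF if its frequency across the whole dataset is below
    `threshold`. It is a 'singleton' if one individual of the species
    carries it, and 'shared' if two or more individuals do.
    """
    matrix = frequency_matrix([seq for _, seq in records])
    carriers = defaultdict(int)
    for species, seq in records:
        for pos, base in enumerate(seq):
            if matrix[pos][base] < threshold:
                carriers[(species, pos, base)] += 1
    return {
        key: ("singleton" if count == 1 else "shared")
        for key, count in carriers.items()
    }


if __name__ == "__main__":
    # Toy data: 10 short "barcodes" from two made-up species.
    toy = (
        [("A", "ACGT")] * 4 + [("A", "TCGT")]      # one odd individual
        + [("B", "ACGT")] * 3 + [("B", "ACGA")] * 2  # two individuals agree
    )
    # A toy threshold is needed because 10 sequences cannot resolve 0.1%.
    for key, label in sorted(classify_vlfs(toy, threshold=0.25).items()):
        print(key, label)
```

Plotting the positions of the singleton and shared variants along the barcode would then give the kind of distribution comparison shown in the figure.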

In addition to analyzing sequencing error, we closely examined the small fraction (0.1%) of BARCODEs with multiple very low frequency variants shared among individuals of a species. Based on review of trace files deposited as part of the BARCODE standard, these unusually divergent versions of COI turned out to be overlooked cryptic pseudogenes lacking stop codons!

We were able to calculate an error rate for the dataset, using the observation that most (94%) second codon positions were >99.9% conserved, which meant that nearly all sequencing errors at second position sites would be detectable as very low frequency (<0.1%) variants. The calculated upper limit of sequencing error was 8 × 10⁻⁵ errors/nucleotide, which is 1 to 2 orders of magnitude higher than generally cited for direct Sanger sequencing of amplified DNA, but unlikely to compromise species identification. Overall, we found about 3% of BARCODEs have 1 or more errors (average 1.4). To our knowledge, this is the first assessment of sequencing error for a large public sequence database with multiple contributors. It might be useful to annotate records that have probable sequencing errors or that represent cryptic pseudogenes; I believe that annotation is possible in BOLD and not in GenBank.
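The arithmetic behind an upper limit of this kind can be sketched in a few lines. This is a simplified reading of the approach, not the paper's exact procedure: it treats every very low frequency variant at a highly conserved site as a sequencing error and divides by the number of nucleotides examined. The function names and all counts in the example are hypothetical, chosen only so that the result matches the order of magnitude quoted above.

```python
def error_rate_upper_bound(vlf_count, n_records, n_conserved_sites):
    """Upper limit on sequencing errors per nucleotide.

    Treats every very low frequency variant observed at a highly
    conserved site as an error, so the true rate can only be lower.
    """
    nucleotides_examined = n_records * n_conserved_sites
    return vlf_count / nucleotides_examined


def expected_errors_per_record(rate, record_length):
    """Average number of errors expected in a record of the given length."""
    return rate * record_length


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the paper.
    rate = error_rate_upper_bound(vlf_count=16, n_records=1000,
                                  n_conserved_sites=200)
    print(f"upper limit: {rate:.1e} errors/nucleotide")
    print(f"errors per 500-nt record: {expected_errors_per_record(rate, 500):.3f}")
```

Because substitutions that are real but extremely rare are also counted as errors here, the estimate is conservative, which is why it is described as an upper limit.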

In addition to confirming the high quality of the avian BARCODE database, we were able to demonstrate significant quality improvement in avian BARCODE and non-BARCODE COI records deposited in GenBank over the past decade, as shown at right (bars indicate 95% confidence interval).

The frequency matrix we describe has potential application for genetic database quality assessment, discovery of cryptic pseudogenes, and studies of low-level variation.

Our results were presented at the Third European Congress for the Barcode of Life (ECBOL3), held at the Royal Flemish Academy of Belgium for Sciences and the Arts (KVAB) in Brussels in September (group photo below).

A PowerPoint based on the PLoS ONE article is available here: freq matrix stoeckle 8nov2012

Ocean Champion

On 26 October 2012, Monmouth University (New Jersey) named Jesse Ausubel the 2012 National Champion of the Ocean. We are honored to join lustrous company and appreciate that the award recognizes the work of the entire Census of Marine Life community. The award ceremony included an excellent seminar organized by Tony MacDonald of the Urban Coastal Institute featuring Admiral (Ret.) Paul Gaffney; Vice Admiral Richard Larrabee, USCG (Ret.), Port Commerce Director, Port Authority of NY/NJ; Lawrence Dickerson, President and CEO, Diamond Offshore Drilling; and Christopher Koch, President and CEO, World Shipping Council. We post Jesse’s talk on Wealth from Oceans.

National Ocean Champion awardees
2012 Jesse H. Ausubel, Co-Founder, Census of Marine Life; Alfred P. Sloan Foundation and The Rockefeller University
2011 Jean-Michel Cousteau, Founder, Ocean Futures Society
2010 Carl Safina, President and Co-Founder, Blue Ocean Institute
2009 Lillian C. Borrone, Former Executive Director, Port Authority of NY/NJ
2008 Representative James Saxton (NJ) and Shirley Pomponi, Executive Director, Harbor Branch Oceanographic Institute
2007 Jerry Schubel, President and CEO of the Aquarium of the Pacific, and Ted Ames, Director of the Lobster Hatchery in Stonington, Maine
2006 Robert Gagosian, past President and Director of the Woods Hole Oceanographic Institution
2005 Admiral James Watkins, Chair of the U.S. Commission on Ocean Policy and the Honorable Leon Panetta, Chair of the Pew Ocean Commission

Smithsonian exhibit about Census of Marine Life

The National Museum of Natural History of the Smithsonian Institution in Washington DC has added a small but excellent exhibit about the Census of Marine Life to the Sant Hall of Ocean Life. A couple of dozen CoML alumni participated in a ceremony to welcome the exhibit, which includes the splendid gold medal of the International Cosmos Prize awarded in 2011 to the Census Steering Committee.

JA CoML at Smithsonian

European Barcode Conference

With about 130 other experts from 28 countries, Mark Stoeckle and Jesse Ausubel attended the 3rd ECBOL conference of the European Consortium for the Barcode of Life (ECBOL), the Royal Belgian Institute of Natural Sciences, and the Royal Museum for Central Africa, held under the theme “Barcoding of Organisms of Policy Concern” at the Royal Flemish Academy of Belgium for Sciences and the Arts in Brussels. Mark presented his work on very low frequency nucleotide variants. Among many excellent presentations were reports on the flora of Wales, on orchids, and on forensic entomology. Thanks to Marc de Meyer, Thierry Backeljau and Pedro Crous for organizing the meeting.

Change in relation to Sloan Foundation

On the 1st of September, Jesse Ausubel and the Alfred P. Sloan Foundation changed their relationship. Jesse is now a Science Advisor to Sloan, responsible for the Deep Carbon Observatory as well as some follow-through on projects for which Sloan grant-making has been completed (Barcode of Life initiative, Encyclopedia of Life, Census of Marine Life). Jesse served successively as a program officer, program director, and vice president for programs of Sloan from 1994 to 2012 and is enormously grateful for the opportunities Sloan has afforded.

Barcode – Frequency Matrix

PHE researcher Mark Stoeckle teamed up with Royal Ontario Museum ornithologist Kevin Kerr on a just-published PLoS ONE paper, “Frequency matrix approach demonstrates high sequence quality in avian BARCODEs and highlights cryptic pseudogenes.” Beginning with the Banbury conferences in 2003, a lot of effort has gone into ensuring the accuracy of DNA barcode databases, both in minimizing sequencing error and in assigning accurate taxonomic labels. How good are the resulting databases? Our paper confirms the high sequence accuracy of the avian BARCODE database (11,000 records), demonstrates significant quality improvement over the past decade, and flags the 0.1% of records that are overlooked pseudogenes (which can be viewed as a kind of taxonomic error).

Bird Barcodes

We are delighted to learn that our 2004 PLoS Biology article, “Identification of birds through DNA barcodes,” is in the top 1% of Article-Level Metrics (ALMs) for all PLoS articles. ALMs are a “comprehensive set of metrics that track the reach and impact of published research” and are readily available for all PLoS articles. The ALMs for 2004 Birds show continuing interest over the past 8 years, with 38,141 article views, 8,516 downloads, and 630 citations in Web of Science.