The Barcode Blog

A mostly scientific blog about short DNA sequences for species identification and discovery. I encourage your commentary. -- Mark Stoeckle

Worried taxonomists discover quality control

In the 9 September 2008 issue of Proc Natl Acad Sci USA, researchers from Brigham Young University and University of South Carolina report that nuclear pseudogenes, if not excluded from analysis, can confuse COI DNA barcoding studies. To my reading, this study reiterates a well-understood hazard and proposes remedies that are already standard in most phylogenetic DNA work, including DNA barcoding.

Pseudogenes, first described by Jacq, Miller, and Brownlee in 1977, are non-functional genes that presumably arose from ancient duplication events and subsequent loss of function through accumulation of mutations. In sequencing studies, pseudogenes of protein coding genes are usually easily distinguished from their functional counterparts as they harbor insertions, deletions, and/or point mutations that interrupt the reading frame.
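As a rough illustration of the reading-frame check described above, here is a minimal Python sketch that looks for in-frame stop codons in a COI fragment. It assumes the invertebrate mitochondrial genetic code (NCBI translation table 5, in which TAA and TAG are the only stop codons and TGA codes for tryptophan), which applies to insects and crustaceans. The function names are hypothetical and are not taken from any barcoding pipeline.

```python
# Stop codons under NCBI translation table 5 (invertebrate mitochondrial).
# TGA is tryptophan in this code, so it is deliberately not listed.
INVERT_MITO_STOPS = {"TAA", "TAG"}


def stop_codon_positions(seq, frame=0, stops=INVERT_MITO_STOPS):
    """Return 0-based nucleotide offsets of stop codons in the given frame."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]


def open_frames(seq):
    """Return the forward frames (0, 1, 2) that contain no stop codon."""
    return [f for f in range(3) if not stop_codon_positions(seq, f)]


def flag_for_review(seq):
    """Flag a sequence when no forward frame is free of stop codons."""
    return not open_frames(seq)


if __name__ == "__main__":
    print(stop_codon_positions("ATGTAAGGC"))  # stop codon at offset 3
    print(flag_for_review("ATGGCATGAACC"))    # TGA is not a stop here
```

A sequence that passes this check is not thereby shown to be functional; the sketch only catches pseudogenes whose reading frame is visibly interrupted.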

Pseudogenes derived from mitochondrial DNA, often called numts (nuclear copies of mtDNA), were first reported by Gellissen et al in 1983. A search of NCBI PubMed for “mitochondrial pseudogenes” shows 282 articles and 12 review articles over the past 25 years.

Song and colleagues analyzed mitochondrial COI sequences in grasshoppers (single individuals of four species representing different Acrididae subfamilies) and cave crayfish (119 individuals of four species in genus Orconectes collected at 56 localities in southeastern US). Most of the analyses involved sequencing of cloned PCR products, which adds a level of complexity and is unlike any DNA barcoding study I am aware of. To skip to the conclusion, the authors emphasize that if numts generated by PCR amplification of mtCOI are NOT excluded, they will confuse DNA barcoding or other phylogenetic studies. Since most of the numts generated in this study were easily recognized, I do not understand why they did so much work (in all they sequenced 125 grasshopper clones and 560 crayfish clones) to reach this sensible but obvious conclusion.

First, grasshoppers. The authors amplified a subsegment of the COI barcode region (439 vs 648 bp in full-length barcode region; shorter amplicons are more likely to represent pseudogenes). The amplified products from the four individual grasshoppers were cloned, and 30 clones/species were sequenced, generating an average of 15 unique haplotypes per species. Of these, 97.3% had stop codons, meaning they could be immediately excluded as not representing true mtCOI sequences.  A full-length barcode sequence was amplified from 1 species, and cloned products yielded 19 paralogues (ie obvious pseudogenes).

Second, crayfish. The researchers amplified the full-length COI barcode region from 172 individuals using Folmer primers. “For 93 individuals, we were able to obtain clean COI sequences; however, 79 individuals from southern populations of O. australis and O. barri yielded ambiguous sequences.” To my reading, the next step would be to stop there and find different primers or PCR conditions that did not generate ambiguous sequences (indicating that more than one COI-like template was being amplified). Instead the authors proceeded to clone products from individuals that yielded ambiguous results and also from those with clean sequences “to determine whether numts were present but not being detected without cloning.” Not surprisingly, they found probable numts in all 4 species of crayfish, and interestingly some of the clones did NOT contain stop codons (ie might be mistaken for functional COI sequences). These apparent numts, which might be easily overlooked, came from the 2 species with ambiguous results on sequencing of uncloned products, which I take as further evidence that it would have been better to develop a different COI amplification protocol, assuming the goal is to accurately determine the barcode sequence.
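The "ambiguous sequences" screen mentioned above can be sketched in a few lines of Python: count base calls that are not a plain A, C, G, or T (for example N or other IUPAC ambiguity codes), and set aside reads where they exceed a cutoff. The function names and the 1% default cutoff are assumptions chosen for illustration, not values reported by Song and colleagues.

```python
UNAMBIGUOUS = set("ACGT")


def ambiguous_fraction(seq):
    """Fraction of base calls that are not A, C, G, or T (gaps ignored)."""
    seq = seq.upper().replace("-", "")
    if not seq:
        return 0.0
    return sum(base not in UNAMBIGUOUS for base in seq) / len(seq)


def is_clean(seq, max_fraction=0.01):
    """True when ambiguous calls stay at or below the (assumed) cutoff."""
    return ambiguous_fraction(seq) <= max_fraction


if __name__ == "__main__":
    reads = {"sample_a": "ACGTACGTACGT", "sample_b": "ACGTNNRYACGT"}
    for name, read in reads.items():
        status = "clean" if is_clean(read) else "ambiguous, re-amplify"
        print(name, round(ambiguous_fraction(read), 3), status)
```

Reads that fail a screen like this are the ones for which alternative primers or PCR conditions would be the natural next step.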

Among other quality control standards in Barcode of Life Database (BOLD), COI sequences with stop codons, such as found in most pseudogenes in this study, are automatically flagged, signalling the researcher to re-check the data.

Finally, it may be that some of what the authors call numts instead reflect heteroplasmy, ie differences among individual mitochondrial DNAs. Like static noise generated when you turn the volume up all the way, cloning is likely to reveal various mutations in some of the 10^17 or so mitochondrial genomes present in eukaryotic organisms. Looking ahead, it seems to me that the authors have missed an opportunity to contribute protocols or sequences that could be applied by other researchers to DNA barcoding of grasshoppers or crayfish.

This entry was posted on Friday, September 5th, 2008 at 8:21 pm and is filed under General.

2 Responses to “Worried taxonomists discover quality control”

  1. L’importance de la prise en compte des NUMTs dans le barcoding at Le blog du Doc’ Says:

    […] The Barcode of Life, et Song et coll., Many species in one: DNA barcoding overestimates the number of species when […]

  2. Hugo Mejia-Madrid Says:

    I found a paper that reviews numts in Acanthocephalans and was not reviewed by Song et al. The paper appears in International Journal for Parasitology 36 (2006) 247–254 by Benesh et al. I found the protocol much more convenient than the one in Song et al.


About this site

This web site is an outgrowth of the Taxonomy, DNA, and Barcode of Life meeting held at Banbury Center, Cold Spring Harbor Laboratory, September 9-12, 2003. It is designed and managed by Mark Stoeckle, Perrin Meyer, and Jason Yung at the Program for the Human Environment (PHE) at The Rockefeller University.

About the Program for the Human Environment

The involvement of the Program for the Human Environment in DNA barcoding dates to Jesse Ausubel's attendance in February 2002 at a conference in Nova Scotia organized by the Canadian Center for Marine Biodiversity. At the conference, Paul Hebert presented for the first time his concept of large-scale DNA barcoding for species identification. Impressed by the potential for this technology to address difficult challenges in the Census of Marine Life, Jesse agreed with Paul on encouraging a conference to explore the contribution taxonomy and DNA could make to the Census as well as other large-scale terrestrial efforts. In his capacity as a Program Director of the Sloan Foundation, Jesse turned to the Banbury Conference Center of Cold Spring Harbor Laboratory, whose leader Jan Witkowski prepared a strong proposal to explore both the scientific reliability of barcoding and the processes that might bring it to broad application. Concurrently, PHE researcher Mark Stoeckle began to work with the Hebert lab on analytic studies of barcoding in birds. Our involvement in barcoding now takes 3 forms: assisting the organizational development of the Consortium for the Barcode of Life and the Barcode of Life Initiative; contributing to the scientific development of the field, especially by studies in birds; and contributing to public understanding of the science and technology of barcoding and its applications through improved visualization techniques and preparation of brochures and other broadly accessible means, including this website. While the Sloan Foundation continues to support CBOL through a grant to the Smithsonian Institution, it does not provide financial support for barcoding research itself or support to the PHE for its research in this field.