The Barcode Blog

A mostly scientific blog about short DNA sequences for species identification and discovery. I encourage your commentary. -- Mark Stoeckle


Phylogenetically diverse COI dataset extends evidence that rare variants are often errors

In October 2012 (Nature 490:535), Breen and colleagues reported on amino acid variation among 13 mitochondrial proteins and 2 nuclear proteins, based on alignments of 3,000-53,000 sequences representing 1,000 to 14,000 species. They found that, on average, a given site in a protein accommodates 9 different amino acids. Based on the distribution of variants, they concluded that epistasis (interaction among genes) strongly constrains molecular evolution.

Here Kevin Kerr and I re-analyze their large COI dataset [19,000 sequences (8,300 human); 4,700 species], generously provided by senior author Fyodor Kondrashov. Our aim is to determine whether the frequency matrix approach we applied to avian BARCODEs (PLoS ONE 2012, e43992) can be used to identify errors in a more phylogenetically diverse dataset. As the authors note, sequencing error is a potential confounder for their analysis; they used a different approach to assess error than the one we present here.

Brief methods. COI nucleotide alignment opened in MEGA, translated using the appropriate translation table (~95% of the COI dataset is insects or vertebrates), and exported to Excel; frequencies calculated at each amino acid position, and amino acid letter sequences converted into amino acid frequencies. For this analysis we defined rare variants as amino acids present in fewer than 0.02% (1/5000) of sequences. In this dataset, rare variants comprised about half (46%) of the total amino acid diversity. For the analyses illustrated below, we excluded the 8,281 human sequences, which had very few (8) rare variants.
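For readers who prefer a script to a spreadsheet, the frequency matrix step described above can be sketched in a few lines of Python. This is a minimal illustration, not the MEGA/Excel workflow we actually used. The function names are hypothetical, the input is assumed to be a list of equal-length, already translated and aligned protein strings, and the choice to count gaps and "X" in the denominator but never flag them as rare is an assumption made here for simplicity.

```python
from collections import Counter

RARE_THRESHOLD = 1 / 5000  # 0.02%, the cutoff stated in the post


def position_frequencies(aligned_proteins):
    """Return one dict per alignment column mapping amino acid -> frequency."""
    n = len(aligned_proteins)
    length = len(aligned_proteins[0])
    freqs = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_proteins)
        freqs.append({aa: c / n for aa, c in counts.items()})
    return freqs


def to_frequency_rows(aligned_proteins, freqs):
    """Replace each amino acid letter with its frequency at that position."""
    return [[freqs[i][aa] for i, aa in enumerate(seq)]
            for seq in aligned_proteins]


def rare_variants(aligned_proteins, freqs, threshold=RARE_THRESHOLD):
    """List (sequence index, position, amino acid) for residues below the cutoff.

    Assumption: unknown residues ("X") and gaps ("-") are never flagged.
    """
    hits = []
    for s, seq in enumerate(aligned_proteins):
        for i, aa in enumerate(seq):
            if aa not in "X-" and freqs[i][aa] < threshold:
                hits.append((s, i, aa))
    return hits


# Toy alignment (made-up data): 6,000 identical sequences plus one variant.
toy = ["MAL"] * 6000 + ["MTL"]
toy_freqs = position_frequencies(toy)
toy_hits = rare_variants(toy, toy_freqs)
```

In the toy alignment the single "T" at the second position has a frequency of 1/6001, which falls under the 1/5000 cutoff, so it is the only residue reported. With real data, `to_frequency_rows` gives the per-sequence frequency profile that makes low-frequency residues easy to spot.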


As observed with avian BARCODEs, rare variants in this dataset were less common in newer sequences, consistent with improved sequence quality over time.


Rare variants were associated with low-quality sequences, that is, those with internal N's, which generate unknown "X" amino acids.

Lastly, a thought experiment applying the error rate from our PLoS ONE paper suggests that significant artifactual amino acid diversity is expected when error rate × dataset size is equal to or greater than 1, a condition that may be met by large datasets, particularly those containing older sequences, as in this COI alignment.
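The arithmetic behind this thought experiment can be sketched as follows. The numbers are purely illustrative: the error rate of 1 in 10,000 per amino acid site and the round dataset size of 10,000 sequences are placeholders chosen to make the product equal 1, not the values from our paper or from this alignment. The sketch also assumes that errors occur independently across sequences.

```python
def expected_errors_per_site(error_rate, n_sequences):
    """Expected number of erroneous residues at one alignment position."""
    return error_rate * n_sequences


def prob_at_least_one_error(error_rate, n_sequences):
    """Chance that at least one sequence has an error at a given position,
    assuming errors are independent across sequences."""
    return 1 - (1 - error_rate) ** n_sequences


# Hypothetical inputs, for illustration only.
error_rate = 1e-4      # placeholder per-site amino acid error rate
n_sequences = 10_000   # placeholder dataset size

expected = expected_errors_per_site(error_rate, n_sequences)
p_any = prob_at_least_one_error(error_rate, n_sequences)
print(f"expected errors per site: {expected:.2f}")
print(f"P(at least one error at a site): {p_any:.2f}")
```

Under these assumptions, once the product reaches 1 a typical alignment position is more likely than not to carry at least one erroneous residue, and each such residue can appear as a spurious rare variant.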

These results reinforce our published observation that a frequency matrix approach is a useful tool for detecting error in large datasets. We hope that others will use this approach.

Regarding the findings of Breen and colleagues, our re-analysis suggests that error makes a greater contribution to amino acid diversity in this dataset than the authors calculated, although the main conclusion of their paper regarding epistasis would likely be unchanged.



This entry was posted on Friday, January 18th, 2013 at 7:22 pm and is filed under General.


About this site

This web site is an outgrowth of the Taxonomy, DNA, and Barcode of Life meeting held at Banbury Center, Cold Spring Harbor Laboratory, September 9-12, 2003. It is designed and managed by Mark Stoeckle, Perrin Meyer, and Jason Yung at the Program for the Human Environment (PHE) at The Rockefeller University.

About the Program for the Human Environment

The involvement of the Program for the Human Environment in DNA barcoding dates to Jesse Ausubel's attendance in February 2002 at a conference in Nova Scotia organized by the Canadian Center for Marine Biodiversity. At the conference, Paul Hebert presented for the first time his concept of large-scale DNA barcoding for species identification. Impressed by the potential for this technology to address difficult challenges in the Census of Marine Life, Jesse agreed with Paul on encouraging a conference to explore the contribution taxonomy and DNA could make to the Census as well as other large-scale terrestrial efforts. In his capacity as a Program Director of the Sloan Foundation, Jesse turned to the Banbury Conference Center of Cold Spring Harbor Laboratory, whose leader Jan Witkowski prepared a strong proposal to explore both the scientific reliability of barcoding and the processes that might bring it to broad application. Concurrently, PHE researcher Mark Stoeckle began to work with the Hebert lab on analytic studies of barcoding in birds. Our involvement in barcoding now takes 3 forms: assisting the organizational development of the Consortium for the Barcode of Life and the Barcode of Life Initiative; contributing to the scientific development of the field, especially by studies in birds; and contributing to public understanding of the science and technology of barcoding and its applications through improved visualization techniques and preparation of brochures and other broadly accessible means, including this website. While the Sloan Foundation continues to support CBOL through a grant to the Smithsonian Institution, it does not provide financial support for barcoding research itself or support to the PHE for its research in this field.