The Barcode Blog

A mostly scientific blog about short DNA sequences for species identification and discovery. I encourage your commentary. -- Mark Stoeckle

Biggest tree so far

Phylogenetic tree-building programs are the workhorses of evolutionary analysis. Thus it might be surprising that, given there are at least 1.7 million named species of plants and animals, output trees with over 1000 taxa are exceptional. The primary reason is computational: the number of possible arrangements rises faster than exponentially with the number of input taxa (e.g., for 1000 taxa, ~10^2500 possible trees; Tamura et al 2004), such that standard algorithms, even those that sample only a fraction of “tree space,” are too slow. As a result, the Tree of Life has so far been constructed by concatenating multitudes of trees, each built with relatively small numbers of taxa. This is unsatisfying and possibly unreliable.
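To get a feel for how quickly tree space grows, here is a minimal sketch that counts unrooted, fully bifurcating trees on n labeled taxa using the standard double-factorial formula (2n-5)!!. The function names are my own, and the choice to count unrooted binary trees is an assumption; the exact exponent depends on which trees are counted (rooted or unrooted, for example), so it need not match a published estimate digit for digit.

```python
import math


def unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees on n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        return 1  # with fewer than 3 taxa there is only one topology
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def log10_tree_count(n_taxa: int) -> float:
    """Base-10 logarithm of the same count, safe for very large n."""
    return sum(math.log10(k) for k in range(3, 2 * n_taxa - 4, 2))


if __name__ == "__main__":
    for n in (4, 5, 10, 50, 1000):
        print(f"{n:>5} taxa: about 10^{log10_tree_count(n):.0f} trees")
```

Summing logarithms rather than multiplying out the product keeps the calculation cheap even for tens of thousands of taxa, where the exact integer would run to hundreds of thousands of digits.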

In the May 2009 Cladistics, researchers from Argentina and Sweden report the largest tree to date: 73,060 eukaryotic taxa, essentially everything Goloboff and colleagues could find in GenBank, ranging from algae and protozoans to flowering plants and vertebrates. In addition to size, the analysis has several remarkable features. The tree was constructed from just 13 genes, each of which was sequenced for only a subset of the total (750 to ~20,000 taxa), plus 604 morphological characters that applied across most of the data set. Nearly all (92%) of the cells in the resulting data matrix (73,060 taxa x 9535 characters) were empty due to lack of data. Nonetheless, the parsimony analysis recovered most eukaryotic groups down to the level of order as monophyletic taxa. The analysis used TNT software previously developed (and made publicly available) by Goloboff and colleagues and took about 2.5 months of processor time on 3 desktop computers (total 96 GB RAM, 16 x 3 GHz processors). To manage the flow of data, nearly all steps were automated, from extracting, labeling, and aligning GenBank sequences to analyzing monophyly of groups at various taxonomic levels.
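For a sense of scale, the matrix dimensions and the reported missing-data fraction can be turned into cell counts. This is only back-of-the-envelope arithmetic: the 92% figure is rounded in the post, so the resulting counts are approximate, and the function name is my own.

```python
def matrix_summary(taxa: int, characters: int, missing_fraction: float) -> dict:
    """Approximate cell counts for a taxa x characters data matrix."""
    total = taxa * characters
    empty = round(total * missing_fraction)  # cells with no data (approximate)
    return {"total": total, "empty": empty, "filled": total - empty}


if __name__ == "__main__":
    # Figures quoted in the post; 0.92 is a rounded value.
    summary = matrix_summary(taxa=73_060, characters=9_535, missing_fraction=0.92)
    for name, cells in summary.items():
        print(f"{name:>6}: {cells:,} cells")
```

In other words, out of roughly 700 million cells, only about 56 million actually hold an observation.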

Looking ahead, the authors see the biggest challenges not in tree-building, but in alignment software and “that the sequence information required is simply non-existent, and the morphological information is scanty and fragmentary.” I know that a short segment of a single mitochondrial gene is considered insufficient for phylogeny, but it would be interesting to see what TNT could do with the 40,777 COI sequences from 6,506 fish species (FishBOL), for example. I imagine that even TNT might have trouble analyzing all 603,002 COI sequences of the 57,159 species represented in BOLD (with many more to come). Phylogenetic trees are established as the goal of evolutionary analysis, but we may need alternative methods for analyzing differences and similarities in very large data sets.

This entry was posted on Tuesday, May 26th, 2009 at 9:31 pm and is filed under General.

4 Responses to “Biggest tree so far”

Marcos Mirande Says:
May 27th, 2009 at 9:16 am

Hello:

I do not see the point in analyzing more than one sample of each species, given that the sequenced ones were (I guess) identified by (true and competent) ichthyologists (in the case of fishes, obviously). And at this point, analyzing 6505 seqs in TNT is quite easy if the sequences are aligned; as you noted, the main problem with such analyses is the alignment. Even analyzing the 40777 seqs is easier (or much easier, depending on the structure of the data) than the analysis we did.

However, I was browsing the FishBOL site a little, and I didn’t find the sequences… Are they publicly available?

Cheers; Marcos.

Pablo Goloboff Says:
May 27th, 2009 at 12:23 pm

Dear Mark,

thanks for the post.

One of the things we discuss in the paper is that the missing
entries in our matrix make it run much slower; a lot of the
speed in TNT comes from figuring out data structure, which is
harder with so many missing entries. After this analysis, which
just came out, I’ve tried the 218,348 prokaryotes Morgan Price
posted at http://www.microbesonline.org/fasttree. Even though
there are 2 1/2 times more taxa (when you delete the ca. 23,000
duplicates), completing a Wagner tree plus TBR takes about 24 hours,
even less than for our matrix: fewer characters (1287) and
fewer missing entries, thus providing a more structured data set.
While our matrix swaps at a speed of 130 x 10^6 rearrangements/sec,
the taller but narrower and better structured prokaryote matrix
swaps at about 2700 x 10^6 rearrangements/sec.

So, given that the barcoding data sets would make a narrower matrix,
with few or no missing entries, perhaps it would not be so hard to run
them with TNT. I would be interested in giving it a try, if I get an
aligned matrix. And I know that is a big “if”. Also, to interpret
results, the taxon names would have to be in the proper format
(see our paper for details); visual inspection of the trees is
out of the question.

Just two small inaccuracies in your account: our analysis used about
2.5 months of processor time, but our small cluster had 16
processors; the analysis of each of the two matrices (the combined,
and the molecules alone) was completed in a couple of weeks.

And second, none of our coauthors is German; I guess you meant Steve
Farris. He was born in the U.S. but has lived in Sweden for
more than 15 years. So, that would make “two Swedes” on the team.

Mark Stoeckle Says:
May 27th, 2009 at 7:19 pm

Pablo, thanks for your response. That the TNT algorithm runs faster with more complete data sets (which would apply to DNA barcode data, as you note) is helpful and important to know. Regarding the format of COI barcode sequences in BOLD, at present the downloadable fasta files are aligned and have species names but not names for higher taxonomic levels, although your analysis makes it clear that would be a desirable upgrade. Thanks for the other corrections; entry edited!

Mark Stoeckle Says:
May 27th, 2009 at 7:38 pm

Marcos, regarding fish COI barcode sequences, those publicly available (hopefully many more to be released in the near future) are accessible at the BOLD homepage http://www.barcodinglife.org under “Public Projects.” Regarding multiple sequences per species, I see two reasons to include these. First, given that the primary rationale for compiling barcodes is to enable DNA-based species identification, testing the whole set to document those species that form independent clusters (and to highlight those that do not) is important. Second, for any taxonomic group, identification of some specimens can be difficult even for experts (in fish, larval forms are particularly difficult, for example), plus there are groups with taxonomic uncertainty as to species limits. Also, as in any database, particularly those with multiple submitters, clerical errors may occur. So a comprehensive analysis that flags those sequences that do not form independent, coherent groups with other members of their species is an important recurrent process in curating the database and improving species-level taxonomy.

Contact: mark.stoeckle@rockefeller.edu

About this site

This web site is an outgrowth of the Taxonomy, DNA, and Barcode of Life meeting held at Banbury Center, Cold Spring Harbor Laboratory, September 9-12, 2003. It is designed and managed by Mark Stoeckle, Perrin Meyer, and Jason Yung at the Program for the Human Environment (PHE) at The Rockefeller University.

About the Program for the Human Environment

The involvement of the Program for the Human Environment in DNA barcoding dates to Jesse Ausubel's attendance in February 2002 at a conference in Nova Scotia organized by the Canadian Center for Marine Biodiversity. At the conference, Paul Hebert presented for the first time his concept of large-scale DNA barcoding for species identification. Impressed by the potential for this technology to address difficult challenges in the Census of Marine Life, Jesse agreed with Paul on encouraging a conference to explore the contribution taxonomy and DNA could make to the Census as well as other large-scale terrestrial efforts. In his capacity as a Program Director of the Sloan Foundation, Jesse turned to the Banbury Conference Center of Cold Spring Harbor Laboratory, whose leader Jan Witkowski prepared a strong proposal to explore both the scientific reliability of barcoding and the processes that might bring it to broad application. Concurrently, PHE researcher Mark Stoeckle began to work with the Hebert lab on analytic studies of barcoding in birds. Our involvement in barcoding now takes 3 forms: assisting the organizational development of the Consortium for the Barcode of Life and the Barcode of Life Initiative; contributing to the scientific development of the field, especially by studies in birds; and contributing to public understanding of the science and technology of barcoding and its applications through improved visualization techniques and the preparation of brochures and other broadly accessible means, including this website. While the Sloan Foundation continues to support CBOL through a grant to the Smithsonian Institution, it does not provide financial support for barcoding research itself or support to the PHE for its research in this field.
