Biggest tree so far

Phylogenetic tree-building programs are the workhorses of evolutionary analysis. Thus it might be surprising that, given there are at least 1.7 million named species of plants and animals, output trees with over 1000 taxa are exceptional. The primary reason is computational: the number of possible arrangements rises faster than exponentially with the number of input taxa (eg for 1000 taxa, ~10^2500 possible trees; Tamura et al 2004), such that standard algorithms, even those that sample a fraction of “tree space,” are too slow. As a result, so far the Tree of Life has been constructed by concatenating multitudes of trees each built with relatively small numbers of taxa. This is unsatisfying and possibly unreliable.
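To give a feel for how quickly tree space grows, here is a minimal sketch in Python. It assumes the standard textbook count of unrooted, fully bifurcating trees for n labeled taxa, (2n-5)!!, which is not necessarily the convention behind the figure quoted above, so the exponents it prints should be read as orders of magnitude only. The function names are my own.

```python
import math

def num_unrooted_trees(n_taxa: int) -> int:
    """Exact count of unrooted bifurcating trees for n labeled taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

def log10_unrooted_trees(n_taxa: int) -> float:
    """log10 of the same count, computed as a sum so large n stays cheap."""
    return sum(math.log10(k) for k in range(3, 2 * n_taxa - 4, 2))

if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50, 100):
        print(f"{n:>4} taxa: about 10^{log10_unrooted_trees(n):.0f} unrooted trees")
```

Each added taxon multiplies the count by a factor that itself grows with the number of taxa, which is why exhaustive search is out of the question well before 100 taxa.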

In May 2009 Cladistics researchers from Argentina and Sweden report on the largest tree to date–73,060 eukaryotic taxa, essentially everything Goloboff and colleagues could find in GenBank, ranging from algae and protozoans to flowering plants and vertebrates. In addition to size, there were several remarkable features. The tree was constructed from just 13 genes, each of which was sequenced for a subset of the total (750 to ~20,000 taxa), plus 604 morphologic characters that applied across most of the data set. Nearly all (92%) of the cells in the resulting data matrix (73,060 taxa x 9535 characters) were empty due to lack of data. Nonetheless, the parsimony analysis recovered most eukaryotic groups down to the level of order as monophyletic taxa. The analysis utilized TNT software previously developed (and made publicly available) by Goloboff and colleagues and took 2.5 months on 3 desktop computers (total 96 GB RAM, 16 x 3 GHz processors). To manage the flow of data, nearly all steps were automated, from extracting, labeling, and aligning GenBank sequences to analyzing monophyly of groups at various taxonomic levels.

Looking ahead, the authors see the biggest challenges not in tree-building, but in alignment software and in the fact “that the sequence information required is simply non-existent, and the morphological information is scanty and fragmentary.” I know that a short segment of a single mitochondrial gene is considered insufficient for phylogeny, but it would be interesting to see what TNT could do with 40,777 COI sequences from 6,506 fish species (FishBOL), for example. I imagine that even TNT might have trouble analyzing all 603,002 COI sequences of the 57,159 species represented in BOLD (with many more to come). Phylogenetic trees are established as the goal of evolutionary analysis, but we may need alternate methods for analyzing differences and similarities in very large data sets.

Jesse awarded an honorary doctorate

Dalhousie University bestows an honorary doctorate on Jesse, really an honor for everyone who has contributed to the work of the ‘Program for the Human Environment’ for the past 20 years. We post Jesse’s Convocation address, titled “Son et lumiere”, discussing environmental dimensions of sound and light.

p.s. On 23 May 2009 Anne McIlroy of the Toronto Globe and Mail reported on Jesse’s address (p. F5):

Oceans speak volumes. Sound spreads widely in the world’s oceans, and the clamour of human activity reaches every cove, says Jesse Ausubel, director of the Human Environment program at Rockefeller University in Manhattan.

“Motors and propellers are noisy; so are jet skis and oil-and-gas exploration. In fact, we make the oceans three decibels noisier each decade”, he says. In a convocation address this week at Dalhousie University in Halifax, Dr. Ausubel proposed turning down the volume for four hours in an International Quiet Ocean Experiment. That would be enough time for thousands of researchers around the world to see how sea creatures respond to pre-industrial noise levels, he says. Would whales, for example, change the frequencies they use to communicate? “If we can quiet things down, would they return to their normal, natural frequency rather than deepening their voices or raising their voices?” he said in an interview.

Dr. Ausubel has experience with ambitious, large-scale scientific projects. He played an important role in creating the Encyclopedia of Life, an online catalogue of the species on Earth, and was also involved in establishing the Census of Marine Life, an international program to chart life in the oceans by 2010.

Scientists from around the world who are interested in his Quiet Ocean Experiment will get together for their first meeting before the end of the year.

Dr. Ausubel acknowledges how difficult it will be to get four noise-free hours. Navies and the world’s maritime industries would have to be on board. “Maybe the time to do it would be Christmas Day,” he says. “We would like to inconvenience people as little as possible”.

Potatoes challenge taxonomists

In 7 May 2009 Amer J Botany, David Spooner, scientist at USDA and University of Wisconsin, applies DNA barcoding to wild potatoes. According to the author, “the taxonomy of sect. Petota [section Petota is a subdivision within genus Solanum which comprises wild and domesticated potatoes] is complicated by interspecific hybridization, introgression, allopolyploidy, a mixture of sexual and asexual reproduction and possible recent species divergences.” As an aside, this one genus Solanum contains over 1500 species, including such seemingly diverse plants as nightshades, horsenettles, tomatoes, and eggplants. While the most speciose bird genera, for example, have fewer than 100 species, Solanum is one of at least 50 plant genera with over 500 species (Pelser et al 2002 Am J Botany). Such large genera are unwieldy for constructing phylogenies and testing DNA-based identification methods–do they reflect biological differences in rates of speciation among genera, or a lack of phylogenetic knowledge?

The above summary of Petota taxonomy is an understatement of the confusion regarding species boundaries in wild potatoes. For one, the apparent number of taxa seems to be shrinking rapidly: “an account of post-1990 taxonomic decisions of many workers published in Spooner and Salas (2006) reduced the 232 species of Hawkes (1990) to 190, but a taxonomic decision in my laboratory is converging on about 110 species.” Second, experts can be perplexed: “members of the complex are so similar that even experienced potato taxonomists…provided different identifications for identical collections numbers of the Solanum brevicaule complex in fully 38% of cases.” Third, genetic analysis (including multiple studies in the author’s laboratory) has been little help so far: “single- to low-copy nuclear restriction fragment length polymorphism (nRFLPs) and random amplified fragment length (RAPD) data…and amplified fragment length polymorphism (AFLP) data failed to clearly differentiate many wild species in the complex.” Independent work by researchers in the Netherlands (Jacobs et al 2008) similarly documents a challenging lack of concordance between genetics and taxonomy in Petota sp. Jacobs and colleagues performed AFLP analysis (this screens the entire nuclear genome) on 951 accessions representing 196 Petota species. Of the 196 taxa, multiple accessions of species clustered together in 58 cases, 38 formed multiple clusters, and 48 were mixed with accessions of other species. Regarding higher-level groupings, these researchers found absence of support for 4 Petota clades proposed by Spooner and colleagues, and conclude that recent speciation and high levels of hybridization will likely challenge attempts to create a genetic taxonomy of wild potatoes. 

Given the above background, one might guess that a minimalist approach (ie DNA barcoding) using 2 or 3 plastid genes might not distinguish among Petota species whose underlying taxonomy and genetics are so jumbled. Thus I am puzzled why the author went to the trouble of performing this study, and why, having set out to do so, he analyzed only a single plastid region (trnH-psbA spacer) when all recent plant barcoding studies I am aware of are based on a combined analysis of 2 or 3 plastid genes. The author also analyzed the ITS nuclear gene segment (an approximately 800 nucleotide segment containing ITS1, 5.8S rRNA, and ITS2). This is interesting, although for some reason the phylogenetic analysis looked at the ITS segment and trnH-psbA individually rather than in combination. I believe there is general understanding that a single barcode region will not suffice for distinguishing land plants. Lastly, I am puzzled why only 23 of 63 species analyzed were represented by multiple accessions. The author asserts “many barcoding studies lack robust assessments of intraspecific polymorphism or assessments of all species within a genus that are needed to assess the species-specific nature of barcodes;” as a general criticism I believe this comment is incorrect, but it does apply to the present study.

To summarize the study, 104 accessions of 63 Petota species plus 10 accessions of 9 outgroup species were analyzed (the author does not comment as to whether the selections are drawn from the revised total of 110 Petota species as defined in his laboratory). Regarding ITS, 23 species were represented by more than one accession; of these 10 species formed monophyletic lineages, which seems surprisingly good species-level resolution for a single marker in plants. With trnH-psbA, 17 species were represented by more than one accession; of these only 2 formed separate clades (1 of which did not form a distinct clade with ITS); as above, combined analysis was not done. The author dismisses matK on the basis of two previously published sequences for Petota sp. Finally, the trees were built with parsimony, not neighbor-joining, the latter being the usual first-pass method of looking at barcode data. I find this paper a haphazard assessment of DNA barcoding in an intensively studied but taxonomically poorly understood group.

High rates of horizontal gene transfer in archaea and eubacteria mean that it is not possible to draw clear species boundaries. It may be that relationships among potato species are similarly complex, and that species boundaries are fuzzier than the current taxonomy of morphologically-defined species would suggest.  It seems to me that more taxonomic and genetic work is needed on this important group, including better tests of barcoding with combined analysis of 2 or 3 of the standard plastid regions in multiple accessions from a larger number of species. The goal of a standardized minimalist approach to identifying species, including wild potatoes, is important to help move beyond having only experts being able to identify plant species.

A diversity of open access DNA barcoding articles

The entire May 2009 Mol Ecol Res “Special Issue on Barcoding Life” is open access, thanks to support from Genome Canada and NSERC. As an aside, Mol Ecol Res publisher Wiley-Blackwell, which puts out over 1400 journals, charges $3000 US per article for open access, as compared to, for example, $1300 in PLoS ONE (all articles open access), and $1200 (plus $70/page) for the open access option in Proc Natl Acad Sci USA. If funders mandate open access for publications based on research they support, then either this differential will disappear, or many manuscripts will migrate to lower cost journals. The special barcoding issue is based on the Canadian Barcode of Life Network Scientific Symposium held at the Royal Ontario Museum in April 2008 and includes 27 articles on topics ranging from methodology to applications in creatures great and small including fungi and plants.

Most DNA barcoding analyses look at DNA identification through the lens of established taxonomy, ie how well does sequence data capture the species-level taxonomic categories established by morphologic analysis? In the special issue article “DNA barcoding and the mediocrity of morphology” researchers from York University and University of Guelph look at the comparison the other way around–how well does morphology identify the sorts of specimens that can be distinguished by DNA-based methods, barcoding in particular? In Packer and colleagues’ analysis, morphology comes up short “in numerous important situations such as the association of larvae with adults and discrimination among cryptic species.” Taking an example not entirely at random, the authors analyze a key to Agathidium genus slime mold beetles co-authored by a sometime skeptic of barcoding (Miller and Wheeler, 2005) (this key made popular news as 3 of the newly described beetles were named in tribute to then current US government officials–A. bushi, A. cheneyi, A. rumsfeldi). As is common in keys to insect identification, the reliance on adult male characters, usually genitalia, means that females and immature forms often cannot be identified to species (for the 3 USG namesakes, the key states “female not examined” and there is no description of immature forms). Again typical of insect keys, there is no documentation of intraspecific variation in diagnostic characters (for A. cheneyi, “the holotype is the only specimen examined of this species”). As a result, Packer and colleagues note “the morphological equivalent of the barcode gap that enables molecular identification of species cannot be calculated using traditional approaches, and the sample size of illustrations upon which measures of intraspecific variation might be estimated usually averages one per species with zero variance.”
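For readers unfamiliar with the “barcode gap,” a rough sketch of how it might be computed for sequence data follows. It assumes pre-aligned sequences of equal length and uses simple p-distance (proportion of differing sites); the species labels, sequences, and function names are toy values of my own, not data from Packer and colleagues. Note that when every species is represented by a single specimen, the situation the authors describe for morphological keys, there are no within-species comparisons and the gap cannot be calculated at all.

```python
from itertools import combinations

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites that differ between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def barcode_gap(records):
    """records: list of (species_label, aligned_sequence) pairs.

    Returns (smallest between-species distance) minus (largest within-species
    distance), or None if either kind of comparison is missing.
    """
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(p_distance(s1, s2))
    if not within or not between:
        return None  # e.g. one specimen per species: no estimate of variation
    return min(between) - max(within)

# Toy, made-up data: two specimens from each of two hypothetical species.
toy = [
    ("species A", "ACGTACGTAC"),
    ("species A", "ACGTACGTAT"),
    ("species B", "ACGAACCTAC"),
    ("species B", "ACGAACCTAC"),
]

if __name__ == "__main__":
    print("barcode gap:", barcode_gap(toy))
    print("one specimen per species:", barcode_gap([toy[0], toy[2]]))
```

A positive value means the closest pair of different species is still further apart than the most divergent pair within any one species.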

I hope that future keys for slime mold beetles will include DNA barcode sequences. This will enable anyone, scientists and public alike, with access to a DNA sequencer to identify A. cheneyi adults of both sexes, larvae, fragments in the guts of predators, and perhaps eggs in random leaf litter samples.

Coaxing DNA out of ancient insects and sediments

Deep space telescopes gather light from the early universe, providing pictures of the unimaginably remote past. What about the biological universe–can we peer back in time? Geochemical evidence suggests life on Earth arose about 3.5 billion years ago, fossils reveal what life looked like as far back as 3.0 billion years, and important fossil discoveries across that whole span of time continue to be made. What about DNA? As Carl Woese first realized, DNA sequences of living organisms contain signatures of their evolutionary relationships, and enable reconstructing history as far back as the origin of replication, even before cells and DNA. At the near end of the time scale, recovery of DNA from historical samples can help identify organisms that lived hundreds, thousands, tens of thousands, or even, in a few cases so far, hundreds of thousands of years ago.

In April 2009 PLoS ONE ten researchers from university centers in Denmark, United Kingdom, United States, Canada, Russia, and New Zealand report on non-destructive recovery of diagnostic DNA from ancient insect specimens. As an aside, PLoS ONE represents an important sea change in scientific publishing. First of all, as described on their website, the journal “features reports of original research from all disciplines within science and medicine. By not excluding papers on the basis of subject area, PLoS ONE facilitates the discovery of the connections between papers whether within or between disciplines.” Second, it puts the judgement of importance in the hands of the scientific community where it belongs:

“Too often a journal’s decision to publish a paper is dominated by what the Editor/s think is interesting and will gain greater readership — both of which are subjective judgments and lead to decisions which are frustrating and delay the publication of your work. PLoS ONE will rigorously peer-review your submissions and publish all papers that are judged to be technically sound. Judgments about the importance of any particular paper are then made after publication by the readership (who are the most qualified to determine what is of interest to them).”

This is so sensible it is surprising it has not happened earlier! There is of course a place for journals like Nature and Science, but I expect that a great deal of scientific publishing will migrate to PLoS ONE, with benefits to the authors and the scientific community.  

Back to the paper. Thomsen and colleagues first tested a non-destructive extraction method (Gilbert et al 2007 PLoS ONE 2:e272) on museum beetle specimens. This involves overnight incubation with gentle agitation in a digestion buffer at 55 °C. Remarkably, the specimens emerged none the worse for wear. The researchers recovered 77-204 bp segments of mtCOI from all of 20 beetles, which were collected as early as 1825 (1/3 were over 100 years old). Using a Bayesian approach that generates taxonomic assignments with probability estimates, these short fragments were sufficient for identification to species in most cases; the remainder could be assigned to family or genus level. The researchers then applied this same technique to insect chitin (exoskeleton) fragments preserved in permafrost dating from about 7,000 to over 47,000 years before present (BP). Here only 3 of the 14 (21%) samples (10,000-26,000 y BP) yielded amplifiable DNA, with Bayesian assignments to family or order level. Although the authors appear to have hoped for higher success, this seems pretty remarkable to me. They speculate that destructive sampling might have produced higher yields.

Saving what might be the best for last, Thomsen and colleagues tested non-frozen sediment samples that lacked visible insect parts collected in New Zealand caves and dated 1800 to 3280 years BP. Using a more or less standard extraction protocol developed by some of the authors (Willerslev et al 2003 Science 300:791), 96 bp fragments of COI (1 beetle, 1 butterfly) were recovered from 2 of 3 samples tested. The authors drily note “although the non-frozen sediment DNA approach involves destructive sampling, it has the advantage that the material is the sediment itself, which is usually abundant, and normally not too valuable to process.”

I conclude that if bits of DNA are preserved in ancient dirt then DNA from the past and present must be all around us. Perhaps single molecule sequencing methods will reveal an even greater abundance and diversity of DNA in environmental samples.

The Jack Rabbit of Depression, or Do economic slumps benefit environment?

Some wonder whether the present economic slump will elicit a change of direction and faster progress in reducing environmental harm.  Jesse Ausubel and Paul Waggoner, assisted by Smriti Rao, examined what happened to USA energy use and emissions during the slumps of the 1930s and after World War II.   A short essay, “The Jack Rabbit of Depression, or Do economic slumps benefit environment?” provides our answer.  Two 40-second animations prepared by Smriti show the year-by-year changes from 1920 to 1940 of energy intensity and carbon emissions.

Keeping a steady course rather than darting about was also a theme of Jesse’s address “Natural Gas and the Jack Rabbit” to the Power South Energy Cooperative on 22 January 2009.

Dinoflagellate diversity revealed by DNA

Peering into the vast diversity of life beyond multicellular eukaryotes (animals, plants, and fungi) is dizzying. In March 2009 Applied Environ Microbiol researchers from University of Connecticut assess dinoflagellate diversity with mitochondrial DNA sequencing. Dinoflagellates are unicellular, often photosynthetic, mostly marine plankton characteristically having two flagella and encased in a segmented hardened exterior. Dinoflagellate blooms are the cause of red tides, and dinoflagellate toxins ingested by fish and shellfish are the cause of ciguatera and paralytic shellfish poisoning. For unknown reasons, some species are bioluminescent when mechanically stimulated, producing glowing displays when perturbed by waves, fish, or kayakers, for example.

As a first step toward creating a reference library, Lin and colleagues compiled mtDNA sequences from 49 dinoflagellate species representing six orders (this included 20 COI and 60 cytochrome b sequences; 12 of the latter were newly obtained in this study). As there are about 2500 named dinoflagellate species, this is a sparsely-populated reference library so far. In addition, there were multiple samples from just 5 species, so intraspecific variation is not yet well-studied. As an aside, I note that most of the published and new sequences were derived from strains maintained at the Provasoli-Guillard National Center for Culture of Marine Phytoplankton (CCMP). There is no explicit mention of CCMP in the paper or GenBank depositions, although a plankton specialist would probably recognize the source from sample designations. More generally, there is no formal documentation of taxonomic identifications (eg collection sources for cultures or photographs for environmental samples and/or the individual who performed identifications). Although this is not unusual in taxonomic papers, it seems to me that identifications should be as well documented as, for example, PCR conditions.

In preparing the reference library, the researchers were unable to develop primers that amplified the barcode region of COI efficiently (ie the primers worked with some species and not others) and instead focused on cytochrome b using a primer pair that amplified a 385 bp segment. The primer difficulty is surprising given that COI is usually more conserved than cyt b (including in dinoflagellates), which should make it easier to design broad-range primers.  

The researchers then analyzed pooled environmental DNA samples prepared by filtering water specimens collected during different months at 3 marine stations in Long Island Sound and at a freshwater retention pond (Mirror Lake) on the University of Connecticut campus. While PCR products from monospecific cultures were sequenced directly, those from environmental samples were first cloned, and then 20 to 50 clones from each water sample were sequenced (total clones analyzed 450). 

Lin and co-workers obtained a large number of distinct haplotypes from the environmental samples; by my inspection of their phylogram nearly all of the clones (>420) were unique. Only a small minority could be assigned to known species or genera. On the technical side, the authors used a complex model of nucleotide substitution (TVM+G) to calculate differences among haplotypes and UPGMA to create trees, so their distance results and trees are not directly comparable to those in most DNA barcoding papers, which use K2P- or p-distances to calculate differences and neighbor-joining to create trees. In any case, according to the authors, the sequence results consistently showed greater diversity than was detected through microscopic analysis, “likely caused by the much higher detection sensitivity of PCR than of microscopic counting and by some genotypes that could not be discriminated morphologically.” The authors conclude “[w]hen a broader cob [cyt b] database becomes available, the taxon-resolving power of this gene would certainly increase.” I hope they or others will also develop efficient primer sets for amplifying COI in addition to cyt b.
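For reference, here is a small sketch of the two distance measures most barcoding papers report, p-distance and Kimura 2-parameter (K2P) distance, computed for a pair of pre-aligned sequences. It is my own illustration rather than anything from Lin and colleagues (who used TVM+G), and it simply skips any site where either sequence has a gap or ambiguous base, which is one of several reasonable conventions. K2P is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions.

```python
import math

# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def p_and_k2p(seq_a: str, seq_b: str):
    """Return (p-distance, K2P distance) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption of this sketch)
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    P = transitions / sites
    Q = transversions / sites
    term1, term2 = 1 - 2 * P - Q, 1 - 2 * Q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    k2p = -0.5 * math.log(term1 * math.sqrt(term2))
    return P + Q, k2p

if __name__ == "__main__":
    p, k2p = p_and_k2p("ACGTACGTAC", "ACGTACGTAT")
    print(f"p-distance = {p:.4f}, K2P = {k2p:.4f}")
```

For closely related sequences the two measures are nearly identical; K2P grows faster than p-distance as divergence increases because it corrects for multiple substitutions at the same site.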

Looking ahead, the reference library can be augmented relatively inexpensively by analyzing mtDNA sequences of the 2400 strains at CCMP. However, the mtDNA diversity in this study suggests dozens of new species from just 4 sampling sites around Connecticut, implying the global total of undescribed species is very large. This suggests a need for some sort of “automated species identifier”: a machine approach that would sort samples into individual cells, then photograph, sequence, apply MOTU-type analysis, for example. In the meantime, it may be necessary to work with pooled sequences from environmental samples, as is done for bacterial communities, without attempting to delineate species.
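As a toy illustration of what the “MOTU-type analysis” step might look like, the sketch below groups aligned sequences into molecular operational taxonomic units by single-linkage clustering: any two sequences within a chosen p-distance threshold end up in the same cluster. The threshold, the sequences, and the function names are all assumptions of mine for illustration; real MOTU pipelines differ in distance model, linkage rule, and cutoff.

```python
from itertools import combinations

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites that differ between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def cluster_motus(seqs, threshold):
    """Single-linkage clustering; returns sorted lists of sequence indices."""
    parent = list(range(len(seqs)))  # union-find forest, one node per sequence

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if p_distance(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up sequences standing in for environmental clones.
clones = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "GGGGGGGGGG"]

if __name__ == "__main__":
    # The 0.1 cutoff is arbitrary and chosen only so the toy data form 3 groups.
    print(cluster_motus(clones, threshold=0.1))
```

Comparing every pair of sequences is fine for a few hundred clones but scales quadratically, so very large environmental data sets would need a smarter approach.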