Voucher and collection information in GenBank records

A core tenet of the DNA barcoding initiative, beginning with the first workshops in 2003, is that reference sequences should be linked to vouchered specimens stored in museums, so that data can be re-checked. This also provides visibility to collections. For example, “GenBank DQ433554 Crotophaga ani voucher KU 89123 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial” contains voucher information in the title and the record itself, at least for those who know “KU” refers to the University of Kansas. The GenBank file contains a “LinkOut” to the BOLD page, which spells out the collection name. The GenBank file (and the BOLD record) could also include a “LinkOut” to the museum itself, although I do not find examples of this feature being used.

More generally, is collection information available in GenBank records? Taking birds as an example, there are 475,273 GenBank avian records; eliminating the five most-represented species (Chicken, Turkey, Mallard, Zebra Finch, Fairy Wren) leaves 108,766 sequences, of which about half (48,915) contain the word “voucher.” This sounds promising, but my unscientific sample suggests most entries in the “voucher” field are cryptic designations that do not identify the institution storing the specimen. I tried searching by acronyms for some of the larger collections. Louisiana State University has the largest avian tissue collection in the world with about 40,000 specimens; searching “LSU AND aves[organism] AND voucher” returned only 1,148 records, which seems likely to underrepresent the museum’s contribution. Results for other large collections were mostly higher but still appear implausibly small considering there are 100,000+ avian GenBank records: Burke Museum (UWBM), 3,318; Field Museum (FMNH), 2,593; American Museum of Natural History (AMNH), 1,994; Smithsonian (USNM), 1,920; University of Kansas (KU), 684.
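
To put the search counts in proportion, here is a minimal Python sketch that only redoes the arithmetic on the numbers quoted above. It does not query GenBank. The variable names are mine, and summing the six searches assumes no record matches more than one acronym, so the combined figure is at best an upper bound.

```python
# Counts quoted in the post (GenBank searches at the time of writing).
non_model_avian_records = 108_766
voucher_records = 48_915

# Records returned by "<acronym> AND aves[organism] AND voucher".
hits_by_collection = {
    "UWBM": 3_318,
    "FMNH": 2_593,
    "AMNH": 1_994,
    "USNM": 1_920,
    "LSU": 1_148,
    "KU": 684,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


voucher_share = percent(voucher_records, non_model_avian_records)

# Assumption: the six searches do not overlap, so this sum is an upper bound.
attributed = sum(hits_by_collection.values())
attributed_share = percent(attributed, voucher_records)

print(f"records mentioning 'voucher': {voucher_share:.1f}%")
for acronym, hits in sorted(hits_by_collection.items(), key=lambda kv: -kv[1]):
    share = percent(hits, voucher_records)
    print(f"{acronym:5s} {hits:6,d}  {share:4.1f}% of voucher records")
print(f"six collections combined: {attributed:,} ({attributed_share:.1f}%)")
```

Even taken together, these six large collections account for under a quarter of the records that mention a voucher, which is the gap the paragraph above describes.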

I conclude that researchers and collections will benefit from following the practices promoted by the DNA barcode initiative for GenBank records, including taking advantage of GenBank’s “LinkOut” feature.

www.iBarcode.org: web tools for sequence analysis

In 16 June 2009 BMC Bioinformatics, researchers from University of Guelph report on a web platform for DNA barcode analysis, www.iBarcode.org. The site works with aligned barcode files in standard .fas format, such as those produced by MEGA or BOLD. Registration is not required; the site keeps track of files you have uploaded.

According to authors Singer and Hajibabaei, iBarcode is designed to “allow the user to manage their barcode datasets, cull out non-unique sequences, identify haplotypes within a species, and examine the within- to between-species divergences.” iBarcode provides several clever, easy-to-use tools and I look forward to further refinements.
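
For readers unfamiliar with these operations, here is a minimal Python sketch of what collapsing non-unique sequences and comparing within- to between-species divergence can look like. It is not iBarcode's code. The function names, the uncorrected p-distance, and the toy sequences are all my own assumptions for illustration, and the input is assumed to be already aligned.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    return differing / compared if compared else float("nan")


def haplotypes(records):
    """Collapse (species, sequence) records into unique sequences per species."""
    by_species = {}
    for species, seq in records:
        by_species.setdefault(species, set()).add(seq.upper())
    return by_species


def divergence_summary(records):
    """Return (max within-species distance, min between-species distance)."""
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(p_distance(s1, s2))
    return (max(within) if within else None,
            min(between) if between else None)


# Toy aligned sequences, invented for illustration only.
toy = [
    ("Species A", "ACGTACGTAC"),
    ("Species A", "ACGTACGTAC"),
    ("Species A", "ACGTACGTAT"),
    ("Species B", "ACGAACCTAT"),
]

for species, seqs in haplotypes(toy).items():
    print(species, "haplotypes:", len(seqs))
print("max within, min between:", divergence_summary(toy))
```

When the largest within-species distance is smaller than the smallest between-species distance, as in this toy set, the species are separable by the barcode. Real datasets would normally use a corrected distance such as K2P.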

Lizard mitochondria converge on snakes–why?

In 2 June 2009 Proc Natl Acad Sci USA, researchers from 5 American universities report on convergent molecular evolution among agamid lizards and snakes. In constructing a nuclear and mitochondrial DNA phylogeny of squamates (snakes and lizards), Castoe and colleagues noted their data placed agamid lizards as sister to snakes, rather than within the lizard clade Iguania, as supported by prior work including morphology. The apparently aberrant phylogenetic placement was due to similarity among mitochondrial genomes of agamid lizards and snakes; nuclear genes recovered the established tree. Most of the aberrant signal was in first and second codon positions in protein-coding genes, and thus associated with similarity in predicted amino acid sequences among agamids and snakes. These convergent changes were distributed across all 13 mitochondrial protein-coding genes, but were clustered particularly in COXI and ND1.

The authors conclude that there was an ancient adaptive episode in the ancestors of today’s agamid lizards, which led to a snake-like mitochondrial genome. I note this conclusion is based on analyzing just 2 of the more than 350 species in 52 genera in Agamidae. Are these changes universal in Agamidae? There are 2 more complete agamid mitochondrial genomes in GenBank which could be examined; of additional interest would be to see if the same convergent changes are found in the 253 COI sequences from 88 agamid species in 11 genera in BOLD. As in this study, phylogenetic reconstruction usually involves just a few representatives of each lineage, which means that evolutionary patterns may remain invisible. I expect that BOLD will be an increasingly useful resource to expand the scope of phylogenetic studies utilizing mitochondrial DNA.

The conclusion that these findings represent convergent adaptive evolution is strong, yet it is also puzzling, as at first glance there doesn’t seem to be any special morphological or life-style resemblance between snakes and agamids as compared to other lizards. Perhaps we need to keep an open mind for other seemingly unlikely mechanisms, such as eukaryotic horizontal gene transfer.

Poisonous fish revealed

What fish is that you are eating? This question has many possible answers. Unlike meats, which are derived from a handful of species, most of which are farmed, there are numerous fish sold for human consumption, most of which are wild. The US FDA Regulatory Fish Encyclopedia and the Canadian Food Inspection Agency lists of approved fish and shellfish include approximately 1700 and  660 names, respectively. And yet DNA surveys regularly turn up fish in the marketplace that are not on any regulatory list, as well as mislabeling of those that are listed, suggesting we may not know what we are eating or what fish stocks are being harvested.

In addition to economic and environmental impact, mislabeling can have public health implications. In April 2009 J Food Protection, government and research scientists report on 2 cases of tetrodotoxin poisoning in Chicago, IL resulting from ingestion of soup prepared from mislabeled puffer fish, sold as “monkfish.” Two additional cases were traced to the same supplier and this led to the recall of several thousand pounds of frozen fish. Morphologic examination of leftover parts and DNA testing of the cooked meat implicated Lagocephalus sp., most likely green rough-backed puffer L. lunaris. Unlike most other toxic puffer species, L. lunaris carries tetrodotoxin in muscle as well as organ tissue, making safe preparation impossible. At the time of the study, there were no reference sequences in BOLD for L. lunaris, so the DNA barcode identification was incomplete. It would be of interest to repeat the database searches (as of today GenBank contains 1 L. lunaris COI sequence and BOLD taxonomy browser lists 2), but for some reason the sequences obtained by the researchers were not published.

DNA testing is the only way to identify many of the fish items in the marketplace. I expect that standardized DNA testing (aka DNA barcoding) will play an increasingly important role in helping protect both consumers and fish.

DNA helps reveal bat diets

What do carnivorous animals eat? Predation drives evolution and underlies ecology, yet except for a few easily observed species, it is surprisingly hard to determine what eats what. In June 2009 Mol Ecol, researchers from University of Guelph and University of Western Ontario, Canada, apply DNA testing to help resolve the diet of the Eastern red bat Lasiurus borealis. L. borealis is the commonest tree-roosting bat in North America, ranging from Canada and the United States east of the Rocky Mountains into Central and northern South America. Like other insectivorous bats, L. borealis uses echolocation to detect night-flying insects. Many moth species have evolved “ears” that detect the ultrasonic sounds emitted by bats and exhibit defensive behaviors in response to echolocation signals, making bats and moths an interesting study in predator-prey co-evolution.

Clare and co-workers applied standardized DNA testing to insect parts in faecal samples collected from 56 mist-net trapped bats. Guano samples were frozen for up to 2 y, then soaked in 95% ethanol for 12 h and examined with a dissecting microscope. Prey items including “legs, wings, antennae, eye cases, exoskeletal fragments, eggs” were isolated and stored separately in 96-well plates. DNA extraction, amplification, and sequencing were performed using standard techniques and broad-range insect primers (LepF1/LepR1). COI sequences were compared to the 127,000 reference sequences of North American arthropods in the BOLD database (www.barcodinglife.org) at the time of the study. Test sequences with ≥99% identity to reference sequence(s) and without equivalent similarity to other species in the database were given species-level identifications; those with less than 99% identity to reference sequence(s) were assigned to higher-level taxonomic categories.
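
The identification rule in the last sentence can be written down in a few lines. The Python sketch below is my reading of it, not the authors' pipeline. In particular, treating "without equivalent similarity to other species" as "exactly one species reaches the 99% threshold" is an assumption, and the function name, input format, and species names are hypothetical.

```python
SPECIES_THRESHOLD = 99.0  # percent identity, as described in the post


def assign(matches, threshold=SPECIES_THRESHOLD):
    """Classify one query sequence from its database matches.

    matches: list of (species_name, percent_identity) pairs.
    Returns ("species", name) when exactly one species reaches the
    threshold (assumed reading of the rule), otherwise
    ("higher taxon", None) to signal a coarser identification.
    """
    best_by_species = {}
    for species, identity in matches:
        previous = best_by_species.get(species, 0.0)
        best_by_species[species] = max(identity, previous)
    qualifying = [s for s, i in best_by_species.items() if i >= threshold]
    if len(qualifying) == 1:
        return ("species", qualifying[0])
    return ("higher taxon", None)


# Hypothetical match lists for three prey fragments.
print(assign([("species X", 99.6), ("species Y", 97.2)]))  # one clear match
print(assign([("species X", 99.6), ("species Y", 99.4)]))  # ambiguous
print(assign([("species X", 98.0)]))                       # below threshold
```

Only the first fragment gets a species-level name; the other two fall back to a higher-level category, as in the study.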

Clare et al obtained sequence data from 89% of 896 arthropod fragments; 78% of these were identified to species or genus level (the remaining 22% showed sequence similarity to bacteria or fungi, or were unidentifiable or chimeric), with a total of 127 prey species identified (125 insects, mainly Lepidoptera including a number of economically important pest species, and 2 spiders). The “molecular scatology” approach documented greater diversity in prey species than prior studies based on morphologic analysis. Most prey were identified only once, with an average of 3.5 species per guano sample. Surprisingly, “more than 60% [of recovered insects] appear to have ears capable of hearing the echolocation hunting calls of L. borealis.” The authors speculate the abundance of eared moths might reflect bats hunting around streetlights, as moths in such brightly-lit environments are thought to use daytime predator-avoidance strategies rather than nocturnal responses to echolocation. There was a notable absence of arctiid and tortricid moths, given their local abundance, suggesting these moths may have alternative predator-avoidance strategies.

This study documents the diversity of L. borealis prey, and hints at how much more we will learn from broad application of standardized DNA analysis to food chains, including such unexpected findings as possible disruptive effects of man-made lighting on local ecosystems.

Biggest tree so far

Phylogenetic tree-building programs are the workhorses of evolutionary analysis. Thus it might be surprising that, given there are at least 1.7 million named species of plants and animals, output trees with over 1000 taxa are exceptional. The primary reason is computational: the number of possible arrangements rises faster than exponentially with input taxa (eg for 1000 taxa, ~10^2500 possible trees; Tamura et al 2004), such that standard algorithms, even those that sample a fraction of “tree space,” are too slow. As a result, so far the Tree of Life has been constructed by concatenating multitudes of trees each built with relatively small numbers of taxa. This is unsatisfying and possibly unreliable.
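
To give a feel for how quickly tree space grows, here is a short Python sketch that counts tree shapes using the standard textbook formula for unrooted, fully bifurcating trees on n labelled taxa, (2n-5)!!. Choosing unrooted bifurcating trees is my assumption. The counting convention behind the figure cited above is not stated in this post, so the exponents need not match exactly.

```python
import math


def unrooted_trees(n):
    """Number of unrooted, fully bifurcating trees for n labelled taxa.

    Uses the double factorial (2n-5)!! = 3 * 5 * ... * (2n-5).
    """
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


def log10_unrooted_trees(n):
    """Base-10 logarithm of the same count, safe for very large n."""
    return sum(math.log10(k) for k in range(3, 2 * n - 4, 2))


for taxa in (4, 5, 10, 20, 50):
    print(f"{taxa:3d} taxa: {unrooted_trees(taxa):.3e} trees")

# For 1000 taxa only the order of magnitude is practical to report.
print(f"1000 taxa: about 10^{log10_unrooted_trees(1000):.0f} trees")
```

Ten taxa already allow about two million trees, and for 1000 taxa the count runs to thousands of digits, which is why exhaustive search is out of the question.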

In May 2009 Cladistics, researchers from Argentina and Sweden report on the largest tree to date: 73,060 eukaryotic taxa, essentially everything Goloboff and colleagues could find in GenBank, ranging from algae and protozoans to flowering plants and vertebrates. In addition to size, there were several remarkable features. The tree was constructed from just 13 genes, each of which was sequenced for a subset of the total (750 to ~20,000 taxa), plus 604 morphologic characters that applied across most of the data set. Nearly all (92%) of the cells in the resulting data matrix (73,060 taxa x 9535 characters) were empty due to lack of data. Nonetheless, the parsimony analysis recovered most eukaryotic groups down to the level of order as monophyletic taxa. The analysis utilized TNT software previously developed (and made publicly available) by Goloboff and colleagues and took 2.5 months on 3 desktop computers (total 96 GB RAM, 16 x 3 GHz processors). To manage the flow of data, nearly all steps were automated, from extracting, labeling, and aligning GenBank sequences to analyzing monophyly of groups at various taxonomic levels.

Looking ahead, the authors see the biggest challenges not in tree-building, but in alignment software and in the fact “that the sequence information required is simply non-existent, and the morphological information is scanty and fragmentary.” I know that a short segment of a single mitochondrial gene is considered insufficient for phylogeny, but it would be interesting to see what TNT could do with 40,777 COI sequences from 6,506 fish species (FishBOL), for example. I imagine that even TNT might have trouble analyzing all 603,002 COI sequences of the 57,159 species represented in BOLD (with many more to come). Phylogenetic trees are established as the goal of evolutionary analysis, but we may need alternate methods for analyzing differences and similarities in very large data sets.

Jesse awarded an honorary doctorate

Dalhousie University bestows an honorary doctorate on Jesse, really an honor for everyone who has contributed to the work of the ‘Program for the Human Environment’ for the past 20 years. We post Jesse’s Convocation address, titled “Son et lumiere”, discussing environmental dimensions of sound and light.

p.s. On 23 May 2009 Anne McIlroy of the Toronto Globe and Mail reported on Jesse’s address (p. F5):

Oceans speak volumes. Sound spreads widely in the world’s oceans, and the clamour of human activity reaches every cove, says Jesse Ausubel, director of the Human Environment program at Rockefeller University in Manhattan.

“Motors and propellers are noisy; so are jet skis and oil-and-gas exploration. In fact, we make the oceans three decibels noisier each decade”, he says. In a convocation address this week at Dalhousie University in Halifax, Dr. Ausubel proposed turning down the volume for four hours in an International Quiet Ocean Experiment. That would be enough time for thousands of researchers around the world to see how sea creatures respond to pre-industrial noise levels, he says. Would whales, for example, change the frequencies they use to communicate? “If we can quiet things down, would they return to their normal, natural frequency rather than deepening their voices or raising their voices?” he said in an interview.

Dr. Ausubel has experience with ambitious, large-scale scientific projects. He played an important role in creating the Encyclopedia of Life, an online catalogue of the species on Earth, and was also involved in establishing the Census for Marine Life, an international program to chart life in the oceans by 2010.

Scientists from around the world who are interested in his Quiet Ocean Experiment will get together for their first meeting before the end of the year.

Dr. Ausubel acknowledges how difficult it will be to get four noise-free hours. Navies and the world’s maritime industries would have to be on board. “Maybe the time to do it would be Christmas Day,” he says. “We would like to inconvenience people as little as possible”.

Potatoes challenge taxonomists

In 7 May 2009 Amer J Botany, David Spooner, scientist at USDA and University of Wisconsin, applies DNA barcoding to wild potatoes. According to the author, “the taxonomy of sect. Petota [section Petota is a subdivision within genus Solanum which comprises wild and domesticated potatoes] is complicated by interspecific hybridization, introgression, allopolyploidy, a mixture of sexual and asexual reproduction and possible recent species divergences.” As an aside, this one genus Solanum contains over 1500 species, including seemingly diverse plants such as nightshades, horsenettles, tomatoes, and eggplants. While the most speciose bird genera, for example, have fewer than 100 species, Solanum is one of at least 50 plant genera with over 500 species (Pelser et al 2002 Am J Botany). Such large genera are unwieldy for constructing phylogenies and testing DNA-based identification methods. Do they reflect biological differences in rates of speciation among genera, or a lack of phylogenetic knowledge?

The above summary of Petota taxonomy is an understatement of the confusion regarding species boundaries in wild potatoes. For one, the apparent number of taxa seems to be shrinking rapidly: “an account of post-1990 taxonomic decisions of many workers published in Spooner and Salas (2006) reduced the 232 species of Hawkes (1990) to 190, but a taxonomic decision in my laboratory is converging on about 110 species.” Second, experts can be perplexed: “members of the complex are so similar that even experienced potato taxonomists…provided different identifications for identical collections numbers of the Solanum brevicaule complex in fully 38% of cases.” Third, genetic analysis (including multiple studies in the author’s laboratory) has been little help so far: “single- to low-copy nuclear restriction fragment length polymorphism (nRFLPs) and random amplified fragment length (RAPD) data…and amplified fragment length polymorphism (AFLP) data failed to clearly differentiate many wild species in the complex.” Independent work by researchers in the Netherlands (Jacobs et al 2008) similarly documents a challenging lack of concordance between genetics and taxonomy in Petota sp. Jacobs and colleagues performed AFLP analysis (this screens the entire nuclear genome) on 951 accessions representing 196 Petota species. Of the 196 taxa, multiple accessions of species clustered together in 58 cases, 38 formed multiple clusters, and 48 were mixed with accessions of other species. Regarding higher-level groupings, these researchers found absence of support for 4 Petota clades proposed by Spooner and colleagues, and conclude that recent speciation and high levels of hybridization will likely challenge attempts to create a genetic taxonomy of wild potatoes. 

Given the above background, one might guess that a minimalist approach (ie DNA barcoding) using 2 or 3 plastid genes might not distinguish among Petota species whose underlying taxonomy and genetics are so jumbled. Thus I am puzzled why the author went to the trouble of performing this study, and why, having set out to do so, he analyzed only a single plastid region (trnH-psbA spacer) when all recent plant barcoding studies I am aware of are based on a combined analysis of 2 or 3 plastid genes. The author also analyzed a nuclear ITS segment (approximately 800 nucleotides containing ITS1, 5.8S rRNA, and ITS2). This is interesting, although for some reason the phylogenetic analysis looked at the ITS segment and trnH-psbA individually rather than in combination. I believe there is general understanding that a single barcode region will not suffice for distinguishing land plants. Lastly, I am puzzled why only 23 of 63 species analyzed were represented by multiple accessions. The author asserts “many barcoding studies lack robust assessments of intraspecific polymorphism or assessments of all species within a genus that are needed to assess the species-specific nature of barcodes;” as a general criticism I believe this comment is incorrect, but it does apply to the present study.

To summarize the study, 104 accessions of 63 Petota species plus 10 accessions of 9 outgroup species were analyzed (the author does not comment as to whether the selections are drawn from the revised total of 110 Petota species as defined in his laboratory). Regarding ITS, 23 species were represented by more than one accession; of these, 10 species formed monophyletic lineages, which seems surprisingly good species-level resolution for a single marker in plants. With trnH-psbA, 17 species were represented by more than one accession; of these, only 2 formed separate clades (1 of which did not form a distinct clade with ITS); as above, combined analysis was not done. The author dismisses matK on the basis of two previously published sequences for Petota sp. Finally, the trees were built with parsimony rather than neighbor-joining, the latter being the usual first-pass method of looking at barcode data. I find this paper a haphazard assessment of DNA barcoding in a taxonomically intensively studied but poorly understood group.

High rates of horizontal gene transfer in archaea and eubacteria mean that it is not possible to draw clear species boundaries. It may be that relationships among potato species are similarly complex, and that species boundaries are fuzzier than the current taxonomy of morphologically-defined species would suggest. It seems to me that more taxonomic and genetic work is needed on this important group, including better tests of barcoding with combined analysis of 2 or 3 of the standard plastid regions in multiple accessions from a larger number of species. The goal of a standardized minimalist approach to identifying species, including wild potatoes, is important to help move beyond relying on experts alone to identify plant species.