DNA Barcoding – Page 29 – The Rockefeller University

The spider and the fly: learning about applying COI to species identification

November 23, 2006

Two recent articles suggest how and how not to learn about applying mtCOI sequences to identifying species. In Zoologica Scripta 2006 35:441 researchers from Koenig Zoological Research Museum, Bonn, analyze 113 specimens of 61 morphologically-defined species of pholcid (daddy long-legs) spiders. Important for this analysis and for future study, collection locations are given and voucher numbers are provided for each specimen and DNA extract.

(Some pholcid spiders vibrate in their webs when disturbed, moving so rapidly they become invisible; here is a wonderful video)

16S and COI sequences were successfully amplified using a single primer pair for each gene from 79% and 80% of specimens, respectively. It is striking that strong clustering within species was observed despite using short segments of mtDNA (COI, 312 bp; 16S, 287 bp), which are less than half as long as the standard 648 bp COI barcode. In NJ trees with either mtDNA sequence, all morphologic conspecifics grouped together and were reciprocally monophyletic (ie no overlaps between species). Likely splits based on large intraspecific distances and differing geographic distributions were observed in 6 (25%) of 24 multiply-sampled morphospecies.

The authors go on to propose graphic and statistical metrics to calibrate how well simple distances can define species limits. They find that mtDNA distances will often diagnose species: “tree-based taxon clustering and statistical taxon analysis indicate that molecular evidence does coincide with morphological hypotheses” and “we disagree with [Meyer and Paulay’s] point that independently of the group of organisms studied, a “barcoding gap” between interspecific and intraspecific distance values would likely disappear in studies featuring both dense within-species sampling and closely related species”, ie distance-based clustering often corresponds to species limits.
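To make the "barcoding gap" idea concrete, here is a minimal sketch in Python. It is not the metric proposed by Astrin et al; it simply compares the largest within-species distance to the smallest between-species distance. The function name, the specimen labels, and the distance values are all hypothetical.

```python
from itertools import combinations


def barcoding_gap(species_of, dist):
    """Return (max intraspecific, min interspecific, gap) from pairwise distances.

    species_of: dict mapping specimen id -> species name
    dist: dict mapping frozenset({id_a, id_b}) -> pairwise distance
    """
    intra, inter = [], []
    for a, b in combinations(sorted(species_of), 2):
        d = dist[frozenset((a, b))]
        # Same species label -> intraspecific, otherwise interspecific.
        if species_of[a] == species_of[b]:
            intra.append(d)
        else:
            inter.append(d)
    max_intra = max(intra)
    min_inter = min(inter)
    return max_intra, min_inter, min_inter - max_intra


# Hypothetical toy data: two specimens each of two made-up species.
species_of = {"s1": "sp_A", "s2": "sp_A", "s3": "sp_B", "s4": "sp_B"}
dist = {
    frozenset(("s1", "s2")): 0.004,
    frozenset(("s3", "s4")): 0.006,
    frozenset(("s1", "s3")): 0.075,
    frozenset(("s1", "s4")): 0.078,
    frozenset(("s2", "s3")): 0.074,
    frozenset(("s2", "s4")): 0.077,
}

max_intra, min_inter, gap = barcoding_gap(species_of, dist)
print(max_intra, min_inter, gap)
```

A positive gap means every within-species distance is smaller than every between-species distance in the sample. Meyer and Paulay's concern is that denser sampling of close relatives could shrink this value toward zero or below.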

This study uses vouchered specimens from known locations and accurate modern sequencing technology, focuses on a relatively small clade (959 known species of pholcid spiders), and analyzes in a positive way how distance measures might be used to define species, helping us learn about DNA barcoding as a tool for species identification.

In another recent study (Syst Biol 55:715, 2006), researchers from National University of Singapore examine COI sequences deposited in GenBank from Diptera (flies, mosquitos, and gnats). They find 449 of the 150,000 known species of dipterans represented, with multiple sequences from 127 species, and analyze these to “test two key claims of molecular taxonomy”. The scientists report that there were often large differences in COI within species and also frequent overlaps between species, such that some sequences were more closely related to those of another species than to conspecifics. The litany of failures is quite long, including “even when two COI sequences are identical, there is a 6% chance they belong to different species”.

I do not understand why the authors put so much effort into analyzing such a heterogeneous set of data, except that they are worried about molecular taxonomy in general and DNA barcoding in particular. To my reading this study suggests that many GenBank records contain errors, either because current morphologic taxonomy is incorrect (for example, the study cited above suggests probable splits in 25% of multiply-sampled pholcid morphospecies), because specimens used for GenBank records are incorrectly identified, or because DNA sequences in GenBank contain errors due to human factors or older sequencing technology. There must be some limitations to COI barcode identification of dipteran species, presumably mostly among closely-related young species, but this study has not shown where such problems might lie.

I hope that future studies will use more of the “best practices” demonstrated in Astrin et al’s study of pholcid spiders and so help us learn more about how to apply COI sequences to species identification.

Galling thrips split by mtCOI

November 16, 2006

Thrips are tiny (0.5 to 2 mm) plant-feeding insects; approximately 4500 species are known, and some are serious agricultural pests. Kladothrips is an Australian genus of at least 35 species which form galls on Acacia trees. In Biol J Linn Soc 2006 88:555 researchers from Flinders University, Australia, apply mtDNA analysis to show that two gall morpho-types of Kladothrips rugosus represent different species.

Originally described in 1907, K. rugosus is widely distributed across south and western Australia. Two gall types were noted, but no morphologic differences could be found in the thrips themselves. McLeish, Chapman, and Mound found pairwise uncorrected mtCOI p-distances were 0.0-0.6% within gall morphotypes, and 7.4-7.8% between, similar to distances within and among other gall thrips species. The authors aver the usual taxonomic distaste for distance measures (“Distance values are not intended as a means of identifying different species here, which is a problematic approach for species depiction, but as useful descriptors of genetic variation”). I translate this as: distance measures can be used to help discover new species, but are verboten in official species descriptions.
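For readers unfamiliar with the term, an uncorrected p-distance is just the fraction of aligned sites at which two sequences differ. A minimal Python sketch follows. The choice to skip sites with gaps or ambiguous bases is my assumption, not necessarily what the authors did, and the short sequences are invented for illustration.

```python
def p_distance(seq_a, seq_b):
    """Uncorrected p-distance: differing sites / compared sites.

    Assumes the two sequences are already aligned to equal length.
    Sites where either sequence is not A, C, G, or T (gaps, ambiguity
    codes) are skipped; that handling is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in valid and y in valid:
            compared += 1
            if x != y:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Invented toy sequences, far shorter than a real COI fragment.
a = "ACGTACGTAC"
b = "ACGTACGTAT"  # differs from a at the last site
c = "ACGT-CGTAC"  # same as a except for one gap

print(p_distance(a, b))  # one difference over ten compared sites
print(p_distance(a, c))  # gap site skipped, no differences remain
```

Multiplying the result by 100 gives the percentage form used in the paper (for example, 7.4-7.8% between gall morphotypes).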

The only morphologic differences are that “abdominal segments I-III are as brown as IV-VII, the metathorax is scarcely paler than the brown mesothorax and prothorax, and the sculptured reticles on the posterior half of tergites II-III are all small and equiangular.” Phew! Not many persons could decipher such abstruse morphologic terminology, whereas DNA-based identification promises more democratic access to species identification. The main limiting factors are technological and likely solvable: establishing reference libraries and developing inexpensive DNA analytic methods.

The authors found a third genetic cluster in K. rugosus, but were unable to discover any morphologic characters, so did not describe this as a new species. This seems scientifically inconsistent, and the authors seem to agree: “This lack of morphologic divergence has evident problems for traditional taxonomy… we suggest that “morpho-taxonomy” is little more than an historical artifact in the methodology of species recognition, despite commonly providing the most practical methods”.

I hope the large data sets emerging from the barcode initiative and other genetic surveys will enable taxonomists to develop consistent methods of species delimitation, whether in thrips or thresher sharks, and the sequences themselves or their diagnostic nucleotide characters will be routinely incorporated into species descriptions.

New data point to need for better theories about species formation

November 7, 2006

Insiders can be mistaken, in science and in other fields. At the beginning of the Human Genome Project, “the great majority of scientists dismissed the original proposal with hostility or indifference” (Great 15-year project to decipher genes stirs opposition. New York Times, June 5, 1990). The Times article details some of the initial negative reactions:

“Even if scientists manage to finish the genome project, it will have generated enormous reams of uninterpretable and often useless data”.

“The human genome project is bad science, it’s unthought-out science, it’s hyped science” said Dr. Martin Rechsteiner, a biochemist at the University of Utah. Some critics have begun aggressive letter-writing campaigns, urging colleagues who harbor similar sentiments to write Congress.

“Everybody I talk to thinks this is an incredibly bad idea,” said Dr. Michael Syvanen, a microbiologist at the Medical School of the University of California at Davis and a stout antagonist of the genome project.

Professional societies weighed in as well. A resolution adopted by the Council of the American Society for Biochemistry and Molecular Biology, and endorsed by the Federation of American Societies for Experimental Biology stated: “A large scale, massive effort to ascertain the sequence of the entire genome cannot be adequately justified at the present time… The Council wants to state in the clearest possible terms our opposition to any current proposal that envisions the establishment of one or a few large centers that are designed to map and/or sequence the human genome.” https://www.fasebj.org/cgi/reprint/1/6/502

This history comes to mind in reading the article by Hickerson, Meyer, and Moritz in October 2006 Syst Biol 55:729. According to their analysis, mathematical modelling predicts that DNA barcoding will often fail to discover young species. Their analysis is based on a classical model of speciation (Bateson-Dobzhansky-Muller) and “well-established population genetic theory”. I should tread lightly here, not being a population biologist! To my reading, these mathematical models are either unsupported or disproved by experimental evidence. The BDM model of biological species formation is “well-characterized, tractable, and its dynamics captures a range of speciation times implicit across many pre- and post-zygotic isolation models”, ie good for modelling, but is not derived from actual genetic data on differences between sister species. Genetic surveys, including growing barcode libraries, demonstrate limited intraspecific variation in diverse species across enormous differences in population size and generation time, indicating that “well established population genetic theory” does not explain intraspecific mitochondrial diversity (Bazin et al 2006 Science 312:570).

Instead of making predictions about why barcoding will fail, I hope the same mathematical rigor will be applied to understanding why barcoding works as well as it does, why the variation within most species is low, why the distances between most species are large, and what determines the exceptions.

The fastest way forward

November 2, 2006

In October Proc R Soc B, Gomez et al apply DNA barcoding to the cosmopolitan marine bryozoan Celleporella hyalina. Morphologic identification in this genus uses scanning electron microscopy measurements of the 0.2 mm autozooid and its 0.05 mm orifice. To eliminate potential variability associated with colonial development or environmental plasticity, these morphologic measurements are made on cloned F1 progeny grown under controlled laboratory conditions. This example highlights how standard morphologic techniques can be cumbersome and costly, and require highly-trained personnel and expensive equipment. It is unlikely this sort of morphologic identification process can be sped up, while DNA analysis is getting faster, cheaper, and more portable.

The researchers from University of Hull, University of Wales, and Universidad Catolica de la Santisima Concepcion in Chile analyzed mtCOI barcodes in 176 colonies from 33 sites around the globe, revealing at least 10 deeply divergent lineages. Mating compatibility tests in 26 pairwise combinations showed complete reproductive isolation in 23 cases, and 3 were inconclusive due to self-fertilization. Only one of the genetically divergent, reproductively incompatible groups could be reliably separated by morphologic analysis.

It is obviously impractical to do mating studies for routine identification of bryozoans. Instead, standardized genetic analysis, ie DNA barcoding, can first help discover species (as in this case by highlighting lineages that were then subjected to other forms of biological analysis), and then be applied to assign unknown specimens to the newly revealed species. The authors conclude “DNA barcoding clearly identifies biologically meaningful groups in the C. hyalina complex” and speculate that biodiversity is similarly underestimated in other sessile marine invertebrates, including sponges and corals. “Failure to recognize cryptic speciation among sessile benthos therefore may seriously underestimate marine biodiversity as well as impeding attempts to predict the response of marine benthos to environmental change.” I conclude that DNA barcoding is the fastest way forward to help discover and then routinely identify what appear to be the vast numbers of cryptic animal species.

Sponge barcoding on the web

October 26, 2006

The Sponge Barcoding Project https://www.spongebarcoding.org/ aims to barcode all described sponges, about 8,000 species in the phylum Porifera. The initial phase of 3 years will focus on 2,000 species covering all genera.

Sponges are thought to be the earliest living branch on the multicellular animal tree and are difficult even for experts to identify. In addition to their ecological importance, sponges are sources for novel pharmaceuticals and biomaterials (eg Sipkema et al 2005 Biotech Bioengineer 90:201).

Like some corals, some sponges show very few differences in mitochondrial DNA with the standard COI barcode (corals, Shearer et al 2002 Mol Ecol 11: 2475; sponges, Erpenbeck et al 2006 Mol Ecol Notes 6: 550). The latter study suggests that the 3′ end of COI may provide greater resolution for Porifera and Cnidaria. An important goal of the initial phase of the project is to determine the best strategy for obtaining species-level identifications, one that provides sufficient resolution to separate most of the closely-related species and still takes advantage as much as possible of the benefits of standardization on 5′ COI.

I note that in animals closely related sister species are often largely or completely allopatric. In such cases, combining genetic barcode data with GIS coordinates may improve the certainty of some identifications.

For fun, I close with a sponge video:

Worried taxonomists see future, declare it won’t work

October 18, 2006

In 1987, the few dozen GPS models available were mostly larger than 200 cu in and cost $15,000 to $45,000. (https://adsabs.harvard.edu/abs/1988inna.meet..158C) Today there are thousands of models, many for under $100, and GPS technology is cheap enough and small enough to be a commonplace add-on to cell phones and PDAs. At the start, few could have recognized how ubiquitous GPS technology would be. It is now difficult to find paper road maps.

In current Systematic Biology 55:844 (not open access so no article link here), three worried taxonomists opine that DNA barcoding won’t work because it’s too expensive. This is likely incorrect. Any process involving electronics and/or chemicals is likely to become faster, cheaper, and smaller, perhaps dramatically so. Just as with GPS, lowered costs increase use, and increased use helps lower costs. On the DNA front, researchers are developing microfluidic grapefruit-sized sequencers that analyze sub-femtomole quantities of DNA in nanoliter volumes, with proportionally reduced reagent costs. Alternative technologies such as pyrosequencing may be even faster and cheaper.

No one can predict the course or speed of technology development, but betting against technology is short-sighted. The authors Cameron, Rubinoff, and Will do allow that a database of 10 million species might be assembled for about $0.5 billion, which would be “much cheaper than other “big science” initiatives”. Not content, they decide that 10-fold more specimens and 3-fold more genes will be needed and presto, the cost is $15 billion!
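The arithmetic behind the jump from $0.5 billion to $15 billion is easy to check. The short Python sketch below uses only the figures quoted above. The per-species number is my own division of those figures, not a value taken from the article, and the variable names are mine.

```python
# Figures quoted from the article under discussion.
base_cost_usd = 0.5e9        # estimated cost of the database
species_count = 10_000_000   # species covered by that estimate

# Multipliers the authors then apply.
specimen_multiplier = 10     # 10-fold more specimens
gene_multiplier = 3          # 3-fold more genes

# Implied cost per species in the base estimate (simple division).
cost_per_species_usd = base_cost_usd / species_count

# Scaled total after both multipliers are applied.
scaled_cost_usd = base_cost_usd * specimen_multiplier * gene_multiplier

print(cost_per_species_usd)   # dollars per species in the base estimate
print(scaled_cost_usd / 1e9)  # scaled total, in billions of dollars
```

The scaled total is linear in both multipliers, so any fall in per-sequence cost as technology improves lowers the $15 billion figure in the same proportion.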

I believe the costs of assembling the database and of routine applications will be lower than they estimate. In addition, the database will be widely used long before it is “complete”. Looking ahead, taxonomists need to make more clear the scientific and practical value of cataloging, whether by traditional means or DNA-led discovery, the estimated millions of as yet undescribed species, most of which are thought to be insects in tropical regions.

Where will the money come from? The authors worry about effects of the Barcode of Life Initiative on current taxonomy, but their own calculations show there is no need for concern. There simply is not enough money in traditional taxonomy, which they estimate has global support of $10 million/year. To my knowledge all funding for barcoding work so far has come from sources that do not support traditional taxonomy. The situation is similar to the Human Genome Project, which worried many biomedical researchers who feared it would consume all their resources. Partly as a result, the HGP was not funded by National Institutes of Health or National Science Foundation, but by the Department of Energy. The Barcode Initiative will help establish Environmental Genomics as a field of science, will open up new sources of support to do so, and will not subtract from traditional taxonomy.

Limited access limits taxonomy

October 12, 2006

Good news for taxonomic science: “New type of mouse discovered in Cyprus” today made headlines on 193 sites around the world including BBC, Reuters, CNN, ABC, NBC, Fox News, International Herald Tribune, and Associated Press among others. According to the press release from the researchers at University of Durham, UK, “genetic tests confirmed that the new mouse was of a different species”, suggesting once again that, theoretical protests aside, in practice many taxonomists embrace DNA-based “species discovery” as well as DNA-based “species identification”.

Bad news for taxonomic science: To learn more about what the researchers discovered, you have to purchase a journal subscription.

The press coverage of this article demonstrates discovery of new species is of wide public interest, and there are many persons who would want to read beyond the headlines. As it stands, readership is often limited to a small number of specialists, guaranteeing continuing obscurity for taxonomic science. Open access for new species descriptions could help increase visibility and willingness to fund taxonomic science.

Test flying DNA barcode identification

October 6, 2006

Collisions between birds and airplanes, known as birdstrikes, are an expensive hazard for civilian and military aircraft. Identification of birdstrike specimens enables bird management near airfields and is essential for improvements in aircraft safety design. Forensic ornithology laboratories (see for example, Laboratory for Feather Remains Identification in Tel Aviv) have relied on microscopic examination of feather barbules. Identification of birdstrikes through DNA barcoding seems likely to prove a reliable, reproducible, and rapid alternative. Here I try test flying a barcode approach, and compare it to a GenBank BLAST search.

This simulation tries out what barcode identification might be like once reference libraries are established, and corresponds to “species identification” (vs species discovery) in last week’s post. A sequence was selected from Barcodes of Life Data Systems (BOLD) (130,000 COI barcode sequences from 19,000 species so far) and pasted into the public “Identification Engine” on the BOLD home page.

Voila! A probable identification with a disclaimer of infallibility, a list of the top 20 closest matches, and a graphic display of the closest 100 in the database. One more click creates a neighbor-joining tree with species names and collection sites (in the tree at left, species clusters are numbered, and the species and site names are omitted).
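The general idea of matching a query against a reference library can be sketched in a few lines of Python. This is not how the BOLD Identification Engine works internally, which I have not examined. It is a toy ranking by fraction of mismatched sites, and the function names, species labels, and sequences are all invented.

```python
def mismatch_fraction(seq_a, seq_b):
    """Fraction of sites that differ between two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def identify(query, reference, top_n=3):
    """Return the top_n closest reference records as (distance, species) pairs.

    reference: list of (species, sequence) tuples, a stand-in for a
    barcode reference library.
    """
    ranked = sorted(
        (mismatch_fraction(query, seq), species) for species, seq in reference
    )
    return ranked[:top_n]


# Invented toy library: labels and sequences are placeholders only.
reference = [
    ("species_1", "ACGTACGTACGT"),
    ("species_1", "ACGTACGTACGA"),
    ("species_2", "ACGTTCGTACCA"),
    ("species_3", "TCGAACGTTCGA"),
]
query = "ACGTACGTACGT"

for distance, species in identify(query, reference):
    print(f"{species}\t{distance:.3f}")
```

With several reference specimens per species, the ranked list shows at a glance whether the closest matches all carry the same name, which is the situation in which a probable identification is most convincing.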

Skipping over to the Rock Pigeon Columba livia page at the All Birds Barcoding Initiative (ABBI) website reveals a Google map of specimen locations.

So far the BOLD database contains sequences of 24 (8%) of the 309 Columbiformes (pigeons and doves) with an average of 4 specimens per species. More contributions will establish a comprehensive reference library.

A GenBank BLAST search with the C. livia COI sequence also shows C. livia as the closest match, but only a few closely-related birds. All COI sequences in BOLD are or will presumably be deposited in GenBank, but to date many are not yet public. For a more robust comparison, I tried a C. livia cytochrome b sequence, as cytb has historically been favored by vertebrate biologists (and COI by those studying invertebrates). The C. livia cytb sequence naturally matches most closely with C. livia, with C. rupestris as the sister species, the same pattern as with COI (in tree at left, C. rupestris is species 2). It is also possible to draw a NJ tree with results of the BLAST search.

There are two obvious differences in the databases. First, GenBank BLAST output including the NJ tree does not show collection sites, which are helpful or essential when assessing variation within and among species. To find this information, one would have to go back to original publications which may be inaccessible or not include this data, and many sequences are deposited without any published reference.

Second, in GenBank most species are represented by a single sequence. One of the strongest benefits of the barcode initiative, for those interested in population biology and species-level taxonomy, as well as for reliable identification, will be the collection of barcodes from multiple specimens for each species.

Dueling taxonomists agree: DNA barcoding invaluable for species identification

September 27, 2006

In October 2006 Conservation Biology, Rob DeSalle, American Museum of Natural History, comments on April 2006 CB piece on barcoding by Daniel Rubinoff, University of Hawaii (also see earlier Barcode Blog post on this article), and DeSalle’s commentary is followed by reply from Rubinoff. Such fun! DeSalle divides his analysis of DNA barcoding into its utility in “species identification” ie assigning specimens to known species, and “species discovery” ie formal descriptions of new species previously unknown to science. Both researchers cede the field of “species identification” to DNA barcoding (DeSalle: “a species identification system based on DNA sequences [would] be reliable, consistent, and rapid”; Rubinoff: “barcodes could be invaluable for speed and accuracy”).

Establishing reference libraries will be a major scientific achievement on the scale of the Human Genome Project. As these become available, the limiting factors for DNA barcoding will be cost and availability of sequence analysis, but as DeSalle observes, “it is not unreasonable to assume that DNA technology will advance to the stage where field-based diagnostics can be accomplished”.

Since there are about 1.1 million named multicellular animal species, “species identification” is a vast area for scientific research and practical application of DNA barcoding. It seems likely that the 1.1 million known species include most of the more abundant and wide-ranging species, and most that are of direct economic or scientific importance to humans. It is generally believed there are many more undescribed species than have already been named. These may be largely rarer species with limited distributions (see earlier post on “rare microbial biosphere”). Population sizes and ranges in the undescribed biosphere, together with measures of genetic diversity (see last week’s post), might be interesting research areas.

The rest of the exchange centers on the role of DNA barcoding in “species discovery” ie formal descriptions of species previously unknown to science. Here both turn cautious, reserving an essential role for expert judgement. (DeSalle: “DNA sequence information in the absence of other corroborating evidence can never be used by itself as an indicator of species delimitation”. Rubinoff: “my opposition is to the practice by which species are known and identifiable only through a DNA barcode”.)

It may be that it takes a taxonomist to recognize a new species and that it is essential to use an integrated approach combining morphology and ecology together with DNA sequence data. However, I am struck that in practice taxonomists often apply a “DNA-first” approach and that an untrained person could recognize most of the sequence clusters that correspond to species. A Google search with “new species” and “dna” turns up dozens of reports in which DNA sequence differences are the first and strongest evidence for cryptic species, including the 2 new shark species shown below.

It is surprising that the published description of the new shark species shown above did not include any DNA sequence data!

I close with a question about distance vs character comparisons of sequence data. Both authors assume that “distances” cannot be part of species descriptions, “characters” are needed. I wonder if this is a scientific fact, or one that reflects the social norms of taxonomy. As discussed in recent posts, neighbor-joining distance comparisons show most animal species as tight clusters, distinct from those of other species. Once a reference library is established, why not use distance clustering as a diagnostic, eg “species X COI barcodes lie within cluster 1439”?

Small houses on big lots

September 21, 2006 (updated August 7, 2024)

Most animal species correspond to tight clusters of mtDNA distinct from those of closely-related species. In real estate terms, most species are small houses on big lots. Small houses because intraspecific variation in mtDNA is generally low, and big lots because distances between species are generally large.

Two recent posts looked at “house size”, or mtDNA distances within species. The finding of limited variation within most species calls out for research into mitochondrial genetics. Here I examine the other half of what species-level mtDNA maps show: “lot size”, or mtDNA distances between species. This refers to MINIMUM distances between species, ie the genetic distance between a species and its nearest neighbor on the mtDNA map. “Nearest neighbor” is more inclusive, and likely more appropriate for testing speciation/extinction models, than the subset of “sister species” which refers only to the most closely-related species pairs. Species without close relatives, and species whose closest relative belongs to another sister species pair are usually omitted from compilations of sister species.
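As a concrete illustration of "lot size", the Python sketch below finds each species' nearest neighbor and the minimum distance to it from a table of pairwise distances. It is not the BOLD nearest neighbor software. The function name, specimen labels, and distance values are hypothetical.

```python
def nearest_neighbor_distances(species_of, dist):
    """Map each species to (nearest other species, minimum distance to it).

    species_of: dict mapping specimen id -> species name
    dist: dict mapping frozenset({id_a, id_b}) -> pairwise distance
    """
    nearest = {}
    ids = sorted(species_of)
    for a in ids:
        for b in ids:
            sp_a, sp_b = species_of[a], species_of[b]
            if sp_a == sp_b:
                continue  # within-species pairs measure house size, not lot size
            d = dist[frozenset((a, b))]
            # Keep the smallest between-species distance seen so far.
            if sp_a not in nearest or d < nearest[sp_a][1]:
                nearest[sp_a] = (sp_b, d)
    return nearest


# Hypothetical toy data: sp_A and sp_B are each other's closest relatives,
# while sp_C has no partner of its own and sits nearest to sp_B.
species_of = {"a1": "sp_A", "b1": "sp_B", "c1": "sp_C"}
dist = {
    frozenset(("a1", "b1")): 0.03,
    frozenset(("a1", "c1")): 0.09,
    frozenset(("b1", "c1")): 0.08,
}

for species, (neighbor, d) in sorted(nearest_neighbor_distances(species_of, dist).items()):
    print(species, neighbor, d)
```

In this toy case sp_C would be dropped from a compilation of sister species pairs, yet it still gets a nearest neighbor distance, which is the sense in which "nearest neighbor" is the more inclusive measure.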

It has long been observed that distances between most animal species are larger than distances within (eg Moore 1995 Evolution 49:718). What is exciting is that there is now enough barcode data to allow scientifically interesting comparisons among groups. For example, the figure below shows average “lot size”, or minimum distance between species, is surprisingly similar in two large assemblages of butterflies and birds (nearest neighbor analysis performed using software and sequence data on Barcode of Life Data Systems (BOLD)).

A potentially fruitful line of inquiry might be to examine nearest neighbor distances among allopatric vs. sympatric species. The distribution of nearest neighbor distances will likely be of interest to those studying birth and death of species (eg Nee 2001. Evolution 55:661). In the histogram of congeneric nearest neighbor distances among skipper butterflies shown at left, it is perhaps surprising the distribution is not a “hollow curve” (eg Scotland and Sanderson 2004. Science 303:643). Which models of speciation are consistent with observed distributions of genetic distances among species?
