DNA Barcoding – Page 27 – The Rockefeller University

Fungal database a Tower of Babel, needs rebuilding

April 27, 2007

Early in Michael Crichton’s 1990 novel Jurassic Park, Dr. Henry Wu, chief scientist at Jurassic Park Research Institute, showing visitors around his facility, displays “the actual structure of a small fragment of dinosaur DNA”. Astute readers pointed out that Dr. Wu’s dinosaur genetic resuscitation project was unlikely to succeed, as the sequence in Crichton’s novel was a fragment of the bacterial plasmid pBR322. They discovered this by feeding the “dinosaur sequence” into the online BLAST software engine, which searches the billions of base pairs of nucleotide sequences deposited in the amazing public resource of GenBank and the other international genetic databases, EMBL and DDBJ.

The power of genetic databases as identification tools rests on the quality of sequences and their annotations. Just as we need regularly updated maps for safe navigation, we need regularly updated genetic databases for accurate identifications.

One of the strengths of GenBank is that it serves as a permanent repository for genetic sequence data. As a result, GenBank is sometimes a permanent repository for faulty data. In a recent PLoS One paper, researchers from Göteborg University and Chalmers University of Technology, Sweden, and University of Tartu, Estonia, examined the taxonomic reliability of the 51,534 fungal internal transcribed spacer (ITS) sequences in the International Nucleotide Sequence Database (INSD, ie GenBank, EMBL, DDBJ). ITS is the most widely used locus for species identification in fungi.

[Image: The Tower of Babel, Pieter Bruegel the Elder, 1563]

The results show a “variegated picture of the taxonomic status of publicly indexed fungal sequences”. Taxonomic coverage is sparse: of the estimated 1.5 million fungi, fewer than 1% (9,684 species) are represented. Taxonomic data is lacking for many sequences (27% are not identified to species level), and most of the species-level identifications are unverifiable (82% are not linked to voucher specimens, 63% are not tagged with specimen country of origin, and 42% are marked as unpublished). Sequence comparisons suggest mislabeling is common (11% show best matches to congeneric but heterospecific sequences, and another 7% match species of a different genus). Overall, 10-21% of the INSD sequences have incorrect or unsatisfactory annotations.

It seems better to start over than to try to revise this Tower of Babel. Nilsson et al conclude “the large body of insufficiently identified fungi in INSD constitutes a silent plea for a wide and generalized sequencing effort of well-identified and -annotated [type] specimens residing in herbaria worldwide.” Toward this end, an All-Fungi Barcoding Initiative Workshop will be held 14-15 May 2007 at the Smithsonian Center for Research and Conservation, Front Royal, Virginia. An international collection of researchers aims to hammer out how to build a reliable database, including which gene(s) should be adopted as standard barcode targets.

So far, DNA-based fungal identifications have primarily used ITS. Other nuclear genes have been used in some studies, including the nuclear large ribosomal subunit, beta-tubulin, and elongation factor 1-alpha. It would be excellent if the fungal barcode database could link directly with those being built around the mitochondrial gene COI, which is effective for resolving most protozoan and metazoan (multicellular animal) species examined so far. In this regard it is exciting that a report by Seifert et al in the 6 March 2007 Proc Natl Acad Sci USA shows that COI provides species-level resolution similar to that for ITS, that amplification was generally straightforward, and that introns in the COI gene were found in only 2 of 370 Penicillium strains.

Resolving leaves and branches?

April 17, 2007

COI barcoding is a standardized approach to identifying species by DNA, helping resolve the “leaves” on the tree of life. Will the growing arrays of COI sequences also help provide insight into evolutionary history, the “branches” of the tree? I am struck that in some cases, simple genetic arithmetic with COI sequences creates trees very similar to modern phylogenies painstakingly created from multiple nuclear and mitochondrial genes, multiple morphologic characters, and exhaustive computerized analysis. Shown at right, a neighbor-joining analysis of COI barcodes of 264 species of North American birds creates a tree that is quite similar to the most recent understanding of relationships among modern birds, with anseriformes (ducks and geese) next to galliformes (grouse and quail) at the top, passeriformes (perching birds) at the bottom, and most of the other established orders appearing as single lineages in between. Most of the families and groupings of families within these orders also match current understanding, including for example, that flycatchers appear as the basal lineage within passeriformes, and a group of New World passerines called nine-primaried oscines appear together at the bottom of the tree. Of course, a systematist would immediately note that the statistical support for these higher-order branches is weak or absent. I should tread lightly here or not at all, since phylogenetics is the province of mathematical experts, but I will plunge ahead anyway and suggest that, if the single gene neighbor-joining distance tree is “right” say 8 times out of 10, then the apparent lack of statistical support for higher order branches is misleading. There are cases where the COI gene tree is incorrect (eg Hajibabaei et al Genome 49:851 2006).
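The “simple genetic arithmetic” behind a distance tree can be sketched in a few lines. The example below is an illustration only, not the analysis used for the bird tree: it computes the uncorrected p-distance (the proportion of aligned sites that differ) for every pair of sequences, which is the kind of pairwise table a neighbor-joining program takes as input. The labels and the 20-base sequences are made up for the example; real COI barcodes are roughly 650 bases long and are usually compared with a corrected distance model.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences have an unambiguous base.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(barcodes):
    """Map each unordered pair of labels to its p-distance."""
    return {(x, y): p_distance(barcodes[x], barcodes[y])
            for x, y in combinations(sorted(barcodes), 2)}


# Hypothetical toy fragments, invented for illustration.
toy = {
    "duck_1": "ACGTACGTACGTACGTACGT",
    "duck_2": "ACGTACGTACGTACGTACGA",
    "wren_1": "ACGAACGTTCGTACCTACGA",
}
matrix = distance_matrix(toy)
for pair, dist in sorted(matrix.items()):
    print(pair, round(dist, 3))
```

In this toy set the two “duck” sequences differ at 1 of 20 sites, while each differs from the “wren” at 3 or 4 sites, so a distance-based tree would join the ducks first.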

I close with a picture inspired by the data. If single gene trees usually correspond to evolutionary history, this implies strong barriers to gene flow arise concurrent with differences in the single gene and are continuously maintained.

Seeing in color

April 10, 2007

A raft of recent articles show the strength and versatility of a standardized genetic approach to identifying species, ie DNA barcoding. Just as color vision helps us rapidly sort objects into unambiguous categories, DNA analysis usually reveals distinct differences among species, including those whose biological differences lie outside the range of our perception, ie cryptic species. I highlight two of the recent articles below.

In the March 20, 2007 Proc Natl Acad Sci USA, researchers at University of Guelph, Canadian Agricultural Department, and University of Pennsylvania apply DNA barcoding to 16 species of apparently generalist parasitoid tachinid flies. Smith et al found 73 distinct mitochondrial DNA lineages among 2,134 flies from the 16 morphospecies. The mitochondrial lineages were supported by collateral ecological differences and, where tested, by independent nuclear gene markers. In an accompanying commentary, Scott Miller, Smithsonian Institution, looks at how DNA barcoding is contributing to the “renaissance of taxonomy” and is “emerging as a cost-effective standard for rapid species identification”.

In the 26 March 2007 Mol Ecol Notes, scientists from the University of Auckland apply DNA barcoding to identifying rat species in Southeast Asia. Geographic variation in mitochondrial DNA of commensal rats provides a window into patterns of human dispersal and migration, but studies are complicated by the presence of multiple rat species in Southeast Asia, and by the difficulty of distinguishing among species in subfossil remains at archeological sites. Robins et al found that COI mtDNA barcodes distinguished most species, even when short DNA fragments of COI were used (such as might be recoverable from subfossil material), and that barcoding was as effective as tree-based methods using COI, cytochrome b, and D-loop sequences. The genetic methods revealed some polytypic and paraphyletic species, suggesting a need for taxonomic revisions in this group.

Vast microbial genetic diversity found in oceans, stimulating new informatics tools

April 3, 2007

The biological universe is much larger and more diverse than we thought. In three papers in March 2007 PLoS Biology, scientists report on a genetic survey of microbial diversity in the world’s oceans. A large collaboration, the Global Ocean Sampling (GOS) expedition, led by Craig Venter, analyzed microbial DNA collected by filtering seawater at 250 sites along a several-thousand-kilometer transect from the North Atlantic, through the Panama Canal, around the Galapagos Islands, ending in the Cocos Islands of the South Pacific. The resulting DNA dataset consisted of 6.3 billion base pairs (twice the size of the human genome), with 85% of the assembled and 57% of the unassembled data unique at a 98% identity cutoff. The extreme diversity prevented assembly of complete genomes, as many reads were unique. A comprehensive dataset of GOS sequences combined with pre-existing databases reveals 6.12 million proteins, nearly doubling the number of known proteins. Some families of microbial proteins discovered in this study, particularly protein kinases, were previously thought to be restricted to eukaryotic organisms. Over 1700 sequence clusters show no identity to known families, implying we are far from knowing the full range of what proteins can do.

How to make sense of all this data? First, more data is needed, namely more complete genomes into which the unassembled fragments can be placed. Second, new analytic tools are needed. A new genomics and informatics group based at the California Institute for Telecommunications and Information Technology in San Diego has built a metagenomics version of GenBank, the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (try saying that 3 times quickly!), which is fortunately known by the acronym CAMERA.

Just as Google and other search engines solved a problem of information overload that did not exist a few years ago, I am confident that CAMERA and other new informatics tools will enable us to view the expanding universe of environmental genomics, including DNA barcode libraries, in ways that will provide new understanding.

Routine DNA testing in fish industry to help people and fish

March 24, 2007

Over 1000 fish species can be legally sold in the United States, a challenge for accurate labelling. Many fish products such as fillets cannot be identified to species, even by experts. DNA surveys suggest that, at least for some expensive species, most fish products are mislabelled. In 2004 Nature 430:309, scientists at University of North Carolina analyzed mtDNA of fish labelled as red snapper, a name that by US law can be applied only to a Caribbean snapper species, Lutjanus campechanus. They found that 77% (17/22) of fish purchased from 9 vendors in eight states were not L. campechanus; most were species from other regions of the world, or could not be identified to species due to lack of reference sequences.

More recently, the availability of commercial DNA testing has enabled enterprising news stations to do their own research. Last year a Florida television station found that 6 of 11 restaurant entrees labeled as local grouper were other species, including Asian catfish and tilapia, and last month a Los Angeles television station reported that red snapper entrees at 4 local restaurants were either tilapia, catfish, or mahi mahi. Following up on the news media, the Florida Attorney General’s office did its own testing, found that 17 of 24 restaurants sold entrees mislabeled as grouper, and made legal settlements. What is needed is a widely available method backed up by a reliable reference library that can be routinely applied to identification of fish and fish products in the marketplace. DNA barcoding is designed to be just that.

The Food and Drug Administration (FDA) Regulatory Fish Encyclopedia (RFE) aims “to assist with the accurate identification of species and help federal, state, and local officials and purchasers of seafood identify species substitution and economic deception in the marketplace.”

The species pages include scientific and common names, pictures of whole fish and fish products, analytic gels of fish proteins, and excitingly, an empty space for reference DNA sequence information. For reliable identification, the fish reference library needs comprehensive taxonomic coverage and adequate sampling of variation within species, ie DNA barcoding. I believe the Fish Barcode of Life Initiative (Fish-BoL), which has already collected barcodes from over 16,000 specimens representing more than 3500 species, will provide a widely used tool that will benefit consumers and the many species of fish that require management or protection.

Nature celebrates taxonomy as Carl Linnaeus turns 300, discovers DNA barcodes

March 20, 2007

A dozen articles in the current issue of Nature examine the legacy of Carl Linnaeus, born 300 years ago this May. The wonderful cover illustration shows Carl Linnaeus as a modern field biologist in blue jeans and down vest, holding up a DNA barcode. I am particularly struck by Charles Godfray’s “Linnaeus in the information age,” a wide-ranging, thoughtful, visionary and practical look at how taxonomy might evolve so as to provide the widest benefit to society. His near-term wish list includes “a comprehensive web-based taxonomic and identification resource (morphology plus DNA barcodes) for the world’s macrolepidoptera”, which “would be a wonderful lever for bringing new resources into the field.” A central theme echoed in many of the pieces is the need for taxonomists to join together and create “big science” projects that benefit the many end-users of taxonomic knowledge.

Indomalayan All Birds converge in Singapore

March 8, 2007

A flock of 22 scientists converged on National University of Singapore on March 8-9, 2007 for a 2-day Indomalayan Organizational Meeting for All Birds Barcoding Initiative (ABBI), including representatives from India, Indonesia, Malaysia, Philippines, Singapore, Sri Lanka, Thailand, and individuals from European and North American museums with active collaborative research programs in this region.

[Image: Asian fairy bluebird]

The Indomalayan biogeographic region spans a vast area of tropical biodiversity and includes innumerable islands with high numbers of endemic species. A large-scale genetic survey with DNA barcoding is likely to lead to dramatic increases in species counts in particular and better understanding of biodiversity in general. Additional collecting may be particularly important in this region, as it is at present the least well-represented in frozen tissue collections. There was strong enthusiasm among regional participants, and recognition that the initiative has public appeal and the potential to engage new sources of governmental support.

I look forward to organizational and scientific progress in this exciting region.

Horizontal genomics with bats and birds

February 24, 2007

Two papers in early online Mol Ecol Notes report large scale COI surveys of tropical bats and North American birds. In the first paper, Clare et al examined 840 specimens representing 87 (72%) of 121 known bat species in Guyana, each derived from vouchered specimens held at Royal Ontario Museum, including multiple individuals (range 2-74) from 73 (84%) of species. 81 of 87 species had distinct COI barcodes with average intraspecific variation of 0.6%. In the remaining 6 species, 15 distinct mitochondrial lineages were found which likely represent overlooked cryptic species.

As most bats are small brown animals that fly around at night emitting noises that humans cannot hear, it is not surprising that some have been overlooked, and it seems probable many new species will be found lurking in museum drawers. Even in relatively bat-poor temperate regions there may be hidden diversity. It was not until 1997 that Europe’s most abundant and best studied bat, the Pipistrelle (Pipistrellus pipistrellus, Schreber 1774) was suggested to be 2 species through DNA analysis, a hypothesis confirmed by biological covariants and official species designation in 1999.

In the second paper, Kerr et al (I am a co-author) report a continental-scale survey of mtCOI sequences in North American birds, including 2590 individuals from 643 species, representing 93% of the breeding avifauna of Canada and the United States. 94% of species had distinct barcodes, and in the remaining 6%, barcode clusters corresponded to small sets of closely-related species, most of which hybridize regularly. Fifteen (2%) of currently-recognized species comprised two distinct barcode clusters, many of which may represent cryptic species.

Since birds are conspicuous, vocal, diurnal animals, it is surprising that there are what appear to be overlooked species, even in an intensively-studied temperate region with relatively few species. Of course barcode clusters are not proof of species status, but to my knowledge all such divergent lineages either correspond to recognized species, or have subsequently been found to show biological covariants and have ultimately been granted species status.

Sequencing of large tissue collections housed in museums can be done relatively rapidly and inexpensively. The challenge is how to report results in a way that communicates the genetic findings in a timely fashion without trampling on the careful procedures designed to maintain order in taxonomy.

By using tissues derived from vouchered museum specimens, these barcoding studies lay the groundwork for subsequent taxonomic study. By analyzing a standardized region, DNA barcoding studies can be stitched together to create a large-scale map of biodiversity, a horizontal genomics approach mapping leaves on the tree of life.

I see the “barcode map of genetic diversity” as analogous to an astronomical sky map that uses just a slice of the electromagnetic spectrum. It does not contain all the information necessary to understand the universe, but by focusing on one part of the spectrum it enables results from various studies to be seamlessly combined and allows both large and small scale comparisons.

“Why every protist needs a barcode”

February 13, 2007

In February 2007 Microbiology Today, scientists report on the Barcoding Protists Workshop held in Portland, Maine in November 2006, which was attended by 40 protist experts from 12 countries (Australia, Canada, Denmark, France, Germany, Japan, Malaysia, Netherlands, Norway, Russia, UK, and USA). The workshop was co-sponsored by the US National Center for Culture of Marine Phytoplankton and the UK NERC Culture Collection of Algae and Protozoa.

According to Williamson et al, “most original descriptions for [over 200,000 named] protist species are based on light microscopy and ink drawings, not only making species identification for some groups an inherently subjective and specialist occupation, but also potentially hiding major genetic diversity.”

Workshop participants agreed unanimously that “to help resolve many of the contradictions and uncertainties in protist taxonomy, genetic barcoding is the way forward, starting with material, particularly type strains, in internationally recognized culture collections.”

COI characters resolve chitons, distances do also

February 4, 2007

In early online 12 Jan 2007 Mol Ecol Notes researchers from Columbia University, American Museum of Natural History, and California State University analyze COI barcode region sequences of 131 individuals representing 19 species of Mopalia chitons. Chitons are molluscs with flattened segmented shells, and most of the 860 known world species are herbivores that graze in tidal zones, although some are found at depths up to 6000 meters. According to the authors “much of the biology of [Mopalia sp] remains undiscovered” because many “are difficult to distinguish from one another by morphology alone”, making them a good test case for DNA barcoding.

Kelly et al compared three approaches for identifying Mopalia chitons by COI. First, they used a “character based assessment called characteristic attribute organization system (CAOS)”. In this approach, a “guide tree” is generated using maximum likelihood or parsimony, and CAOS identifies sets of characters for each node in the guide tree. CAOS then attempts to assign unknowns based on these characters. If there is insufficient information to assign the query sequence, CAOS stops the analysis. The authors compared CAOS to neighbor-joining (NJ) distance analysis on the Barcode of Life Data Systems (BOLD) site, and to the BLAST algorithm. All three approaches had overall accuracy of 100% when provided with the entire data set. CAOS was superior to NJ and BLAST when a skeletonized reference set containing 50% of the total sequences was used.
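To make the idea of character-based identification concrete, here is a much-simplified sketch. It is not the CAOS algorithm: it uses no guide tree and considers only “pure” diagnostic sites, meaning aligned positions where every reference sequence of a species carries one base that no other species carries. The function names, species labels, and six-base sequences are all invented for illustration. As in the description above, the query is left unassigned when the characters do not point to exactly one species.

```python
def diagnostic_sites(reference):
    """Find pure diagnostic sites for each species.

    reference: dict mapping species name -> list of aligned sequences.
    Returns a dict mapping species name -> {position: base} for positions
    where the species is fixed for a base absent from all other species.
    """
    length = len(next(iter(reference.values()))[0])
    result = {}
    for species, seqs in reference.items():
        others = [s for sp, group in reference.items()
                  if sp != species for s in group]
        sites = {}
        for i in range(length):
            bases = {s[i] for s in seqs}
            if len(bases) == 1:  # fixed within this species
                base = next(iter(bases))
                if all(o[i] != base for o in others):  # absent elsewhere
                    sites[i] = base
        result[species] = sites
    return result


def assign(query, diagnostics):
    """Return the one species whose diagnostic bases all match, else None."""
    hits = [sp for sp, sites in diagnostics.items()
            if sites and all(query[i] == b for i, b in sites.items())]
    return hits[0] if len(hits) == 1 else None


# Hypothetical toy reference set, invented for illustration.
reference = {
    "species_A": ["ACGTAC", "ACGTAC"],
    "species_B": ["ACCTAT", "ACCTAT"],
    "species_C": ["TCGTAG", "TCGTAG"],
}
diagnostics = diagnostic_sites(reference)
print(diagnostics)
print(assign("ACCTAT", diagnostics))  # matches species_B's diagnostic bases
print(assign("ACGTAA", diagnostics))  # no species matches, so None
```

Because the output is a list of positions and bases rather than a distance, the diagnostic characters can be reported alongside a species description in much the same way as morphological characters.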

CAOS automatically identifies diagnostic molecular characters, and this will help integrate DNA barcode data into traditional taxonomy. For practical use, diagnostic sequence differences may aid design of solid-state microarrays that detect species in environmental samples, such as the 0.1mm Mopalia mucosa planktonic larva shown here, which might be found floating in seawater, or in the stomach of a krill.

Rockefeller University

Program for the Human Environment

Area of Research: DNA Barcoding