Vast microbial genetic diversity found in oceans, stimulating new informatics tools

The biological universe is much larger and more diverse than we thought. In three papers in the March 2007 PLoS Biology, scientists report on a genetic survey of microbial diversity in the world’s oceans. A large collaboration, the Global Oceanic Sampling (GOS), led by Craig Venter, analyzed microbial DNA collected by filtering seawater at 250 sites along a several-thousand-kilometer transect from the North Atlantic, through the Panama Canal, around the Galapagos Islands, ending in the Cocos Islands of the South Pacific. The resulting DNA dataset consisted of 6.3 billion base pairs (twice the size of the human genome), with 85% of the assembled and 57% of the unassembled data unique at a 98% identity cutoff. The extreme diversity prevented assembly of complete genomes, as many reads were unique. A comprehensive dataset of GOS sequences combined with pre-existing databases reveals nearly 6.12 million proteins, nearly doubling the number of known proteins. Some families of microbial proteins discovered in this study, particularly protein kinases, were previously thought to be restricted to eukaryotic organisms. Over 1700 sequence clusters show no identity to known families, implying we are far from knowing the full range of what proteins can do.

How to make sense of all this data? First, more data is needed, namely more complete genomes into which the unassembled fragments can be placed. Second, new analytic tools are needed. A new genomics and informatics group based at the California Institute for Telecommunications and Information Technology in San Diego has built a metagenomics version of GenBank, the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (try saying that 3 times quickly!), which is fortunately known by the acronym CAMERA.

Just as Google and other search engines solved a problem of information overload that did not exist a few years ago, I am confident that CAMERA and other new informatics tools will enable us to view the expanding universe of environmental genomics, including DNA barcode libraries, in ways that will provide new understanding.

Routine DNA testing in fish industry to help people and fish

Over 1000 fish species can be legally sold in the United States, a challenge for accurate labelling. Many fish products such as fillets cannot be identified to species, even by experts. DNA surveys suggest that, at least for some expensive species, most fish products are mislabelled. In 2004 Nature 430:309, scientists at University of North Carolina analyzed mtDNA of fish labelled as red snapper, a name which by US law can only be applied to a Caribbean snapper species, Lutjanus campechanus. 77% (17/22) of fish purchased from 9 vendors in eight states were not L. campechanus; most were species from other regions of the world, or could not be identified to species due to lack of reference sequences.

More recently, the availability of commercial DNA testing has enabled enterprising news stations to do their own research. Last year a Florida television station found that 6 of 11 restaurant entrees labeled as local grouper were other species, including Asian catfish and tilapia, and last month a Los Angeles television station reported that red snapper entrees at 4 local restaurants were either tilapia, catfish, or mahi mahi. Following up on the news media, the Florida Attorney General’s office did its own testing, found 17 of 24 restaurants sold entrees mislabeled as grouper, and made legal settlements. What is needed is a widely available method backed up by a reliable reference library that can be routinely applied to identification of fish and fish products in the marketplace. DNA barcoding is designed to be just that.

The Food and Drug Administration (FDA) Regulatory Fish Encyclopedia (RFE) aims “to assist with the accurate identification of species and help federal, state, and local officials and purchasers of seafood identify species substitution and economic deception in the marketplace.”

The species pages include scientific and common names, pictures of whole fish and fish products, analytic gels of fish proteins, and excitingly, an empty space for reference DNA sequence information. For reliable identification, the fish reference library needs comprehensive taxonomic coverage and adequate sampling of variation within species, i.e., DNA barcoding. I believe the Fish Barcode of Life Initiative (Fish-BoL), which has already collected barcodes from over 16,000 specimens representing more than 3500 species, will provide a widely used tool that will benefit consumers and the many species of fish that require management or protection.

Nature celebrates taxonomy as Carl Linnaeus turns 300, discovers DNA barcodes

A dozen articles in the current issue of Nature examine the legacy of Carl Linnaeus, born 300 years ago this May. The wonderful cover illustration shows Carl Linnaeus as a modern field biologist in blue jeans and down vest, holding up a DNA barcode. I am particularly struck by Charles Godfray’s “Linnaeus in the information age,” a wide-ranging, thoughtful, visionary and practical look at how taxonomy might evolve so as to provide the widest benefit to society. His near-term wish list includes “a comprehensive web-based taxonomic and identification resource (morphology plus DNA barcodes) for the world’s macrolepidoptera”, which “would be a wonderful lever for bringing new resources into the field.” A central theme echoed in many of the pieces is the need for taxonomists to join together and create “big science” projects that benefit the many end-users of taxonomic knowledge.

Indomalayan All Birds converge in Singapore

A flock of 22 scientists converged on National University of Singapore on March 8-9, 2007 for a 2-day Indomalayan Organizational Meeting for All Birds Barcoding Initiative (ABBI), including representatives from India, Indonesia, Malaysia, Philippines, Singapore, Sri Lanka, Thailand, and individuals from European and North American museums with active collaborative research programs in this region.

The Indomalayan biogeographic region spans a vast area of tropical biodiversity and includes innumerable islands with high numbers of endemic species. A large-scale genetic survey with DNA barcoding is likely to lead to dramatic increases in species counts in particular and better understanding of biodiversity in general. Additional collecting may be particularly important in this region, as it is at present the least well-represented in frozen tissue collections. There was strong enthusiasm among regional participants, and recognition that the initiative has public appeal and the potential to engage new sources of governmental support.

I look forward to organizational and scientific progress in this exciting region. 

Horizontal genomics with bats and birds

Two papers in early online Mol Ecol Notes report large scale COI surveys of tropical bats and North American birds. In the first paper, Clare et al examined 840 specimens representing 87 (72%) of 121 known bat species in Guyana, each derived from vouchered specimens held at Royal Ontario Museum, including multiple individuals (range 2-74) from 73 (84%) of species. 81 of 87 species had distinct COI barcodes with average intraspecific variation of 0.6%. In the remaining 6 species, 15 distinct mitochondrial lineages were found which likely represent overlooked cryptic species. 
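For readers curious how a figure like "average intraspecific variation of 0.6%" is arrived at, here is a minimal sketch. It assumes the sequences are already aligned to equal length and uses the simple proportion of differing sites (p-distance); barcoding papers often use a model-corrected distance instead, so treat this as an illustration rather than the authors' method. The species names and sequences are made-up toy data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def mean_intraspecific_distance(seqs_by_species: dict) -> dict:
    """Mean p-distance over all within-species pairs, one value per species.

    Species with a single sequence have no pairs and are left out.
    """
    result = {}
    for species, seqs in seqs_by_species.items():
        pairs = list(combinations(seqs, 2))
        if pairs:
            result[species] = sum(p_distance(a, b) for a, b in pairs) / len(pairs)
    return result


# Toy data (hypothetical, far shorter than a real ~650 bp COI barcode).
toy = {
    "species_A": ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"],
    "species_B": ["TTGTACGAAC", "TTGTACGAAC"],
}
means = mean_intraspecific_distance(toy)
```

Averaging the per-species means then gives a single summary number; a species whose within-species distances are unusually large is a candidate for the kind of deep mitochondrial split reported above.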

As most bats are small brown animals that fly around at night emitting noises that humans cannot hear, it is not surprising that some have been overlooked, and it seems probable many new species will be found lurking in museum drawers. Even in relatively bat-poor temperate regions there may be hidden diversity. It was not until 1997 that Europe’s most abundant and best studied bat, the Pipistrelle (Pipistrellus pipistrellus, Schreber 1774) was suggested to be 2 species through DNA analysis, a hypothesis confirmed by biological covariants and official species designation in 1999. 

In the second paper, Kerr et al (I am a co-author) report a continental-scale survey of mtCOI sequences in North American birds, including 2590 individuals from 643 species, representing 93% of the breeding avifauna of Canada and the United States. 94% of species had distinct barcodes, and in the remaining 6%, barcode clusters corresponded to small sets of closely-related species, most of which hybridize regularly. Fifteen (2%) currently-recognized species comprised two distinct barcode clusters, many of which may represent cryptic species.

Birds being conspicuous, vocal, diurnal animals, it is surprising that there are what appear to be overlooked species, even in an intensively-studied temperate region with relatively few species. Of course barcode clusters are not proof of species status, but to my knowledge all such divergent lineages either correspond to recognized species, or have subsequently been found to show biological covariants and have ultimately been granted species status.

Sequencing of large tissue collections housed in museums can be done relatively rapidly and inexpensively. The challenge is how to report results in a way that communicates the genetic findings in a timely fashion without trampling on the careful procedures designed to maintain order in taxonomy.

By using tissues derived from vouchered museum specimens, these barcoding studies lay the groundwork for subsequent taxonomic study. By analyzing a standardized region, DNA barcoding studies can be stitched together to create a large-scale map of biodiversity, a horizontal genomics approach mapping leaves on the tree of life.

I see the “barcode map of genetic diversity” as analogous to an astronomical sky map that uses just a slice of the electromagnetic spectrum. It does not contain all the information necessary to understand the universe, but by focusing on one part of the spectrum it enables results from various studies to be seamlessly combined and allows both large and small scale comparisons.

“Why every protist needs a barcode”

In February 2007 Microbiology Today, scientists report on the Barcoding Protists Workshop held in Portland, Maine in November 2006, which was attended by 40 protist experts from 12 countries (Australia, Canada, Denmark, France, Germany, Japan, Malaysia, Netherlands, Norway, Russia, UK, and USA). The workshop was co-sponsored by the US National Center for Culture of Marine Phytoplankton and the UK NERC Culture Collection of Algae and Protozoa.

According to Williamson et al, “most original descriptions for [over 200,000 named] protist species are based on light microscopy and ink drawings, not only making species identification for some groups an inherently subjective and specialist occupation, but also potentially hiding major genetic diversity.”

Workshop participants agreed unanimously that “to help resolve many of the contradictions and uncertainties in protist taxonomy, genetic barcoding is the way forward, starting with material, particularly type strains, in internationally recognized culture collections.”

COI characters resolve chitons, distances do also

In early online 12 Jan 2007 Mol Ecol Notes researchers from Columbia University, American Museum of Natural History, and California State University analyze COI barcode region sequences of 131 individuals representing 19 species of Mopalia chitons. Chitons are molluscs with flattened segmented shells, and most of the 860 known world species are herbivores that graze in tidal zones, although some are found at depths up to 6000 meters.  According to the authors “much of the biology of [Mopalia sp] remains undiscovered” because many “are difficult to distinguish from one another by morphology alone”, making them a good test case for DNA barcoding.

Kelly et al compared three approaches for identifying Mopalia chitons by COI. First, they used a “character based assessment called characteristic attribute organization system (CAOS)”. In this approach, a “guide tree” is generated using maximum likelihood or parsimony, and CAOS identifies sets of characters for each node in the guide tree. CAOS then attempts to assign unknowns based on these characters. If there is insufficient information to assign the query sequence, CAOS stops the analysis. The authors compared CAOS to neighbor-joining (NJ) distance analysis on the Barcode of Life Data Systems (BOLD) site, and to the BLAST algorithm. All three approaches had overall accuracy of 100% when provided with the entire data set. CAOS was superior to NJ and BLAST when a skeletonized reference set containing 50% of the total sequences was used.
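To give a feel for what "character based" identification means, here is a deliberately simplified sketch. It is not the CAOS algorithm: CAOS works node by node down a guide tree, whereas this flat version just looks for alignment positions where a species carries one fixed base that no other species in the reference set shows. The function names and the toy reference sequences are my own, for illustration only.

```python
def diagnostic_sites(reference: dict) -> dict:
    """Map each species to {position: base} for its diagnostic characters.

    A character is diagnostic here if the base is fixed within the species
    and absent from every other species at that position.
    """
    length = len(next(iter(reference.values()))[0])
    diagnostics = {}
    for species, seqs in reference.items():
        sites = {}
        for pos in range(length):
            own = {s[pos] for s in seqs}
            if len(own) != 1:
                continue  # variable within the species, so not diagnostic
            base = next(iter(own))
            others = {
                s[pos]
                for other, other_seqs in reference.items()
                if other != species
                for s in other_seqs
            }
            if base not in others:
                sites[pos] = base
        diagnostics[species] = sites
    return diagnostics


def assign(query: str, diagnostics: dict):
    """Return the one species whose diagnostic characters all match the query.

    Returns None when no species, or more than one, matches; this mirrors
    the idea of stopping when there is insufficient information.
    """
    hits = [
        species
        for species, sites in diagnostics.items()
        if sites and all(query[pos] == base for pos, base in sites.items())
    ]
    return hits[0] if len(hits) == 1 else None


# Toy aligned reference set (hypothetical sequences).
reference = {
    "sp1": ["AACGT", "AACGA"],
    "sp2": ["TACGT", "TACGT"],
    "sp3": ["CAGGT", "CAGGA"],
}
diag = diagnostic_sites(reference)
```

With this toy set, `assign("AACGC", diag)` picks out sp1 on the strength of a single character at the first position, while a query matching none of the diagnostic bases is left unassigned rather than forced onto its nearest neighbor, which is the practical difference from a distance-based method.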

CAOS automatically identifies diagnostic molecular characters, and this will help integrate DNA barcode data into traditional taxonomy. For practical use, diagnostic sequence differences may aid design of solid-state microarrays that detect species in environmental samples, such as the 0.1mm Mopalia mucosa planktonic larva shown here, which might be found floating in seawater, or in the stomach of a krill.    

Revealing krill diets with DNA

Krill are shrimp-like crustaceans found throughout the world’s oceans. The Antarctic krill, Euphausia superba, is thought to be the most abundant species on the planet in terms of biomass (500 million metric tonnes corresponding to 5 x 10^14 individuals), is a primary food source for whales, seals, and oceanic birds, and functions as a major planetary carbon sink by excreting waste that sinks to the ocean floor. What does this very abundant, central-to-food-web species eat? For many animals, observation of eating behavior is impractical, and analysis of stomach contents or feces may be the only way to determine diet. However, such material may be morphologically unrecognizable.

In August 2006 Marine Biotech 8:686, researchers from University of Tasmania and Department of Environment, Tasmania, compare DNA sequencing and light microscopy in identifying prey in stomach contents of E. superba. Passmore et al isolated DNA from stomach contents of 6 ethanol-preserved krill and, using diatom-specific primers, amplified a 103 bp portion of nuclear small subunit RNA (ssRNA). ssRNA was used because at present it has the best taxonomic representation in GenBank for krill prey species. The researchers sequenced at least 50 clones from each individual krill and found 14 OTUs (operational taxonomic units), with 86% to 100% match to GenBank sequences. These results were compared to microscopic identification of diatom silica skeleton fragments in stomach contents, which involved counting 1000-3000 fragments per individual. Results were similar, although DNA analysis and light microscopy each appeared more sensitive for certain species. This study might be a best case for light microscopy because silica-skeletoned diatoms are not easily digested. As the authors point out, krill also consume a range of protozoa and small zooplankton, and the importance of these sources may be underappreciated. The authors conclude “the application of DNA diet analysis to krill warrants further investigation, particularly for prey that are difficult to study using other methods”.

This work shows the essential need for a comprehensive reference library, so far lacking. A study underway is examining mitochondrial and nuclear genes as barcodes for phytoplankton. Looking ahead, a “massively parallel” pyrosequencing approach could enable rapid and representative analysis of mixed environmental samples, such as stomach contents, without biases resulting from amplification and cloning. 

Tiny barcode identifies food plants, works on 20,000 year old DNA

In early access Dec 2006 Nucl Acid Res, researchers from 9 laboratories in France, Italy, Norway, and Denmark examine a candidate barcode gene for land plants, the group I intron in the chloroplast leucine transfer RNA gene (trnL intron). Prior research has already shown that a simultaneous or tiered multi-gene approach will be needed to distinguish among closely-related land plant species. A project coordinated by Royal Botanic Gardens, Kew aims to identify the best overall approach.

Rather than cracking the tough nut of an ideal plant barcode, Taberlet and co-authors look at a simple approach “emphasizing the point of view of scientists other than taxonomists”, and test this on food plants in archeological and industrial applications. The chloroplast trnL intron is not the most variable non-coding region in chloroplast DNA and does not differ enough to separate many closely-related plant species. On the plus side, there are robust primers which amplify the intron from diverse species. Like other group I introns, the trnL intron sequence has catalytic activity and a conserved secondary structure with alternating conserved and variable sequence domains. Taking advantage of this feature, the researchers designed primers to amplify one of the variable domains, the P6 loop. Binding sites for both the trnL primers, which amplify the entire intron, and the P6 loop primers are “highly conserved among land plants, from Angiosperms to Bryophytes”. Importantly, the P6 loop is only 10 to 143 bp and can be amplified from degraded DNA.

Using “simulated ePCR” with the large GenBank data set, trnL intron and P6 loop sequences identified to species level 67% and 19% of cases respectively. However, in many practical applications, the number of possible species that need to be distinguished is relatively small and they are taxonomically diverse. Following this reasoning, Taberlet et al tested the intron and its P6 loop on a set of 132 species found in the Arctic and 72 species representing the commonest food plants. With Arctic plants, trnL intron and P6 identified to species level 85% and 47%, respectively. With the food data set, the tiny P6 loop was sufficient to identify 78% to species level. The P6 loop was successfully amplified from a 20,000-year old permafrost sample, from human feces, and from various processed foods including detecting potato, leek, and onion DNA in dried soup mix!
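The idea behind a simulated PCR and the "identified to species level" percentages can be sketched in a few lines. This is a rough illustration under strong assumptions: primers must match exactly (real electronic PCR tools typically tolerate some mismatches), the primer and template sequences below are invented and are not the actual trnL or P6 loop primers, and a species counts as identified only if its amplicon is shared with no other species in the set.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_amplicon(template: str, forward: str, reverse: str):
    """Return the sequence between the two primer sites, or None.

    The forward primer is searched on the given strand; the reverse primer
    is searched as its reverse complement downstream of the forward site.
    Primer sequences themselves are excluded from the returned amplicon.
    """
    start = template.find(forward)
    if start == -1:
        return None
    inner_start = start + len(forward)
    end = template.find(reverse_complement(reverse), inner_start)
    if end == -1:
        return None
    return template[inner_start:end]


def fraction_identified(amplicon_by_species: dict) -> float:
    """Share of species whose amplicon is not shared with any other species."""
    counts = Counter(amplicon_by_species.values())
    unique = sum(1 for amp in amplicon_by_species.values() if counts[amp] == 1)
    return unique / len(amplicon_by_species)


# Hypothetical primers and templates, for illustration only.
FWD = "GGGCAA"
REV = "CCATTG"  # binds as CAATGG on the template strand shown below
templates = {
    "plant_a": "TTGGGCAAACGTACCAATGGTT",
    "plant_b": "TTGGGCAAACGTACCAATGGTT",
    "plant_c": "AAGGGCAATTTGACAATGGAA",
    "plant_d": "ACGTACGTACGT",  # no primer sites, so nothing amplifies
}
amplicons = {name: in_silico_amplicon(seq, FWD, REV) for name, seq in templates.items()}
amplified = {name: amp for name, amp in amplicons.items() if amp is not None}
resolution = fraction_identified(amplified)
```

In this toy set only one of the three amplified species has an amplicon of its own, which shows why a short marker resolves far more species in a small, taxonomically diverse candidate list than it does against all of GenBank.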

This is an exciting study, and DNA barcoding will likely have multiple applications in food safety. Whether or not these exact gene regions are adopted, a standardized approach will enable widespread and inexpensive use.

COI solves leech mix-ups

Relying on morphologic species descriptions lacking DNA sequences is like diagnosing patients without laboratory tests. An experienced clinician can often make the correct diagnosis from examination alone, but laboratory tests are frequently needed to confirm or point to other causes. Many advances in medicine reflect incorporating laboratory testing into routine evaluation. Two papers on leeches suggest similar benefits to taxonomy from incorporating mtDNA analysis into routine practice.

In May 2005 Conservation Genetics 6:467 researchers at the American Museum of Natural History analyze morphology, mitochondrial COI, and nuclear ND-I sequences of genus Helobdella leeches from Australia, New Zealand, South Africa, Hawaii, and South America. The tale starts with a leech discovered in Germany in 1985, H. striata, re-named H. europaea in 1987.

Authors Siddall and Budinoff found that H. europaea is morphologically and genetically indistinguishable from a leech “discovered” in Australia in 1998, H. papillornata. Including COI sequences in initial species descriptions would have prevented wasted taxonomic effort, and a species native to South America would probably not be given the unfortunate name europaea.

In December 2006 Evol Devel 8:491, scientists from University of Maryland and University of California, Berkeley, apply COI barcoding to another Helobdella leech, H. robusta, a model organism in developmental biology and one of the lucky species selected for genome sequencing. Researchers Bely and Weisblat obtained leech specimens from laboratory cultures and field sites. They found that isolates thought to be H. robusta actually represent four species, 2 of which co-exist at the same locality. The authors observe “the perils of misidentification and taxonomic confusion in the lab are numerous and costly” and conclude with a call for routine application of DNA barcoding and collection of voucher specimens to confirm identity of laboratory organisms.