Blog

CoML – 10 Years!

Fifteen years after its conception by Jesse Ausubel and Fred Grassle, the
scientific community presented the First Census of Marine Life on 4
October in London. For an overview of the newly released materials,
visit the CoML portal or the site of the news release. Jesse served as
lead editor of the Highlights report.

For a more personal view of the program, read Jesse’s poem,
The Census of Marine Life is about the total richness of the sea,
which serves as the foreword to the new book, Life in the World’s
Oceans: Diversity, Distribution, and Abundance, A. McIntyre (ed.),
Wiley-Blackwell, 2010.

For a view of Jesse’s early vision of the program,
see JH Ausubel. The census of marine life: Progress and prospects. Fisheries 26(7): 33-36, 2001,
and JH Ausubel. Toward a Census of Marine Life. Oceanography 12(3): 4-5, 1999.

The achievements of the community are extraordinary.
The books by Paul Snelgrove, Alasdair McIntyre, and Nancy Knowlton
and the National Geographic map reporting the Census are printed.

So are the 64-page Highlights report and its 1600-word summary,
translated into 10 languages. The greatly enhanced OBIS portal is up and now contains what/where
records for over 120,000 species. The valid names in the
Register of Marine Species now exceed 200,000.
The Encyclopedia of Life has pages with vetted content for more than
90,000 species, and you can make EOL an Encyclopedia of
Marine Life simply by going to its Preferences
tab and highlighting “marine species” in the content
settings / browse classification box. Marine barcoders have DNA
identifiers for about 35,000 species. Scores of papers are appearing
in the PLoS CoML Collections,
and almost all these papers will shortly have embedded links from
species names to the relevant species page in the Encyclopedia of
Life. The overview paper for the NRIC collection in PLoS One
has already been viewed more than 5,300 times.

Galatee’s Oceans film is an incomparable emblem for marine life. It
has so far grossed more than $80 million globally and thus ranks
as the 4th most successful documentary of all time.

The performance stems from great ideas and determined implementation.
Every one of the 14 field projects flourished, as well as the History
and Futures projects and OBIS. The National and Regional
Implementation Committees performed superb studies and rooted the
Census in many more locales. The Education and Outreach Team, Mapping
and Visualization Team, and Synthesis Group multiplied the value of
everyone else’s work. The Scientific Steering Committee and
Secretariat managed an effort of enormous complexity with endless
energy, wisdom, and focus.

The Census has far exceeded our expectations. It has gratified both
through the accomplishment of tasks we anticipated and through
wonderful surprises.

Ground beetles join in

On September 25, 2010, BOLD passed 1 M barcode records, and the International Barcode of Life (iBOL) was officially launched in Toronto, Canada, with a goal of 5 M records representing 500 K species in 5 years, the largest biodiversity genomics project to date. In terms of DNA sequencing, the iBOL targets (5 x 10^6 barcodes x 650 bp/barcode = 3.3 x 10^9 bp) are equivalent to the Human Genome Project (human genome = 3.4 x 10^9 bp). However, whereas HGP involved sequencing DNA samples from a few individuals, the DNA barcode library is built by thousands of scientists examining thousands of individual specimens, one by one. So a big challenge is obtaining, identifying, tracking, processing, and preserving millions of specimens.
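The comparison with the Human Genome Project is simple arithmetic, and it can be checked in a few lines. This is only a sketch using the figures quoted above; the constant and function names are my own, and 650 bp is the approximate barcode length used in the post.

```python
# Figures quoted in the post; names are illustrative, not from iBOL.
IBOL_TARGET_BARCODES = 5_000_000   # 5 M records in 5 years
IBOL_TARGET_SPECIES = 500_000      # 500 K species
BARCODE_LENGTH_BP = 650            # approximate length of one COI barcode
HUMAN_GENOME_BP = 3.4e9            # genome size used in the post


def ibol_total_bp(n_barcodes=IBOL_TARGET_BARCODES, bp_per_barcode=BARCODE_LENGTH_BP):
    """Total base pairs sequenced if every target barcode is obtained."""
    return n_barcodes * bp_per_barcode


total_bp = ibol_total_bp()
ratio_to_genome = total_bp / HUMAN_GENOME_BP
records_per_species = IBOL_TARGET_BARCODES / IBOL_TARGET_SPECIES

print(f"iBOL target: {total_bp:.2e} bp")
print(f"Share of one human genome: {ratio_to_genome:.0%}")
print(f"Average records per species: {records_per_species:.0f}")
```

The target also implies an average of 10 barcode records per species, which is a reminder that the effort lies in handling specimens rather than in the sequencing volume.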

What are recent arrivals to the library? For one example, in the current Frontiers Zool, researchers from Germany and the US (I am a co-author) report on DNA identification of Central European ground beetles (family Carabidae). This family comprises “no less than an estimated 40,000 described species that inhabit all terrestrial habitat types from the sub-arctic to wet tropical regions,” making identifications a challenge for taxonomists and non-specialists alike. Raupach and colleagues successfully amplified and sequenced COI barcodes and nuclear ribosomal DNA expansion segments D3, V4, and V7 from 344 specimens representing 75 species in 28 genera (average 4 specimens/species, range 2-13). Most specimens were preserved in 96% alcohol for 1-2 years; some were stored as dry pinned specimens for up to 12 years. COI resolved 73 (97%) species, whereas the 3 nuclear markers individually resolved a smaller proportion, 81% (D3), 57% (V4) and 87% (V7), and combining the 3 nuclear markers gave 95% discrimination. The one species pair with shared COI haplotypes also showed identical nuclear markers. Two species exhibited distinct COI clusters (intra-specific p-distances 2.7%, 3.8%), 1 of which also had distinct nuclear haplotypes.
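For readers unfamiliar with the term, a p-distance is simply the proportion of aligned sites at which two sequences differ. The sketch below shows that calculation and the 73-of-75 resolution figure. The toy sequences are made up for illustration and are not from the paper, and skipping sites with gaps or ambiguous bases is my assumption about a reasonable convention, not a description of the authors' method.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped
    (an assumed convention for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy 10-bp "barcodes" differing at one site (hypothetical data).
toy_distance = p_distance("ACGTACGTAC", "ACGTACGTAT")

# Species resolution as reported: 73 of 75 species resolved by COI.
coi_resolution_percent = round(100 * 73 / 75)

print(f"toy p-distance: {toy_distance:.1%}")
print(f"COI resolution: {coi_resolution_percent}%")
```

On a real barcode of about 650 bp, a p-distance of 2.7% corresponds to roughly 18 differing sites.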

To my knowledge, this is the first taxonomic paper with a “Klee diagram” depicting indicator vector correlations among COI barcode sequences. As developed by mathematician Larry Sirovich and his colleague Yu Zhang (Sirovich et al PLoS ONE 2010), indicator vectors are digital representations of DNA sequences that “preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information density single-page displays” such as the Klee diagram shown here. According to the BOLD Taxonomy Browser, there are DNA barcodes for 495 carabid beetle species so far, so I look forward to more of the remaining 39,505 or so species joining the barcode library, and dream of a comprehensive indicator vector/Klee analysis of the ground beetle family.

In closing, professional and non-professional insect specialists alike may enjoy the recently released film “Beetle Queen Conquers Tokyo” by Jessica Oreck, a lyrical look at beetle and insect fanciers in Japan.

Don’t barcode alone

Barcoding is a standardized approach to DNA-based species identification. The essence of standardization is an agreement among researchers and practitioners to rely on one or a few defined gene region(s). Standardization makes it possible for researchers to work together to build comprehensive sequence libraries; it is simply not possible for any single group of researchers to collect and analyze the millions of specimens needed to establish a widely useful reference database. And, looking at the application side, standardization enables species-level identifications without having to know in advance what taxonomic group the specimen belongs to. The standard regions so far are the Hebert 2005 COI segment for animals and defined segments of matK and rbcL for land plants. Agreeing on standard barcode regions is a social as well as scientific process, and achieving consensus on COI and matK/rbcL is a major achievement.

In March 2010 Hydrobiologia, researchers from the Indian Institute of Science Education and Research-Kolkata, India, and Plymouth Marine Laboratory, England, report on new primers for amplifying 18S rRNA as a means of barcoding marine nematodes from environmental samples. As background, nematodes are an enormous phylum of mostly tiny and often parasitic worms, including important human, plant, and animal pathogens, and comprise many deeply divergent lineages, which makes species-level identification challenging. Despite their ubiquity, diversity, and biological importance, I imagine that most persons are unfamiliar with nematodes. Small subunit (SSU) rRNA (also known as 18S rRNA) is the backbone for nematode molecular phylogeny (Holterman et al 2006). For species-level identification, to my knowledge no single standard has emerged (Blaxter et al 2005, De Ley 2005). SSU/18S rRNA often does not distinguish among species, and so far it has been difficult to reliably amplify the COI barcode region from nematodes, presumably due to sequence diversity at primer binding sites. If not COI, then standardizing on a nematode barcode will involve researchers agreeing on defined segment(s), perhaps somewhere in the 7.2 kb ribosomal RNA gene complex.

Back to the paper under discussion: Bhadury and Austen compared two 18S rRNA primer sets: one, previously described (by the same authors), which amplifies approximately 345 bp near the 5′ end of the 18S rRNA gene, and a new set, which amplifies a 427 bp segment from near the middle of the gene. According to my analysis, these two amplicons, which each represent about 1/5 of the 18S rRNA gene, have no overlap. Why select a segment of 18S rRNA as a potential barcode, given that full-length 18S is known to show limited species resolution? In the 2006 paper cited above, the authors explored possible DNA barcoding loci, reporting that “further evaluation with the 28S rRNA, 16S rRNA and COI genes was abandoned as a result of unreliable PCR amplification with several representative marine nematode taxa.” Designing broad-range primers for barcoding nematodes is certainly challenging; this 2006 analysis, based on single specimens of 26 nematode species in 13 families, seems too sparse to support useful conclusions. As to species resolution with the 345 bp 5′ 18S fragment, although the abstract states “over 97% of specimens sequenced were correctly assigned,” this turns out to refer to assignments at species OR genus level, and by my reading includes cases that matched to two different genera (with identical 345 bp 18S sequences).

To evaluate the 18S primers, DNA was isolated from two 0.5 g samples of estuary sediment collected in New Jersey, pooled, amplified, and cloned. 60 and 40 clones generated with the first and second primer sets, respectively, were sequenced. From the first set, 16 haplotypes (comprising 45 clones) showed 89-97% BLAST identity with known nematode sequences and the remaining 4 haplotypes (15 clones) were most similar (88-96% BLAST identity) to non-nematode 18S sequences. This led the researchers to design the second set of primers to reduce co-amplification of non-nematode sequences. The second set produced 6 haplotypes (40 sequenced clones); all were similar or identical (90-100% by BLAST) to published nematode sequences.

Designing primers that selectively amplify barcodes from certain taxa is important in some situations, particularly when analyzing mixtures, such as the environmental samples analyzed here, and also to selectively amplify hosts vs parasites, or ingested DNA in stomach contents vs organism, for example. The authors conclude that “the databases…need to be populated with new full-length 18S rRNA nematode sequences from different biogeographic locations.” More data is always good, but it remains to be seen where efforts should be placed. In nematodes there may be a trade-off between having broadly applicable primers and achieving good species resolution; here more exploration is needed. Agreeing on barcode region(s) might help lift nematodes, which likely outnumber insects, from obscurity!

Addendum 10 Sept 2010 4:10 PM: Dan Janzen points out that there is more to barcoding standards than the above might imply. To wit, the COI barcode is a precisely defined 648-bp segment of COI, and, for inclusion in the reference library, barcode sequences need to be accompanied by voucher specimen information, bidirectional trace files with a minimum quality score, and primer sequences.

Bees conduct floristic survey

As in last week’s post on what deep-water sharks eat, DNA-based species identification helps reveal how animals live, not just what species they are. Diet analysis can also provide a survey of what prey/food species are in the local environment. In April 2010 Diversity, researchers from Université Grenoble, France, apply standardized DNA identification targeting the chloroplast trnL intron P6 loop and massively parallel sequencing to examine plant DNAs in honey. The traditional approach for determining the geographic and botanical origins of honey is microscopic examination of pollen, which requires expert training.

As previously described, the trnL P6 intron is tiny (10-143 bp) and has highly conserved flanking sequences, enabling successful amplification of DNA from many or most plants, including from degraded samples. The major disadvantages are relatively low taxonomic resolution, which is improved if sequences are matched to local rather than global flora, and a modest reference library. Interpreting PCR-based results from mixed samples can be complicated, as there may be preferential amplification of some sequences and not others. To my knowledge, this has not been studied for the trnL P6 approach in general or as applied to honey in particular.

Valentini and colleagues extracted DNA from 10 mg samples of honey (one from a commercial “wild flower” blend and one from the local Pyrenean region) using a standard kit (Qiagen), amplified the P6 loop with broad-range primers, and performed pyrosequencing on a Roche Diagnostic G20 system. Different nucleotide sequence tags were applied to the two samples, enabling both to be analyzed in a single pyrosequencing run; the authors point out that tagging could be expanded to enable analyzing hundreds of samples in a single run. A total of 3,671 and 2,191 sequences represented at least 3 times were obtained from Pyrenean and mixed wild flower honey, respectively, which were matched to 22 and 26 plant taxa, respectively. In terms of taxonomic resolution, these were mostly family or generic level assignments: 9 families/subfamilies/tribes, 7 genera, and 6 species (Pyrenean), and 14 families/subfamilies/tribes, 8 genera, and 4 species (mixed wild flower). In both samples, the five most abundant taxa comprised about 75% of total sequences.
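To make the tagging idea concrete, here is a minimal sketch of how tagged reads from a pooled run might be sorted back into samples and then filtered to sequences seen at least 3 times. The tag sequences, sample names, reads, and function names are all hypothetical, and I assume every tag has the same length and sits at the start of the read; the paper's actual pipeline may differ.

```python
from collections import Counter, defaultdict


def demultiplex(reads, tag_to_sample):
    """Assign each read to a sample by its leading tag and count inserts.

    Assumes a non-empty tag table with equal-length tags at the read start.
    Reads with an unrecognized tag are dropped.
    """
    tag_len = len(next(iter(tag_to_sample)))
    per_sample = defaultdict(Counter)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        sample = tag_to_sample.get(tag)
        if sample is not None:
            per_sample[sample][insert] += 1
    return per_sample


def filter_min_count(counts, min_count=3):
    """Keep only sequences represented at least min_count times."""
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Hypothetical tags and toy reads, for illustration only.
tags = {"ACAC": "pyrenean", "GTGT": "wildflower"}
reads = (
    ["ACACTTTGA"] * 3    # kept: seen 3 times in the Pyrenean sample
    + ["ACACCCCGA"]      # dropped: seen only once
    + ["GTGTTTTGA"] * 4  # kept: seen 4 times in the wild flower sample
    + ["NNNNTTTGA"]      # dropped: unknown tag
)

per_sample = demultiplex(reads, tags)
pyrenean_kept = filter_min_count(per_sample["pyrenean"])
wildflower_kept = filter_min_count(per_sample["wildflower"])
print(pyrenean_kept, wildflower_kept)
```

Scaling to hundreds of samples, as the authors suggest, would only mean a longer tag table, though real data would also need tolerance for sequencing errors within the tags.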

Valentini and colleagues note that “several of the plant taxa identified were not the result of nectar collection” (moss, fern, pine), and were presumably due to wind transport from nearby plants. The fern species identified, Athyrium vidalii, which comprised 1.9% of sequences, is distributed in China, Japan, Korea, and Taiwan, evidence for the geographic origin of the honey. Documenting geographic origin of honey products is of commercial interest.

A primary advantage and rationale for DNA barcoding is that standardizing on one or a few regions enables a comprehensive reference library and broadly applicable testing methods. The trnL P6 target utilized in the present study is not part of the published community standard of rbcL + matK targets (A DNA barcode for land plants, Hollingsworth et al PNAS 2009), so it remains to be seen when this will be widely used. In any case, the authors conclude that their method is “fast, simple to implement, more robust than classical methods” and “opens new perspectives in the analysis of honey diversity.” I look forward to learning more!

Knowing the unknowable

350 years ago Anthony van Leeuwenhoek explored the living world around (and within!) him using tiny, powerful, single-lens microscopes. He discovered “tiny animalcules” including what we now know as protozoa and bacteria, and detailed structures of plant and animal tissues. In a similar way, DNA study can reveal features of the living world that would otherwise remain unknown.

In July 2010 Deep-Sea Research (not open access), investigators from the National Institute of Water and Atmospheric Research, New Zealand, report on what deepwater sharks eat. Dunn and colleagues analyzed stomach contents of 194 sharks from 6 species (14-50 individuals per species) collected in bottom trawls on the Chatham Rise, a relatively shallow area and important fishing ground that extends 1000 km east of New Zealand, at depths of 200-800 m (note: at these depths the ocean is nearly dark and does not support photosynthesis). FYI, the sharks studied are Kitefin shark (Dalatias licha), Deepwater spiny dogfish (Centrophorus squamosus), Roughskin dogfish (Centroscymnus owstonii), Deepwater dogfish (Centroselachus crepidater), Lord Plunket’s shark (Proscymnodon plunketi), and Eastern school shark (Galeorhinus galeus).

Perhaps related to the trawl capture method, the individual sharks were relatively small, ranging from 0.38 to 1.6 m depending on species. Prey items were first subjected to morphologic identification, and DNA barcoding (using standard primers for the full-length 650 bp COI barcode) was performed only if items were visually unrecognizable. Of the 118 sharks with non-empty stomachs, 43 (36%) had prey identified by morphology alone, 28 (24%) by DNA alone, 37 (31%) by both, and in 10 (8%) no items were identifiable by either method. In addition to a variety of fish, predominantly Hoki (Macruronus novaezelandiae), the most abundant and commercially most important fish on the Chatham Rise, prey items included other shark species, shrimp, octopus, and squids.
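The breakdown of identification methods can be tallied directly from the counts given above. This short sketch (category names are mine) recomputes the percentages and adds one derived figure: the share of sharks for which DNA contributed to at least one prey identification.

```python
# Sharks with non-empty stomachs, by how prey were identified
# (counts from the post; category names are illustrative).
counts = {
    "morphology only": 43,
    "DNA only": 28,
    "both": 37,
    "neither": 10,
}

total = sum(counts.values())
percent = {method: round(100 * n / total) for method, n in counts.items()}

# Sharks where DNA barcoding identified at least one prey item.
dna_contributed = counts["DNA only"] + counts["both"]
dna_contributed_percent = round(100 * dna_contributed / total)

for method, n in counts.items():
    print(f"{method}: {n} ({percent[method]}%)")
print(f"DNA contributed: {dna_contributed} of {total} ({dna_contributed_percent}%)")
```

By this tally, DNA helped identify prey in a little over half of the sharks with stomach contents, and for nearly a quarter it was the only method that worked.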

In this report, Dunn and colleagues describe what sharks living in near-darkness in the deep ocean eat. Absent DNA, much of this information would be unknowable. The authors conclude that “DNA barcoding can be used to identify prey, and can greatly increase the rate of data accumulation,” noting “the current cost of survey time vastly outweighs that of DNA barcoding of prey, making DNA barcoding a cost-effective way of increasing sampling rate.”

Naming names faster (addendum)

In yesterday’s post I placed “integrative taxonomy” on a spectrum with morphologic taxonomy at one end and “DNA taxonomy” as applied to eubacteria/archaebacteria at the other. Mehrdad Hajibabaei pointed out that bacterial diversity is not partitioned into species in the same way it is in animals and plants. Eubacteria/archaebacteria have relatively fluid genomes with frequent exchange of DNA among lineages. Fewer than 10,000 bacterial species have been named, although their diversity is certainly vaster than that of all eukaryotes (2 million named species). Thus DNA-based classification of bacteria, at least as presently applied, does not meet the goals of DNA barcoding, which aims to capture species-level differences.

Naming names faster

Species are the units of biodiversity. Discontinuities in biological variation sort organisms into discrete groups that we recognize as species, and so gathering data on differences among organisms is the necessary first step in understanding the diversity of life. Here DNA has singular value: all organisms have DNA and some genetic loci are widely shared, enabling direct comparisons across the diversity of multicellular life. Barcoding targets widely shared gene sequence(s) that nonetheless differ among most closely related species (COI for animals and rbcL+matK for land plants), providing broadly applicable metrics for mapping the discontinuities that represent species. Large-scale DNA barcoding thus offers for the first time a macroscopic view of biodiversity.

This sounds straightforward enough, but naming species, like medical diagnosis, is a process requiring human judgment. A taxonomic expert generally focuses on one or a few species or potential species at a time, sifting through morphological, ecological, behavioral, and DNA data and making inferences about the evolutionary past. It generally takes years or decades between specimen collection and publication of a new species description, and my impression is that most specimens in museum collections including frozen tissues have never been scrutinized in detail sufficient to determine whether they represent new species. Given that a high-throughput laboratory can generate a hundred thousand barcodes in a year, there are opportunities for new workflows.

In May 2010 Frontiers Zool, researchers from Uppsala University, Sweden, and Technical University of Braunschweig, Germany, look at how we might incorporate the flood of DNA data, outlining an approach they (and others) call “integrative taxonomy”. As current practice in taxonomy already involves integrating different kinds of data (morphology, behavior, range, DNA), I take this term to mean an approach somewhere between one primarily based on morphology (“traditional taxonomy”) and one primarily based on DNA (“DNA taxonomy”), such as that for eubacteria and archaebacteria. Padial and colleagues review the recently revitalized scientific discussion about species delimitation involving population biology and phylogenetics, noting “what matters for the study of speciation matters for taxonomy as well.” They call for a flexible approach including the possibility of “recognition of a species on the basis of a single set of characters”, which could be DNA barcodes. Near the end, they address the big challenge, which is that DNA studies, particularly DNA barcoding, “are revealing units that might represent potential new species at a faster pace than results can be followed up for taxonomists.” Padial and co-authors review various protocols used for naming “candidate species” and conclude “standardization of such schemes across taxonomic groups of eukaryotes would be clear progress for data retrieval systems.” As described in more detail here previously, a starting point for discussion of the preferred format for standardizing provisional names was recently proposed (Schindel and Miller, Systema Naturae 250, Chapter 10), based on the scheme currently used by CHAH (Council of Heads of Australasian Herbaria). This system of “taxon labels” (as distinguished from “taxon names”) meets the criteria of uniqueness, stability, and non-confusion with formal taxon names.

At present, our knowledge of biodiversity is built around a catalog of taxon names, annotated with DNA data if available. I imagine the future catalog as being a DNA (barcode) map, annotated with taxon names if available. Some parts of the map, such as for birds, will be heavily annotated, and others, such as for nematodes, will have few formal names, and instead will have taxon labels generated by automated clustering algorithms. In some cases, the DNA data will be derived from individual specimens, backed up by museum vouchers, and in other cases it will be generated from environmental sampling. Only then will we begin to see how much biodiversity is unexplored.
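As a rough illustration of what “taxon labels generated by automated clustering algorithms” could look like, here is a minimal sketch of single-linkage clustering of aligned barcodes at a fixed distance threshold. Everything in it is an assumption made for illustration: the 2% threshold, the label format, and the toy sequences. It is not the Schindel and Miller scheme or any published algorithm.

```python
def hamming_fraction(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_labels(seqs, threshold=0.02, prefix="cluster"):
    """Single-linkage clustering: link any two sequences within threshold.

    seqs maps specimen IDs to aligned sequences. Returns a dict mapping
    each specimen ID to a provisional label (hypothetical format).
    """
    ids = list(seqs)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if hamming_fraction(seqs[i], seqs[j]) <= threshold:
                parent[find(j)] = find(i)

    root_to_label = {}
    labels = {}
    for i in ids:
        root = find(i)
        if root not in root_to_label:
            root_to_label[root] = f"{prefix}-{len(root_to_label) + 1}"
        labels[i] = root_to_label[root]
    return labels


# Toy 100-bp sequences (hypothetical): s1 and s2 differ at 1 site, s3 at all.
toy = {"s1": "A" * 100, "s2": "A" * 99 + "C", "s3": "C" * 100}
labels = cluster_labels(toy)
print(labels)
```

A fixed threshold is the crudest possible choice, and the groups it produces would be candidates for taxonomic follow-up rather than species in their own right.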