Visualizing birds so far

In 2004 PLoS Biology, Hebert and colleagues (I am a co-author) observed that differences in COI barcodes among 260 species of North American birds were generally much larger than those within species, with the result that “distinguishing species was generally straightforward.” In addition, we noted 4 birds with large intraspecific divergences that likely represented overlooked species. Our study included only about 1/50 of world birds (out of approximately 10,000 named species) and modest sampling of differences within species (multiple individuals (average 2.4, range 2-10) for 130 species), so not surprisingly some scientists wondered about the generalizability of the findings in birds in particular and animals in general. In an accompanying commentary, Cicero and Moritz wrote “…a true test of the precision of mtDNA barcodes to assign individuals to species…would require that all members of a genus be examined, rather than a random sample of imprecisely-defined close relatives, and that taxa be included from more than one geographic region.” They concluded their essay:

“But to determine when and where this approach [i.e., DNA barcoding] is applicable, we now need to discover the boundary conditions. The real challenge lies with tropical taxa and those with limited dispersal and thus substantial phylogeographic structure. Such analyses need to be taxonomically broad and need to extend beyond the focal geographic region to ensure that potential sister taxa are evaluated and can be discriminated. There is also the need to examine groups with frequent (possibly cryptic) hybridization, recent radiations, and high rates of gene transfer from mtDNA to the nucleus.”

As of today, the BOLD taxonomy browser at Phylum Chordata, Class Aves (www.boldsystems.org/views/taxbrowser.php?taxid=51) indicates over 24,000 barcoded avian specimens representing over 3,800 avian species, nearing 40% of world avifauna. By my count there are over 30 publications on DNA barcoding in birds, including large surveys in North America, Scandinavia, Argentina, Brazil, and Korea.

In the next few posts, I try to look at what we have learned so far, with an emphasis on visual representation. The short answer to the technical question of barcoding effectiveness in birds is that the early observations are borne out, with a few interesting exceptions. My rough summary is that about 95% of bird species can be distinguished by DNA barcode, the remainder are sorted into pairs or small sets of closely-related species, and about 10% of named species show large divergences that likely represent unrecognized species. This last observation brings up an important point–taxonomy is undergoing constant revision, even in a group as well-studied as birds. For example, over the past 30 years, about 10% of the roughly 2000 bird species on the American Ornithologists’ Union Check-list have had species limits revised, and this process is not near closure. So when we compare sequence data to taxonomic classification, we have to keep in mind that the latter is a moving target.

Looked at more broadly, the central finding is that mtDNA sequence differences in birds partition into distinct clusters. Most mtDNA sequence clusters correspond to a single named species, and the ongoing process of taxonomic revision is tightening the one-to-one correspondence between clusters and species designations. In fact, a person with no knowledge of avian biology could closely approximate species numbers and limits simply by sorting COI barcodes into sequence clusters. Of course, the concordance of species limits and mtDNA sequence differences is not a new observation (see for example Avise et al 1987, 1999; Moore 1995), but it is now backed up by much more data. An important but unsolved question is why mtDNA partitions into narrow clusters in birds and other animals. One or more of the proposed mechanisms may turn out to be correct but none has been proven so far.

To begin the visualization survey:

Pairwise sequence differences within most bird species are small, usually much less than 1%. (“Pairwise sequence differences” means comparing each individual to every other individual of the same species; for n individuals there are n(n-1)/2 comparisons.) In looking at this data, I think that the absolute scale is important. One of the great benefits of working with a standard barcode region is that we can compare results across diverse taxa. To get into particulars, genetic differences are roughly similar in mtDNA protein coding genes (e.g. COI, cytb), but divergences in the mitochondrial control region are an order of magnitude greater.
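To make the n(n-1)/2 bookkeeping concrete, here is a minimal Python sketch. The four sequences and their names are made up for illustration, and the distance used is the simple uncorrected proportion of differing sites, not the K2P metric used in the published surveys.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


# Hypothetical aligned fragments from four individuals of one species.
individuals = {
    "ind1": "ACGTACGTAC",
    "ind2": "ACGTACGTAC",
    "ind3": "ACGTACGTAT",
    "ind4": "ACGAACGTAC",
}

# Every individual compared once with every other individual.
pairs = list(combinations(individuals, 2))
n = len(individuals)
expected_pairs = n * (n - 1) // 2

distances = {
    (a, b): p_distance(individuals[a], individuals[b]) for a, b in pairs
}

for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d:.1%}")
print(f"{len(pairs)} comparisons for n = {n}")
```

With real barcodes (about 650 bp) a single difference is roughly 0.15%, which is why within-species values so often sit well below 1%.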

Here is a look at MAXIMUM intraspecific distances (K2P metric) among some of the larger geographically-based surveys published so far. (Most of this information can also be found in the published papers cited below.) The aim is to see what we can learn from outliers to the general observation of narrow differences within species. For these illustrations, I went to the Public Projects section of BOLD (www.barcodinglife.org), selected a project (Birds of North America Phase II, Kerr 2007; Birds of Argentina Phase I, Kerr et al 2009a; Birds of Scandinavia, Johnsen et al 2010; Birds of the eastern Palearctic, Kerr et al 2009b) and ran a “Nearest Neighbor” analysis with BOLD software, which calculates average and maximum intraspecific distance, as well as the identity of and distance to the “nearest neighbor”. The results were copied and pasted into an Excel spreadsheet, sorted by increasing maximum intraspecific distance, and displayed in a graph as shown below (note the different y-axis scale for eastern Palearctic). The total number of species is noted on the x-axis; yellow marks those with >1% maximum intraspecific distance. The curves are roughly similar among the regions except that the proportion >1% differs.
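For readers who want to see what such an analysis involves, here is a rough Python sketch of the same steps: K2P distances, maximum intraspecific distance per species, nearest neighbor, then sorting and flagging values above 1%. It uses the standard Kimura 2-parameter formula, but it is not BOLD's code; the function names, species labels, and 20-base sequences are all hypothetical, and the formula breaks down for very divergent (saturated) pairs.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence is not A, C, G, or T (gaps, N) are skipped.
    """
    transitions = transversions = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    if transitions == 0 and transversions == 0:
        return 0.0
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def max_intraspecific(records):
    """Map each species with 2+ sequences to its largest within-species distance."""
    by_species = {}
    for species, seq in records:
        by_species.setdefault(species, []).append(seq)
    return {
        species: max(k2p_distance(a, b) for a, b in combinations(seqs, 2))
        for species, seqs in by_species.items()
        if len(seqs) > 1
    }


def nearest_neighbor(records):
    """Map each species to (closest other species, smallest distance to it)."""
    nearest = {}
    for (sp_a, seq_a), (sp_b, seq_b) in combinations(records, 2):
        if sp_a == sp_b:
            continue
        d = k2p_distance(seq_a, seq_b)
        for focal, other in ((sp_a, sp_b), (sp_b, sp_a)):
            if focal not in nearest or d < nearest[focal][1]:
                nearest[focal] = (other, d)
    return nearest


# Hypothetical toy data: (species label, aligned sequence).
records = [
    ("Species A", "ACGTACGTACGTACGTACGT"),
    ("Species A", "ACGTACGTACGTACGTACGC"),  # one transition at the last site
    ("Species B", "ACGTTCGTACGAACGTACGT"),
    ("Species B", "ACGTTCGTACGAACGTACGT"),
]

max_intra = max_intraspecific(records)
nearest = nearest_neighbor(records)

# Sort by increasing maximum intraspecific distance and flag values above 1%.
ranked = sorted(max_intra.items(), key=lambda item: item[1])
flagged = [species for species, d in ranked if d > 0.01]

for species, d in ranked:
    neighbor, nn_distance = nearest[species]
    print(f"{species}: max intraspecific {d:.2%}, "
          f"nearest neighbor {neighbor} at {nn_distance:.2%}")
```

Because the toy sequences are only 20 bases long, a single substitution already exceeds the 1% threshold; with full-length barcodes the same cut-off corresponds to six or seven differences.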

In the next post, I look more carefully at the apparent outliers. What makes them different–biology or taxonomy?

Note added 23 March 2011: Kevin Kerr points out that “eastern Palearctic” refers to the entire region east of Europe, and thus the eastern Palearctic survey referred to above includes sites spanning most of Russia, Kazakhstan, and Mongolia (not just the eastern half of Russia as highlighted in the map).

Note added 24 March 2011: Map corrected to show collecting region for eastern Palearctic survey.

DNA barcoding maps unknowns in Iraq

Rivers and streams are listening devices for watersheds. The best way to assess watershed health is to survey freshwater life downstream. In particular, benthic (bottom-dwelling) macroinvertebrates (visible without magnification) are widely-used indicators of freshwater quality (see for example US EPA page). A challenge for freshwater biomonitoring programs is to rapidly identify the multitude of benthic invertebrate species potentially present in a water sample, and to repeat that for hundreds or thousands of samples. Now imagine you need to assess freshwater quality where few taxonomists have ever ventured.

In January 2011 J N Am Benthol Soc, researchers from U.S. National Museum of Natural History, The American University of Iraq-Sulaimani, and University of Guelph report on DNA barcoding to facilitate biomonitoring in the headwaters of the Tigris River, Iraq. Geraci and colleagues focused on Trichoptera (caddisflies), a group widely used as water quality indicator species. Trichoptera are small, winged insects (approximately 12,000 named caddisfly species worldwide) related to moths and butterflies, with larval stages that develop in freshwater. Sometimes emulated by trout fishermen making lures, caddisfly larvae construct “mobile homes” by gluing together bits of stone, sticks, or other material, with architectural details that help distinguish species. At the time of this study, the world literature on Iraqi Trichoptera consisted of 3 published reports describing 6 species in 7 genera (some larvae were identified only to genus) based on specimens collected between 1919 and 1987.

As part of a larger “Key Biodiversity Areas” survey (for more info, see note below) conducted from 2007 to 2009 by Nature Iraq Organization, the researchers collected benthic macroinvertebrates at twenty sites in three watersheds of the Tigris River during May-June 2008 and January 2009. Four to six replicate samples were obtained at each site, samples were washed with 70% ethanol using a 0.5 mm mesh in the field and again in the laboratory, and caddisfly larvae were removed and stored in 70% ethanol. At two sites adults were collected and placed in separate vials with 70% ethanol. Larvae were sorted using keys for Nearctic and Palearctic Trichoptera, and adults were identified to genus following a key to European species.

Following morphologic sorting, DNA barcoding using standard primers (LepF1/LepR1) was performed on 144 larvae and 6 adults, focusing on individuals in family Hydropsychidae as these were collected in large numbers. Successful amplification was obtained on the first pass with 81.3% of specimens, which is a nice demonstration of the robust nature of DNA itself and of amplification protocols, as storage conditions were not what is considered optimal (optimal storage for insects usually means dried immediately after collection, whereas these specimens had been stored in 70% ethanol for 2-3 years at the time of analysis).

DNA barcodes of Iraqi specimens were compared to the existing Trichoptera barcode library, which so far includes records for about 2500 named species (~19% of world fauna) and many undescribed species. Combining morphologic and DNA data, the researchers identified 16 species in 11 genera and 9 families, with only one of the putative species matching to a previously named organism. There is a lot of Trichoptera taxonomy and molecular phylogeny here, but I will skip to the conclusion. This study demonstrates how DNA barcoding, applied to a “virtually unknown fauna”, can build on an existing barcode library to speed species recognition, establish a practical identification method for general use, and link new discoveries to known genera and families. The authors conclude that “DNA barcoding of benthic macroinvertebrates will be crucial in developing countries that are trying to overcome a lack of knowledge of aquatic-insect taxonomy and trained taxonomists. DNA barcoding will help aquatic scientists in these countries generate the empirical data needed to implement sound bioassessment and monitoring protocols to protect and manage their water resources.”

More generally, I think we can stop looking back longingly at past taxonomic practices and, DNA tools in hand, start helping society and science discover just what is out there, with all the intellectual excitement that entails.

Note added 20 March 2011: Co-author Mohammed Al-Saffar writes: “Key Biodiversity Areas (www.natureiraq.org/site/en) is an ongoing project conducted biannually by Nature Iraq, and we (Nature Iraq, Miami University, Smithsonian NMNH, Guelph University, as well as Clemson University) are in the process of working on the DNA barcoding of all the insects important for monitoring water quality in Iraq such as the Mayflies, Dragonflies, etc.”

DNA reveals panoply of tiny fish

Starksia is a genus of tiny (most less than 2 cm) blennioid fishes found in rocky inshore areas and coral reefs along the Atlantic and Pacific coasts of North and South America and the Caribbean. In 11 February 2011 ZooKeys (open access), researchers from Smithsonian Institution, Ocean Science Foundation, and Nova Southeastern University describe 7 new western Atlantic Starksia species which they first discovered through DNA barcoding.

DNA barcoding revealed divergent clusters within four previously described species, and careful re-inspection revealed morphologic characters associated with each genetic cluster. It is interesting that many of the distinguishing characters are around the head, which may fit with the fact that these fish often spend their time largely hidden with only the head exposed. The ZooKeys article is about 51 pages, or about 7 pages per new species, which I think is about average for a species description. If there were a similar printed key for all fishes (about 25,000 named species so far), it would be 175,000 pages long, which is one reason that methods for non-specialists are needed! Of course keys can be posted on the web, as this one is, but it is still a challenge to find the right key, especially if you don’t already have a good idea of what you are looking at.

I was surprised that the key did not include barcode sequences of the holotypes (the primary specimens chosen to represent each species), although of course these are in GenBank. Even better might be a table of the diagnostic barcode differences among these species, a molecular key. To try this out, I downloaded the Starksia sequences from the Public Projects section of BOLD (www.barcodinglife.org), opened them in MEGA (free sequence analysis software available at www.megasoftware.net), highlighted all positions that differed among the set, and exported these to Excel, including the position numbers, which are shown at the top. An excerpt of the output is shown below.

This sort of display could be useful in settings, including legal ones, where you need to document the basis for identification by barcode. The NJ tree gives the right answer of course, but it is an abstract representation of the data. A table such as the one above would show the actual nucleotide sequence differences that are used to generate the tree.
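The variable-site table described above can also be produced with a short script instead of MEGA and Excel. The following Python sketch assumes the sequences are already aligned to equal length; the taxon labels and sequences are invented placeholders, not Starksia data.

```python
def variable_sites(alignment):
    """Return 1-based positions at which the aligned sequences are not all identical."""
    if len({len(seq) for seq in alignment.values()}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = zip(*alignment.values())
    return [i + 1 for i, column in enumerate(columns) if len(set(column)) > 1]


def molecular_key(alignment):
    """Build a table: header row of positions, then one row of bases per specimen."""
    positions = variable_sites(alignment)
    rows = [["specimen"] + [str(p) for p in positions]]
    for name, seq in alignment.items():
        rows.append([name] + [seq[p - 1] for p in positions])
    return rows


# Hypothetical aligned barcode fragments.
alignment = {
    "taxon_1": "ACGTACGTAC",
    "taxon_2": "ACGTACGTAT",
    "taxon_3": "ACCTACGTAT",
}

table = molecular_key(alignment)

# Tab-separated output can be pasted straight into a spreadsheet.
for row in table:
    print("\t".join(row))
```

Note that this lists every variable position; picking out the positions that are truly diagnostic (fixed within a species and different between species) would need several individuals per species and an extra filtering step.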

Addendum: I meant to mention a very neat feature of reading the ZooKeys article online, which is a menu of links that appears if you place your cursor over any species name!

Mexico barcode surprises

In December 2010 Mitochondrial DNA special issue (open access!) devoted to the Mexican Barcode of Life Initiative (MexBOL), Mexican scientists and colleagues report on barcoding explorations of their megadiverse fauna and flora. A few highlights:

Martínez-Salazar and León-Règagnon from University of Guelph and Universidad Nacional Autónoma de México respectively, examined two morphospecies of Langeronia lung flukes that parasitize Mexican frogs, finding three deeply divergent clusters (approximately 8% uncorrected sequence divergence among specimens from the different regions and 0.3% within). Surprisingly, these clusters were observed in both morphospecies (and did not differ among host species). Based on their results, the researchers conclude that the morphospecies are conspecific, perhaps representing alternative developmental pathways, but it may also be that there are species-level biological differences among the clades not yet recognized. Of note, the 368 bp COI fragment analyzed only partly overlaps the standard barcode region; primers effective for Trematodes are needed.

Cervantes and colleagues (same institutions as above) demonstrated that Common Opossum (Didelphis marsupialis) and Virginia Opossum (D. virginiana), which are sympatric (live in the same area) in Mexico, are readily distinguished by COI barcode (average K2P distances between species are approximately 8%, and within are 1.5%). These species have diagnostic skull morphology, but external characters are unreliable, so field identification is inaccurate and even museum specimens can be misidentified, including 4 museum specimens in this study.

Zaldívar-Riverón and colleagues from Mexico, Canada, and Argentina applied DNA barcoding to braconid wasps collected during three field trips during 2009 in the Chamela-Cuixmala biosphere reserve, near the Pacific coast. Braconidae is an extraordinarily rich (50,000-150,000 species) family of tiny parasitoid wasps that attack butterfly larvae, with many species exquisitely specialized to a single host. The researchers obtained barcode sequences from 407 of 483 specimens, and applied computer software (Yule coalescent model) to estimate how many species were present, which turned out to be 185! I’m guessing they sorted specimens to select different morphospecies before sequencing, as it seems improbable that there could be 185 different species among 407 randomly-collected specimens. In this short report, the researchers did not comment on how many of the “barcode coalescent species” correspond to known wasp species. The rate of species discovery did not plateau over the course of the study, pointing to many, many more braconid wasps in just this one area.

At the beginning of the barcode initiative, there was worry from some taxonomists that it wouldn’t work. After 8 years, we know now that a comprehensive library built around taxonomic reference specimens unambiguously names 95% of animal species from mayflies to mammals, and resolves the remainder into small sets of closely-related species. Looking ahead, it seems obvious that one or another analytic approach, such as coalescent modeling described above, will enable construction of a provisional species and higher-level taxonomy from barcode data alone which will be particularly useful for impossibly diverse or poorly-studied groups, such as nematodes. Experts will improve this draft taxonomy as other information becomes available.

I have long thought that the biggest scientific challenge raised by DNA barcoding is not that it doesn’t work to distinguish some closely-related species, but that it works too well–it reveals biodiversity is much more finely divided and highly-specialized than we knew. Even in birds, the best studied large taxonomic group, a comprehensive DNA barcoding survey demonstrated that 24% of bird species that live in Europe and North America are comprised of isolated populations that have been diverging for more than a million years, likely representing distinct species (Johnsen J Ornithol 2010; open access). To my mind, the big picture results so far are an exciting provocation: we need to better understand specialization, that is, how organisms navigate their environment. For example, what signals (chemical, behavioral, acoustic, electric, visual?) enable a wasp to find and distinguish one butterfly larva from another, or to determine whether the larva is already parasitized? Animals can be viewed as highly-discriminating and sensitive detectors of other life forms. A genetic approach might identify parts of the nervous system or sensory system that enable these feats. With better understanding we might construct highly-specific biosensors, say to detect pathogens.

Addendum: Correcting my supposition about whether specimens were sorted prior to barcoding, Dan Janzen tells me that high species counts are the norm when collecting braconid wasps in tropical sites, such that 185 braconid wasp species from 483 specimens is not unusual.

Breath tests for DNA

In August 2010 PLoS ONE, researchers from University of Queensland, Georgetown University, and National Aquarium look at the feasibility of genotyping cetaceans (whales, dolphins, and porpoises) by sampling blow, the exhalations from blowholes. The standard method for collecting cetacean DNA, dart biopsying, is considered inappropriate in some settings, particularly for young animals. Blow sampling has been used to assess disease in free-ranging cetaceans (Acevedo-Whitehouse et al Anim Cons 2009).

In the PLoS ONE report, Frère and colleagues studied six bottlenose dolphins (Tursiops truncatus) housed at the National Aquarium from which they were able to collect both blood and blow samples. Blow sampling involved holding a 50 mL polypropylene tube inverted over the blowhole of “dolphins trained to exhale on cue.” Tubes were placed on dry ice for transport to the laboratory, where the presumably adherent blow material was resuspended in 500 µL of TE buffer (this worked better than ethanol), and centrifuged at 3000 rpm for 3 min. Excess TE was removed, and DNA was extracted using a Qiagen DNeasy Blood and Tissue Kit. For all six individuals, mitochondrial and microsatellite DNA profiles from blow matched those from blood. The researchers applied this approach to a wild population of bottlenose dolphins in the eastern gulf of Shark Bay, Australia, using “a modified embroidery hoop with sterile filter paper stretched over its centre,” with successful recovery of mitochondrial DNA from one individual so far.

Looking ahead, small, remote-controlled devices might be used for sampling, as were employed in filming cetaceans in Oceans. There may also be applications of DNA breath-testing in land animals (see Schlieren image of extensive turbulent flow following a cough). More generally, the increasing sensitivity of DNA techniques opens a dizzying array of possibilities for DNA-based identification. For example, forensic laboratories now routinely employ “touch DNA” methods sensitive enough to detect the tiny number of cells that are routinely shed when we touch objects, and the presence of amphibians in a pond can be determined by DNA testing a 15 mL water sample (Ficetola Biol Lett 2008).

Expanding access to DNA secrets

When Roger Tory Peterson’s “A Field Guide to the Birds” was published in 1934, it opened the door to a multitude of persons being able to identify birds, helped create a small industry of birding guides and optics, and was a driving force in the much larger social transformation in awareness of the natural world and human impact. I see the library of DNA barcodes as a (near) universal field guide to the immense diversity of multicellular life, with similar potential for large scientific and societal benefits. Of course the library is not complete (so far, >1 M records, >92 K species), but enough work has been done in diverse taxonomic groups to be confident that a library of standardized, short DNA sequences linked to named, vouchered specimens (i.e. DNA barcodes) will enable species-level identification of most multicellular animals and narrow identification to one or a few plant species.

So far, it is mostly only scientists who have direct access to DNA secrets. A future in which non-professionals analyze DNA is creeping closer. You can mail a cheek swab to a DNA lab to reconstruct your personal ancestral genealogy ($150) or check paternity ($400). Whole genome sequencing is available too, but to my reading this is too expensive for now ($20,000) and the results and interpretation are not generally useful. Kits for DNA analysis are already in use in high school classrooms and, closer to home, educational DNA barcoding looks to be around the corner. On December 20, 2010, Bio-Rad Laboratories, a scientific supply company, announced a partnership with Coastal Marine BioLabs (CMB) to develop “DNA barcoding instructional activities for classrooms.” CMB has been active in engaging high school students in generating and submitting reference data to the BOLD database. I expect the potential market for DNA barcoding kits in education is large.

Cool new barcode app

The US Global Positioning System (GPS), consisting of 24 to 32 satellites in medium earth orbit, cost $32 billion to develop and is supported by an annual budget of $1 billion. When the high resolution GPS signal was first made available to the public in May 2000 by President Bill Clinton, I imagine that few persons anticipated how useful it would be. Ten years later there are numerous, diverse applications, ranging from a smartphone app for finding the nearest post office in Australia to tracking animals across the Pacific. Like GPS, the Barcode of Life Database (BOLD) is a public, large-scale technology infrastructure resource. Similar to the trajectory with GPS, I expect that over the next 10 years BOLD will enable an expanding array of applications useful for students, consumers, commercial entities, regulators, researchers, and probably some just for fun.

In November 2010 Molecular Ecology (request pdf from author) researchers from University of Guelph, Canada and Institut National de la Recherche Agronomique, France report on “molecular analysis of parasitoid linkages (MAPL)”. As background, parasitoid insects–many or most are wasps (order Hymenoptera)–lay eggs in the larvae of other insects, primarily Lepidoptera (butterflies and moths) and  Diptera (flies). Host mortality may exceed 90%, and many parasitoids serve as useful biocontrol agents for agricultural pests. Parasitoid wasps are generally tiny and hard to distinguish morphologically, and identifying hosts may take years of patient observation. Recent molecular data show unexpected diversity and host specificity, i.e. many parasitoid species thought to be generalists are in fact comprised of multiple distinct lineages each limited to a single host.

In this study, Rougerie and colleagues looked at whether it was possible to identify the hosts by looking for leftover DNA in the abdomen of adult wasps. As an aside, the general approach in building up the barcode reference library for animals is to use broad-range primers that amplify COI from a wide taxonomic array of specimens. Now that parts of the library are established, it is possible to make use of the accumulated data to design primers that amplify specific taxonomic groups. Such taxon-restricted primers can help address interesting questions. In this study, researchers utilized two sets of primers, one set (primarily LepF1/LepR1) that amplified COI from the wasps and one set (LepF1/MLepR1) with a reverse primer that was specific to the potential hosts, namely Diptera and Lepidoptera. The first set successfully amplified COI from single legs of 297 adult wasp specimens thought to comprise more than 90 species and 20 genera. Using the same DNA extracts, the host-specific primers yielded PCR products from only 9 (3%) of these specimens, demonstrating good selectivity. Rougerie and colleagues then prepared DNA extracts from the abdominal segment of 3 species of hand-reared wasps (so that the host species were known), collected immediately after emergence. 29 (24%) of 120 specimens yielded readable PCR products, of which all except one matched to the known lepidopteran host species.  The authors conclude that “MAPL has immediate applications in the agricultural sciences by facilitating selection of biological control agents” and that it “will drastically accelerate the registration of host-parasitoid associations and that the development of similar approaches for other orders of insects with complete metamorphosis will  be equally productive.” I look forward to these new apps!

How to make an identification machine

Successful automation often involves machines that carry out tasks differently than persons. For example, a Coulter counter (developed by Wallace H. Coulter, an American engineer) counts and sizes blood cells by the change in electrical impedance each cell produces as it passes through a small aperture, producing a detailed report of red and white cell types faster and more cheaply than does a technician examining a blood smear under a light microscope. As another case, machine identification of commercial products is enabled by a UPC bar code, which represents a product name in a digital format that can be “read” almost instantaneously by a laser scanner. In a similar way, DNA barcoding “reads” the digital code of DNA, associating that with species names in a reference database, opening the door to fully or partly automated identifications. In 9 September 2010 Nature, scientists from London Natural History Museum, Louisiana State University, and University of Plymouth, UK, propose a different route to automate taxonomic identification, namely, teaching computers to do morphologic pattern recognition. Now that we are on the threshold of “anyone, anywhere, anything” identification with DNA barcoding, this seems a step backward.

I see three major challenges that limit any morphology-based identification system: naming an organism from bits and pieces, recognizing look-alikes and life stages, and the diversity of diagnostic features requiring specialized equipment. On the other hand, DNA is the same whether from an intact specimen or an unrecognizable stomach fragment, readily distinguishes look-alikes in any life stage, and can be analyzed using the same equipment regardless of specimen. More generally, at the end of the day, little scientific insight will have been gained from a system that distinguishes life forms by the multitudinous particulars of appearance, whereas a library of DNA barcodes linked to named specimens offers a broad view of species-level differences across the diversity of life.

According to MacLeod and colleagues, “a [DNA] bar code isn’t useful until the reference species has been identified by experts”. This makes no sense to me. All large barcode surveys of animals, from ants to fish, have revealed hidden genetic divergences, in many cases leading to recognition of new species. In fact, DNA barcoding is a fast way of screening existing collections for unrecognized species. In this same section, as part of discounting a DNA approach, they state “researchers frequently need to identify non-living objects as well as living ones”. I don’t understand how this is an objection, since, for example, DNA barcodes from ancient bone fragments have been used to define species of extinct flightless Moa (Lambert et al J Heredity 2005).

I know from iPhoto’s remarkable ability to recognize individuals that computers are getting better at pattern recognition. Further development focused on taxonomic specimens may lead to useful tools. However, this seems unlikely to lead to a widely applicable automated system. In a study cited by the authors, phytoplankton identifications by 16 marine ecologists were compared to those with DiCANN, a machine learning system (Culverhouse et al Marine Ecol Prog Series 2003). The authors of that study conclude what is likely to be generally true about morphology based identification:   “In general, neither human nor machine can be expected to give highly accurate or repeatable labeling of specimens”.

One biodiversity database to the next

Jumping between biodiversity databases is getting easier. For example, typing in “Atlantic cod” at Ocean Biogeographic Information System (OBIS) takes you to a Gadus morhua species page summarizing 616,444 records, a zoomable map of its geographic range based on specimen collection locations, and direct links to G. morhua pages in other databases, including, for example, Barcode of Life (BOLD), Encyclopedia of Life (EOL), Catalogue of Life, World Register of Marine Species (WoRMS), and Google images, among others. Having all that, inspired by Matt Damon’s character in The Bourne Ultimatum, we want to take more leaps–perhaps to G. morhua pages in Arkive, Biodiversity Heritage Library, FishBase, and/or GenBank?

Something new is having links to Encyclopedia of Life species pages embedded in research articles (so far in some papers in PLoS ONE; for an example, see shark names in Ward-Paige et al 2010 PLoS ONE). Having direct links to literature sources is a wonderful enhancement of research articles, and I believe that species name links will be equally valuable, particularly for biodiversity literature, so I hope this catches on. Species name links have potential to increase the audience and impact of research papers, since many otherwise interested persons will not recognize scientific names or will be entirely unfamiliar with the organisms being studied.

Ground beetles join in

On September 25, 2010, BOLD passed 1 M barcode records, and the International Barcode of Life (iBOL) was officially launched in Toronto, Canada, with a goal of 5 M records representing 500 K species in 5 years, the largest biodiversity genomics project to date. In terms of DNA sequencing, the iBOL targets (5 x 10^6 barcodes x 650 bp/barcode = 3.3 x 10^9 bp) are equivalent to the Human Genome Project (human genome = 3.4 x 10^9 bp). However, whereas HGP involved sequencing DNA samples from a few individuals, the DNA barcode library is built by thousands of scientists examining thousands of individual specimens, one by one. So a big challenge is obtaining, identifying, tracking, processing, and preserving millions of specimens.
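The sequencing comparison above is simple arithmetic, spelled out here in Python using only the figures quoted in the paragraph (the variable names are mine).

```python
# Figures quoted in the text.
ibol_target_barcodes = 5_000_000   # 5 M barcode records
barcode_length_bp = 650            # approximate length of a COI barcode
human_genome_bp = 3.4e9            # human genome size as quoted above

ibol_total_bp = ibol_target_barcodes * barcode_length_bp
ratio = ibol_total_bp / human_genome_bp

print(f"iBOL target: {ibol_total_bp:.2e} bp")
print(f"Fraction of one human genome: {ratio:.2f}")
```

So the target works out to 3.25 billion base pairs, which rounds to the 3.3 billion quoted and is a little under one human genome's worth of sequence.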

What are recent arrivals to the library? For one example, in current Frontiers Zool, researchers from Germany and US (I am co-author) report on DNA identification of Central European ground beetles (family Carabidae). This family comprises “no less than an estimated 40,000 described species that inhabit all terrestrial habitat types from the sub-arctic to wet tropical regions,” making identifications a challenge for taxonomists and non-specialists alike. Raupach and colleagues successfully amplified and sequenced COI barcodes and nuclear ribosomal DNA expansion segments D3, V4, and V7, from 344 specimens representing 75 species in 28 genera (average 4 specimens/species, range 2-13). Most specimens were preserved in 96% alcohol for 1-2 years; some were stored as dry pinned specimens for up to 12 years. 73 (97%) species were resolved by COI, whereas the 3 nuclear markers individually resolved a smaller proportion, 81% (D3), 57% (V4) and 87% (V7), and combining the 3 nuclear markers gave 95% discrimination. The one species pair with shared COI haplotypes also showed identical nuclear markers. Two species exhibited distinct COI clusters (intra-specific p-distances 2.7%, 3.8%), 1 of which also had distinct nuclear haplotypes.

To my knowledge, this is the first taxonomic paper with a “Klee diagram” depicting indicator vector correlations among COI barcode sequences. As developed by mathematician Larry Sirovich and his colleague Yu Zhang (Sirovich et al PLoS ONE 2010), indicator vectors are digital representations of DNA sequences that “preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information density single-page displays” such as the Klee diagram shown here. According to the BOLD Taxonomy Browser, there are DNA barcodes for 495 carabid beetle species so far, so I look forward to more of the remaining 39,505 or so species joining the barcode library, and dream of a comprehensive indicator vector/Klee analysis of the ground beetle family.

In closing, professional and non-professional insect specialists alike may enjoy the recently released film “Beetle Queen Conquers Tokyo” by Jessica Oreck, a lyrical look at beetle and insect fanciers in Japan.