Visualizing birds: Part 2. Distant clusters, unfinished taxonomy

In 1911, Rutherford proposed correctly that essentially all the mass of an atom is concentrated in a tiny “central charge” (what we now call the nucleus) and that the rest of an atom was essentially empty space, devoid of mass (https://en.wikipedia.org/wiki/Rutherford_model). This comes to mind in looking at results so far with birds, which overwhelmingly show that mtDNA differences are partitioned into tight clusters, and conversely most of the nearby genetic “space” is empty. In the language of evolutionary science, living organisms are narrow discontinuities without intermediate forms.

In yesterday’s post I noted that a minority of avian species exhibit large intraspecific distances. One possibility is that these represent species with wide and more or less continuous variation, like the distribution of height in humans, for example. A quick perusal of an NJ (neighbor-joining) tree shows this is not the case. Rather, as noted in all published surveys so far, species with large intraspecific distances are composed of distinct clusters. As an alternative to an NJ tree, here is another way of looking at these data. For the illustration at left I took all species in the North American project (Kerr et al 2007) with maximum distances of 2% or more, sorted sequences into sets as indicated by the NJ tree, calculated the maximum distances within each component cluster, and graphed these so that maximum distances within component clusters appear below the respective point for the species. In this analysis, all species with large intraspecific distances were composed of 2 clusters with much lower variation. In all cases, large intraspecific values reflected comparisons across the branches of the tree. One way of looking at this is that mtDNA sequence clustering is the same in species with high and low maximum distances. What differs is that species with large intraspecific distances include multiple clusters.
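As a rough illustration of the calculation described above (not the actual workflow, which used BOLD and a spreadsheet), here is a minimal Python sketch that compares the species-wide maximum distance with the maximum within each component cluster. The sequence names, cluster labels, and distance values are all hypothetical.

```python
from itertools import combinations


def max_distance(names, dist):
    """Largest pairwise distance among the named sequences (0.0 if fewer than two)."""
    return max((dist[frozenset(pair)] for pair in combinations(names, 2)), default=0.0)


def summarize_species(clusters, dist):
    """Return (species-wide maximum, {cluster label: within-cluster maximum})."""
    all_names = [name for members in clusters.values() for name in members]
    within = {label: max_distance(members, dist) for label, members in clusters.items()}
    return max_distance(all_names, dist), within


# Hypothetical species: two tight clusters separated by 3% divergence.
clusters = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
dist = {}
for x, y in combinations(["a1", "a2", "a3", "b1", "b2"], 2):
    same_cluster = x[0] == y[0]  # names starting with the same letter share a cluster
    dist[frozenset((x, y))] = 0.002 if same_cluster else 0.030

species_max, within_max = summarize_species(clusters, dist)
print(species_max, within_max)
```

In this toy case the large species-wide value comes entirely from comparisons between the two clusters, which is the pattern described in the paragraph above.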

At right is another way of looking at this. Here I used all species in the Argentinian dataset (Kerr et al 2009) with maximum intraspecific distances of 1% or greater. For each species, the graph shows ALL pairwise distances ranked in increasing order, and a yellow line connects lower and upper pairwise values for each species. If species exhibit a range of differences, then there should be a more or less continuous range of pairwise values. On the other hand, if species are composed of clusters, then there will be one set of small pairwise distances from comparisons within clusters, and a set of larger distances from comparisons between clusters. With one exception (the second species from the left), large intraspecific distances reflected the presence of distinct clusters included under a single umbrella species designation.
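The distinction between a continuous range and two separate sets of distances can be checked numerically as well as by eye. The sketch below is one simple way to do it, assuming hypothetical distance values: rank the pairwise distances for a species and report the largest jump between neighboring values.

```python
def largest_gap(pairwise):
    """Rank pairwise distances and return (ranked list, biggest jump, index of jump)."""
    ranked = sorted(pairwise)
    if len(ranked) < 2:
        return ranked, 0.0, None
    jumps = [b - a for a, b in zip(ranked, ranked[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return ranked, jumps[i], i


# Hypothetical species made of two clusters: small within, large between.
clustered = [0.001, 0.002, 0.000, 0.031, 0.029, 0.030]
# Hypothetical species with a more or less continuous range of differences.
continuous = [0.002, 0.006, 0.010, 0.014, 0.018, 0.022]

for label, values in (("clustered", clustered), ("continuous", continuous)):
    ranked, gap, index = largest_gap(values)
    print(label, ranked, round(gap, 3))
```

A large jump relative to the values on either side of it suggests distinct clusters; evenly spaced values suggest continuous variation.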

So where are we? Can we conclude that there is a minority of species that are genetically polytypic?  One way to answer this is to look at recent taxonomic revisions in birds, taking advantage of the extremely well-documented historical record in the form of updates to the American Ornithologists’ Union (AOU) Check-list. In the next post I will look at refinements to avian species taxonomy through the lens of COI barcodes.

High school students to explore wilds of New York City with DNA, win prizes

On March 8, 2011, scientists and science educators at the Dolan DNA Learning Center, Cold Spring Harbor Laboratory, announced the “Urban Barcode Project.” From the website www.urbanbarcodeproject.org:

The Urban Barcode Project (UBP) is a science competition spanning the five boroughs of New York City made possible by funding from the Alfred P. Sloan Foundation. Just as a unique pattern of bars in a universal product code identifies each item for sale in a store, a DNA barcode is a DNA sequence that uniquely identifies each species of living thing. In the project, student research teams use DNA barcoding to explore biodiversity in NYC.

Projects can use DNA barcodes to examine any aspect of the NYC environment, such as:

  • Sampling biodiversity in a park, garden, office, or school.
  • Checking for invasive plant or animal species.
  • Monitoring animal movements or migrations.
  • Identifying exotic or endangered food products in markets.
  • Detecting food or product fraud.

On the website there is a neat 1 min video, a helpful informational brochure, FAQs, and details on $20,000 in prize money!

I am the Scientific Advisor on this project, and I think this is a wonderful way for high school students to do science, one that is likely to inspire efforts elsewhere.

Visualizing birds so far

In 2004 PLoS Biology, Hebert and colleagues (I am a co-author) observed that differences in COI barcodes among 260 species of North American birds were generally much larger than those within species, with the result that “distinguishing species was generally straightforward.” In addition, we noted 4 birds with large intraspecific divergences that likely represented overlooked species. Our study included only about 1/40 of world birds (out of approximately 10,000 named species) and modest sampling of differences within species (multiple individuals for 130 species; average 2.4, range 2-10), so not surprisingly some scientists wondered about the generalizability of the findings in birds in particular and animals in general. In an accompanying commentary, Cicero and Moritz wrote “…a true test of the precision of mtDNA barcodes to assign individuals to species…would require that all members of a genus be examined, rather than a random sample of imprecisely-defined close relatives, and that taxa be included from more than one geographic region.” They concluded their essay:

“But to determine when and where this approach [i.e., DNA barcoding] is applicable, we now need to discover the boundary conditions. The real challenge lies with tropical taxa and those with limited dispersal and thus substantial phylogeographic structure. Such analyses need to be taxonomically broad and need to extend beyond the focal geographic region to ensure that potential sister taxa are evaluated and can be discriminated. There is also the need to examine groups with frequent (possibly cryptic) hybridization, recent radiations, and high rates of gene transfer from mtDNA to the nucleus.”

As of today, the BOLD taxonomy browser at Phylum Chordata, Class Aves (www.boldsystems.org/views/taxbrowser.php?taxid=51) indicates over 24,000 barcoded avian specimens representing over 3,800 avian species, nearing 40% of world avifauna. By my count there are over 30 publications on DNA barcoding in birds, including large surveys in North America, Scandinavia, Argentina, Brazil, and Korea.

In the next few posts, I try to look at what we have learned so far, with an emphasis on visual representation. The short answer to the technical question of barcoding effectiveness in birds is that the early observations are borne out, with a few interesting exceptions. My rough summary is that about 95% of bird species can be distinguished by DNA barcode, the remainder are sorted into pairs or small sets of closely-related species, and about 10% of named species show large divergences that likely represent unrecognized species. This last observation brings up an important point: taxonomy is undergoing constant revision, even in a group as well-studied as birds. For example, over the past 30 years, about 10% of the roughly 2000 bird species on the American Ornithologists’ Union Check-list have had species limits revised, and this process is not near closure. So when we compare sequence data to taxonomic classification, we have to keep in mind that the latter is a moving target.

Looked at more broadly, the central finding is that mtDNA sequence differences in birds partition into distinct clusters. Most mtDNA sequence clusters correspond to a single named species, and the ongoing process of taxonomic revision is tightening the one-to-one correspondence between clusters and species designations. In fact, a person with no knowledge of avian biology could closely approximate species numbers and limits simply by sorting COI barcodes into sequence clusters. Of course, the concordance of species limits and mtDNA sequence differences is not a new observation (see for example Avise et al 1987, 1999; Moore 1995), but it is now backed up by much more data. An important but unsolved question is why mtDNA partitions into narrow clusters in birds and other animals. One or more of the proposed mechanisms may turn out to be correct but none has been proven so far.
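To make the idea of “sorting COI barcodes into sequence clusters” concrete, here is a minimal Python sketch of one simple approach, single-linkage clustering at a fixed distance threshold. This is an illustration only, not a method taken from any of the surveys cited here; the 2% threshold, the toy 100-site sequences, and the specimen names are all assumptions, and distances are uncorrected proportions of differing sites.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites that differ (uncorrected distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_threshold(seqs, threshold):
    """Single-linkage clustering: join any two sequences closer than threshold."""
    parent = {name: name for name in seqs}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def mutate(seq, positions):
    """Swap the base at each listed position (A<->G, C<->T) to build toy variants."""
    swap = {"A": "G", "G": "A", "C": "T", "T": "C"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


base = "ACGT" * 25  # hypothetical 100-site alignment
seqs = {
    "x1": base,
    "x2": mutate(base, [0]),
    "y1": mutate(base, [10, 20, 30, 40, 50]),
    "y2": mutate(base, [10, 20, 30, 40, 50, 60]),
}
print(cluster_by_threshold(seqs, threshold=0.02))
```

With these toy sequences the four specimens fall into two clusters of two, without any input about which species they belong to.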

To begin the visualization survey:

Pairwise sequence differences within most bird species are small, usually much less than 1%. (“Pairwise sequence differences” means comparing each individual to every other individual of the same species; for n individuals there are n(n-1)/2 comparisons.) In looking at this data, I think that the absolute scale is important. One of the great benefits of working with a standard barcode region is that we can compare results across diverse taxa. To get into particulars, genetic differences are roughly similar in mtDNA protein coding genes (e.g. COI, cytb), but divergences in the mitochondrial control region are an order of magnitude greater.
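For readers who want to see the arithmetic, here is a minimal Python sketch of a pairwise comparison using the Kimura 2-parameter (K2P) distance, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions. The toy sequences are hypothetical, and skipping sites with gaps or ambiguous bases is an assumption of this sketch; it is not necessarily how BOLD or other software handles them.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    # Keep only sites where both sequences have an unambiguous base.
    sites = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    n = len(sites)
    transitions = sum(1 for a, b in sites
                      if a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES))
    transversions = sum(1 for a, b in sites if a != b) - transitions
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def max_intraspecific(seqs):
    """Largest K2P distance over all n(n-1)/2 pairs of sequences."""
    return max((k2p(a, b) for a, b in combinations(seqs, 2)), default=0.0)


base = "ACGT" * 25                # hypothetical 100-site sequence
one_transition = "G" + base[1:]   # A -> G at the first site
individuals = [base, base, one_transition, base]

n_pairs = len(list(combinations(individuals, 2)))
print(n_pairs, round(max_intraspecific(individuals), 4))
```

Four individuals give 4 x 3 / 2 = 6 comparisons, and a single transition in 100 sites gives a K2P distance of about 1%.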

Here is a look at MAXIMUM intraspecific distances (K2P metric) among some of the larger geographically-based surveys published so far. (Most of this information can also be found in the published papers cited below.) The aim is to see what we can learn from outliers to the general observation of narrow differences within species. For these illustrations, I went to the Public Projects section of BOLD (www.barcodinglife.org), selected a project (Birds of North America Phase II, Kerr et al 2007; Birds of Argentina Phase I, Kerr et al 2009a; Birds of Scandinavia, Johnsen et al 2010; Birds of the eastern Palearctic, Kerr et al 2009b) and ran a “Nearest Neighbor” analysis with BOLD software, which calculates average and maximum intraspecific distance, as well as identity of and distance to the “nearest neighbor”. The results were copied and pasted into an Excel spreadsheet, sorted by increasing maximum intraspecific distance, and displayed in a graph as shown below (note different y-axis scale for eastern Palearctic). The total number of species is noted on the x-axis; yellow marks those with >1% maximum intraspecific distance. The curves are roughly similar among the regions except that the proportion >1% differs.
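The spreadsheet step described above (sort by maximum intraspecific distance, mark species above 1%) can also be scripted. Here is a minimal Python sketch under stated assumptions: the species names and percentages are made up, and the input is taken to be simple (name, maximum distance in percent) pairs rather than the actual BOLD output format.

```python
def rank_species(rows, cutoff=1.0):
    """Sort (name, max distance %) rows and list the species above the cutoff."""
    ranked = sorted(rows, key=lambda row: row[1])
    flagged = [name for name, max_pct in ranked if max_pct > cutoff]
    return ranked, flagged, len(flagged) / len(ranked)


# Hypothetical values, for illustration only.
rows = [
    ("species_1", 0.31),
    ("species_2", 2.40),
    ("species_3", 0.00),
    ("species_4", 0.62),
    ("species_5", 4.75),
]

ranked, flagged, proportion = rank_species(rows)
print([name for name, _ in ranked])
print(flagged, proportion)
```

The ranked list gives the order of points along the x-axis, and the flagged names are those that would be marked in yellow.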

In the next post, I look more carefully at the apparent outliers. What makes them different–biology or taxonomy?

Note added 23 March 2011: Kevin Kerr points out that “eastern Palearctic” refers to the entire region east of Europe, and thus the eastern Palearctic survey referred to above includes sites spanning most of Russia, Kazakhstan, and Mongolia (not just the eastern half of Russia as highlighted in the map).

Note added 24 March 2011: Map corrected to show collecting region for eastern Palearctic survey.

DNA barcoding maps unknowns in Iraq

Rivers and streams are listening devices for watersheds. The best way to assess watershed health is to survey freshwater life downstream. In particular, benthic (bottom-dwelling) macroinvertebrates (visible without magnification) are widely-used indicators of freshwater quality (see for example US EPA page). A challenge for freshwater biomonitoring programs is to rapidly identify the multitude of benthic invertebrate species potentially present in a water sample, and to repeat that for hundreds or thousands of samples. Now imagine you need to assess freshwater quality where few taxonomists have ever ventured.

In January 2011 J N Am Benthol Soc, researchers from U.S. National Museum of Natural History, The American University of Iraq-Sulaimani, and University of Guelph report on DNA barcoding to facilitate biomonitoring in the headwaters of the Tigris River, Iraq. Geraci and colleagues focused on Trichoptera (caddisflies), a group widely used as water quality indicator species. Trichoptera are small, winged insects (approximately 13,000 named caddisfly species worldwide) related to moths and butterflies, with larval stages that develop in freshwater. Sometimes emulated by trout fishermen making lures, caddisfly larvae construct “mobile homes” by gluing together bits of stone, sticks, or other material, with architectural details that help distinguish species. At the time of this study, the world literature on Iraqi Trichoptera consisted of 3 published reports describing 6 species in 7 genera (some larvae were identified only to genus) based on specimens collected between 1919 and 1987.

As part of a larger “Key Biodiversity Areas” survey (for more info, see note below) conducted from 2007 to 2009 by Nature Iraq Organization, the researchers collected benthic macroinvertebrates at twenty sites in three watersheds of the Tigris River during May-June 2008 and January 2009. Four to six replicate samples were obtained at each site, samples were washed with 70% ethanol using a 0.5 mm mesh in the field and again in the laboratory, and caddisfly larvae were removed and stored in 70% ethanol. At two sites adults were collected and placed in separate vials with 70% ethanol. Larvae were sorted using keys for Nearctic and Palearctic Trichoptera, and adults were identified to genus following a key to European species.

Following morphologic sorting, DNA barcoding using standard primers (LepF1/LepR1) was performed on 144 larvae and 6 adults, focusing on individuals in family Hydropsychidae as these were collected in large numbers. Successful amplification was obtained on the first pass with 81.3% of specimens, which is a nice demonstration of the robust nature of DNA itself and of amplification protocols, as storage conditions were not what is considered optimal (optimal storage for insects usually means dried immediately after collection, whereas these specimens had been stored in 70% ethanol for 2-3 years at the time of analysis).

DNA barcodes of Iraqi specimens were compared to the existing Trichoptera barcode library, which so far includes records for about 2500 named species (~19% of world fauna) and many undescribed species. Combining morphologic and DNA data, the researchers identified 16 species in 11 genera and 9 families, with only one of the putative species matching a previously named organism. There is a lot of Trichoptera taxonomy and molecular phylogeny here, but I will skip to the conclusion. This study demonstrates how DNA barcoding, applied to a “virtually unknown fauna”, can build on an existing barcode library to speed species recognition, establish a practical identification method for general use, and link new discoveries to known genera and families. The authors conclude that “DNA barcoding of benthic macroinvertebrates will be crucial in developing countries that are trying to overcome a lack of knowledge of aquatic-insect taxonomy and trained taxonomists. DNA barcoding will help aquatic scientists in these countries generate the empirical data needed to implement sound bioassessment and monitoring protocols to protect and manage their water resources.”

More generally, I think we can stop looking back longingly at past taxonomic practices and, DNA tools in hand, start helping society and science discover just what is out there, with all the intellectual excitement that entails.

Note added 20 March 2011: Co-author Mohammed Al-Saffar writes: “Key Biodiversity Areas (www.natureiraq.org/site/en) is an ongoing project conducted biannually by Nature Iraq, and we (Nature Iraq, Miami University, Smithsonian NMNH, Guelph University, as well as Clemson University) are in the process of working on the DNA barcoding of all the insects important for monitoring water quality in Iraq such as the Mayflies, Dragonflies, etc.”

DNA reveals panoply of tiny fish

Starksia is a genus of tiny (most less than 2 cm) blennioid fishes found in rocky inshore areas and coral reefs along the Atlantic and Pacific coasts of North and South America and the Caribbean. In 11 February 2011 ZooKeys (open access), researchers from Smithsonian Institution, Ocean Science Foundation, and Nova Southeastern University describe 7 new western Atlantic Starksia species which they first discovered through DNA barcoding.

DNA barcoding revealed divergent clusters within three previously described species, and careful re-inspection revealed morphologic characters associated with each genetic cluster. It is interesting that many of the distinguishing characters are around the head, which may fit with the fact that these fish often spend their time largely hidden with only the head exposed. The ZooKeys article is about 51 pages, or about 7 pages per new species, which I think is about average for a species description. If there were a similar printed key for all fishes (about 25,000 named species so far), it would be 175,000 pages long, which is one reason that methods for non-specialists are needed! Of course keys can be posted on the web, as this one is, but it is still a challenge to find the right key, especially if you don’t already have a good idea of what you are looking at.

I was surprised that the key did not include barcode sequences of the holotypes (the primary specimens chosen to represent each species), although of course these are in GenBank. Even better might be a table of the diagnostic barcode differences among these species, a molecular key. To try this out, I downloaded the Starksia sequences from the Public Projects section of BOLD (www.barcodinglife.org), opened them in MEGA (free sequence analysis software available at www.megasoftware.net), highlighted all positions that differed among the set, and exported these to Excel, including the position numbers, which are shown at the top. An excerpt of the output is shown below.

This sort of display could be useful, including in a legal setting when you need to document the basis for identification by barcode. The NJ tree gives the right answer of course, but it is an abstract representation of the data. A table such as the one above shows the actual nucleotide sequence differences that are used to generate the tree.
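For anyone who prefers a script to the MEGA-and-Excel route, the same kind of table can be built in a few lines. The following Python sketch assumes the sequences are already aligned and of equal length; the specimen names and the short toy sequences are hypothetical, not actual Starksia data.

```python
def variable_sites(alignment):
    """Return (1-based variable positions, {name: bases at those positions})."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # A position is variable if the sequences do not all share the same base there.
    positions = [i for i in range(length)
                 if len({alignment[name][i] for name in names}) > 1]
    table = {name: "".join(alignment[name][i] for i in positions) for name in names}
    return [i + 1 for i in positions], table


# Hypothetical aligned barcodes (real ones are about 650 bases long).
alignment = {
    "specimen_1": "ACGTACGT",
    "specimen_2": "ACGTACGA",
    "specimen_3": "ACCTACGT",
}

positions, table = variable_sites(alignment)
print("positions:", positions)
for name, bases in table.items():
    print(name, bases)
```

Each row then shows only the bases at positions that differ somewhere in the set, with the position numbers as the header.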

Addendum: I meant to mention a very neat feature of reading the ZooKeys article online, which is a menu of links that appears if you place your cursor over any species name!

Mao and Deforestation

From the 10 February 2011 New York Review of Books, p. 27, essay by Roderick MacFarquhar on Mao’s Great Famine: The History of China’s Most Devastating Catastrophe, 1958-1962 by Frank Dikötter, Walker, 2010:
“Dikötter’s accounts of the tornado of destruction that the GLF [Great Leap Forward] unleashed continue with … the environment, including the decimation of forests to provide fuel for the backyard furnaces and for homes and construction. As he remarks: The damage varied from place to place, and even in the archives statistics are political artefacts rather than objective reflections of reality. What is certain is that never before had such a large diversity of forests, from the bamboo groves in the south to the alpine meadows and boreal stands of fir and pine in the north, suffered such a prolonged and intense attack.”

Mexico barcode surprises

In December 2010 Mitochondrial DNA special issue (open access!) devoted to the Mexican Barcode of Life Initiative (MexBOL), Mexican scientists and colleagues report on barcoding explorations of their megadiverse fauna and flora. A few highlights:

Martínez-Salazar and León-Règagnon, from University of Guelph and Universidad Nacional Autónoma de México respectively, examined two morphospecies of Langeronia lung flukes that parasitize Mexican frogs, finding three deeply divergent clusters (approximately 8% uncorrected sequence divergence among specimens from the different regions and 0.3% within). Surprisingly, these clusters were observed in both morphospecies (and did not differ among host species). Based on their results, the researchers conclude that the morphospecies are conspecific, perhaps representing alternative developmental pathways, but it may also be that there are species-level biological differences among the clades not yet recognized. Of note, the 368 bp COI fragment analyzed only partly overlaps the standard barcode region; primers effective for trematodes are needed.

Cervantes and colleagues (same institutions as above) demonstrated that Common Opossum (Didelphis marsupialis) and Virginia Opossum (D. virginiana), which are sympatric (live in the same area) in Mexico, are readily distinguished by COI barcode (average K2P distances between species are approximately 8%, and within are 1.5%). These species have diagnostic skull morphology, but external characters are unreliable, making field identification inaccurate. Even museum specimens can be misidentified, including 4 museum specimens in this study.

Zaldívar-Riverón and colleagues from Mexico, Canada, and Argentina applied DNA barcoding to braconid wasps collected during three field trips in 2009 in the Chamela-Cuixmala biosphere reserve, near the Pacific coast. Braconidae is an extraordinarily rich (50,000-150,000 species) family of tiny parasitoid wasps that attack butterfly larvae, with many species exquisitely specialized to a single host. The researchers obtained barcode sequences from 407 of 483 specimens, and applied computer software (Yule coalescent model) to estimate how many species were present, which turned out to be 185! I’m guessing they sorted specimens to select different morphospecies before sequencing, as it seems improbable that there could be 185 different species among 407 randomly-collected specimens. In this short report, the researchers did not comment on how many of the “barcode coalescent species” correspond to known wasp species. The rate of species discovery did not plateau over the course of the study, pointing to many, many more braconid wasps in just this one area.

At the beginning of the barcode initiative, there was worry from some taxonomists that it wouldn’t work. After 8 years, we now know that a comprehensive library built around taxonomic reference specimens unambiguously names 95% of animal species from mayflies to mammals, and resolves the remainder into small sets of closely-related species. Looking ahead, it seems obvious that one or another analytic approach, such as the coalescent modeling described above, will enable construction of a provisional species and higher-level taxonomy from barcode data alone, which will be particularly useful for impossibly diverse or poorly-studied groups, such as nematodes. Experts will improve this draft taxonomy as other information becomes available.

I have long thought that the biggest scientific challenge raised by DNA barcoding is not that it doesn’t work to distinguish some closely-related species, but that it works too well: it reveals that biodiversity is much more finely divided and highly-specialized than we knew. Even in birds, the best studied large taxonomic group, a comprehensive DNA barcoding survey demonstrated that 24% of bird species that live in Europe and North America are composed of isolated populations that have been diverging for more than a million years, likely representing distinct species (Johnsen et al J Ornithol 2010; open access). To my mind, the big picture results so far are an exciting provocation: we need to better understand specialization, that is, how organisms navigate their environment. For example, what signals (chemical, behavioral, acoustic, electric, visual?) enable a wasp to find and distinguish one butterfly larva from another, or to determine whether the larva is already parasitized? Animals can be viewed as highly-discriminating and sensitive detectors of other life forms. A genetic approach might identify parts of the nervous system or sensory system that enable these feats. With better understanding we might construct highly-specific biosensors, say to detect pathogens.

Addendum: Correcting my supposition about whether specimens were sorted prior to barcoding, Dan Janzen tells me that high species counts are the norm when collecting braconid wasps in tropical sites, such that 185 braconid wasp species from 483 specimens is not unusual.