DNA barcodes suggest fractal nature of genome

Growing data sets demonstrate that DNA barcoding usually works, but why? Why does a very short stretch of DNA, such as a DNA barcode, which usually represents less than one one-millionth of the genome, enable identification of most animal species? In computer language, Rod Page describes a DNA barcode as “embedded metadata”. Here I suggest an analogy to fractals, which might help convey what DNA barcodes reveal about how genomes are constructed.

DNA barcoding usually works because patterns seen in very short DNA sequences usually reflect patterns seen in longer sequences. In this way, DNA barcodes demonstrate “self-similarity”, a fundamental property of fractals. In the March 28, 2007 issue of PLoS One, researchers from Concordia University, Quebec, analyzed 849 complete animal mitochondrial genomes, comparing GC composition in the 648 bp COI barcode region to GC composition in the mitochondrial genome as a whole. Min and Hickey found “such short sequences can yield important, and surprisingly accurate, information about the [mitochondrial] genome as a whole. In other words, for unsequenced genomes, the DNA barcodes can provide a quick preview of the whole genome.” It will be of great interest to extend this analysis to compare mitochondrial barcodes to nuclear genomes; the general success of the barcoding approach suggests there will be a similarly close correlation.
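To make the comparison concrete, here is a minimal Python sketch of the kind of calculation involved: GC composition of a short window versus GC composition of the whole sequence, followed by a correlation across genomes. It is not Min and Hickey's actual pipeline. The toy "genomes", the window coordinates, and all function names are invented for illustration; a real analysis would use complete mitochondrial genomes and the annotated 648 bp COI barcode region.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases in sequence")
    return gc / (gc + at)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented toy sequences standing in for mitochondrial genomes (assumption:
# real genomes are ~16,000 bp and would be read from sequence files).
TOY_GENOMES = {
    "toy_species_1": "ATATATGCAT" * 20,
    "toy_species_2": "ATGCATGCAT" * 20,
    "toy_species_3": "GCGCATGCGC" * 20,
}

# Hypothetical window standing in for the barcode region.
BARCODE_START, BARCODE_LENGTH = 3, 25


def barcode_vs_genome(genomes, start, length):
    """Return (barcode GC, whole-genome GC) pairs, one per genome."""
    pairs = []
    for seq in genomes.values():
        barcode = seq[start:start + length]
        pairs.append((gc_fraction(barcode), gc_fraction(seq)))
    return pairs


if __name__ == "__main__":
    pairs = barcode_vs_genome(TOY_GENOMES, BARCODE_START, BARCODE_LENGTH)
    barcode_gc = [p[0] for p in pairs]
    genome_gc = [p[1] for p in pairs]
    for name, (b, g) in zip(TOY_GENOMES, pairs):
        print(f"{name}: barcode GC = {b:.2f}, genome GC = {g:.2f}")
    print(f"correlation r = {pearson_r(barcode_gc, genome_gc):.3f}")
```

Because the toy sequences are simple repeats, the short window tracks the whole sequence closely by construction; the interesting empirical result is that real genomes behave similarly.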

Overall, the patterning of barcode differences supports the emerging view that selective sweeps prune mitochondrial diversity within species and that mitochondrial and nuclear evolution are tightly linked.

Avian genetic data legacy relatively modest, jumbled

Birds have been the subject of intense genetic study for over 20 years. How large is the legacy of avian genetic data? Researchers studying species-level differences in birds and other vertebrates have most often used cytochrome b (cyt b). Some have worried that analyzing COI in birds is redundant, as there are already cyt b sequences for many avian species. Of course, even if there were a complete set of cyt b sequences for the approximately 10,000 species of world birds, it would still be beneficial to collect COI barcodes to enable wide comparisons across the diversity of life. Here I look briefly at what is in GenBank for birds. To do so, I prepared a spreadsheet, “avian name look-up.xls”, which recognizes 2,347 synonyms, alternate spellings, misspellings, and extinct species, producing an output list of extant bird names harmonized to Clements.

There are more than 1 million avian sequences in GenBank, but over 900,000 are from the Jungle Fowl (i.e. chicken, Gallus gallus), and another 85,000 are from Zebra finch (Taeniopygia guttata) and Wild turkey (Meleagris gallopavo). That leaves about 67,000 sequences in total representing the rest of the approximately 10,000 species of world birds. According to Clements’ Birds of the World (including updates through 2006), there are 9,919 recognized species. The other world lists are very similar, and differ primarily in whether certain forms are recognized as species or subspecies and in assignment of generic names. I find it surprising there is not a single global taxonomic authority for bird species status, names, spelling, and generic and family classification. As a comparison, medicine would be in great difficulty if there were not a single standard nomenclature for pathogenic bacteria.

62,571 of the remaining 66,969 sequences are in the “CoreNucleotide” database (the others are unnamed genetic loci, either Expressed Sequence Tag (EST) or Genome Survey Sequence (GSS) records, and these will not be considered further here).  Only 4,951 bird species are represented by any sequence (50% of world birds), and there are cytochrome b sequences for only 2,751 species (28% of world birds). Of species with cyt b sequences, 60% are represented by single sequences.

How does this compare to COI barcode data so far? As tracked on the All Birds Barcoding Initiative website, researchers have collected 8,353 COI barcode records from 1,730 species, including 2 or more sequences from approximately 80% of species analyzed to date.
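The coverage figures above are simple ratios, and they are easy to recompute. The sketch below uses only counts quoted in the text, with the 9,919 species in Clements as the denominator. The variable and function names are my own, and the COI percentage is derived here rather than quoted from any source.

```python
WORLD_BIRD_SPECIES = 9919  # Clements, including updates through 2006

# Species counts quoted in the text.
SPECIES_WITH_ANY_SEQUENCE = 4951   # GenBank CoreNucleotide, any locus
SPECIES_WITH_CYT_B = 2751          # GenBank, cytochrome b
SPECIES_WITH_COI_BARCODE = 1730    # All Birds Barcoding Initiative
COI_BARCODE_RECORDS = 8353


def percent_of_world_birds(species_count):
    """Share of recognized world bird species, as a percentage."""
    return 100 * species_count / WORLD_BIRD_SPECIES


if __name__ == "__main__":
    for label, count in [
        ("any GenBank sequence", SPECIES_WITH_ANY_SEQUENCE),
        ("cyt b", SPECIES_WITH_CYT_B),
        ("COI barcode", SPECIES_WITH_COI_BARCODE),
    ]:
        print(f"{label}: {count} species, "
              f"{percent_of_world_birds(count):.0f}% of world birds")
    mean_records = COI_BARCODE_RECORDS / SPECIES_WITH_COI_BARCODE
    print(f"COI records per barcoded species: {mean_records:.1f}")
```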

Virtues of the DNA barcode data set include that sequences are linked to vouchered museum specimens and their associated collecting data, that sequence records include trace files to confirm sequencing accuracy, and, most important, that all sequences can be directly compared because they derive from a standardized region. GenBank cyt b files, by contrast, include sequences of varying length and position along the gene. An alignment of 1000 avian COI barcodes and 1000 avian cyt b sequences hints at the power of a standardized approach.

Encyclopedia of Life dream comes true

A dream of many came to life this week with the launch of the Encyclopedia of Life. In the words of E.O. Wilson, “imagine an electronic page for each species of organism on Earth available everywhere by single access on command”. The Encyclopedia of Life is a global effort to document all 1.8 million named species of plants and animals on Earth in a free online resource. With support from the John D. and Catherine T. MacArthur Foundation and the Alfred P. Sloan Foundation, scientists from many institutions, including the Field Museum of Natural History, Harvard University, the Marine Biological Laboratory, the Smithsonian Institution, and the Biodiversity Heritage Library, have joined together to initiate the project. Like Wikipedia, the Encyclopedia of Life aims to draw on the global pool of expertise, allowing users to add information and details, such as species sightings and photos, with the content authenticated by scientists. From EOL’s home page:

“Comprehensive, collaborative, ever-growing, and personalized, the Encyclopedia of Life is an ecosystem of websites that makes all key information about life on Earth accessible to anyone, anywhere in the world. Our goal is to create a constantly evolving encyclopedia that lives on the Internet, with contributions from scientists and amateurs alike. To transform the science of biology, and inspire a new generation of scientists, by aggregating all known data about every living species. And ultimately, to increase our collective understanding of life on Earth, and safeguard the richest possible spectrum of biodiversity.”

To highlight just one component of the project, the Scanning and Digitization Group is addressing the critical need for wider access to published literature, including older works. At present, “to identify a rare specimen, a biologist may need to consult a 100 year-old text because that was the last time the species was found, described, and recorded. This essential historical reference gives exceptional value to the libraries encompassed by the partners of the Biodiversity Heritage Library [a collaboration of ten natural history museums, herbaria, and research institutions]. Today, mainly those few who can enter their library doors can read the wealth of the world’s publications held within. This effectively hides this storehouse of knowledge about biodiversity from a range of applications, including research, education, taxonomy, disease control, and the maintenance and protection of ecosystems.”

The Scanning and Digitization Group will accelerate the work of the Biodiversity Heritage Library, an ongoing effort which has already digitized 1.25 million pages, enabling “citizens unaffiliated with major institutions to search, read, and download articles previously unavailable to them. Educators can guide students’ biological research with a wealth of examples incorporated in lesson plans and assignments. Illustrations in rare taxonomic works can inspire artists. The openly available Biodiversity Heritage Library will link the great biodiversity in tropical and developing countries to literature about biodiversity primarily held in a few North American and European libraries, a significant intellectual repatriation.”

I believe DNA barcode libraries will provide an essential genetic “index” for locating species pages in the Encyclopedia of Life. The best-trained human mind can identify a few thousand species. Comprehensive DNA barcode libraries and inexpensive, portable sequencing devices will enable anyone to find the EOL page for any of millions of species, regardless of life stage, gender, or whether the specimen is in bits and pieces.

There is a thrilling launch video; do not miss it!

100 million years without sex: COI clustering in bdelloid rotifers challenges theories of how species are formed and maintained

[Image: Philodina roseola, 400 µm, Meselson Laboratory]

Why are there species? The usual answer is sex: reproductive isolation maintains differences between species and reproductive mixing maintains similarity within species. According to recent work with bdelloid (the “b” is silent) rotifers, a group of microscopic invertebrates thought to have adopted asexuality 100 million years ago, sex is not necessary! In September 2005 Hydrobiologia 546:29, researchers at the University of Arizona analyzed mitochondrial COI of 102 females of 21 morphologically defined species of bdelloid rotifers, including many sympatric morphospecies. Contrary to predictions of evolutionary theory for asexual organisms, Birky et al show that these are 21 independently evolving clades, with small differences within and large differences among lineages, the same patterning seen in COI analyses of sexually reproducing species. Also contrary to predictions, the Ka/Ks ratio (nonsynonymous to synonymous substitutions, i.e. amino acid-changing versus silent mutations) indicates that COI is subject to strong selection. [In asexual organisms, there is less need for sampling multiple genes because the entire genome is a single linkage unit. Thus genetic differences in COI are expected to reflect evolutionary history of the organism, i.e. the “gene tree” is expected to be the same as the “species tree.”] (For fun see Birky lab bdelloid video!)
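The pattern of “small differences within and large differences among lineages” can be checked with very simple arithmetic. The sketch below computes uncorrected pairwise distances (the proportion of differing sites) and separates within-label from between-label comparisons. It is an illustration only, not the method used by Birky et al; the species labels and 20-base sequences are invented, and real barcodes would be aligned COI sequences of several hundred bases.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has an ambiguous base or gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def within_and_between(records):
    """Split all pairwise distances into same-label and different-label lists.

    records: list of (species_label, aligned_sequence) tuples.
    """
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(p_distance(s1, s2))
    return within, between


# Invented toy data: two hypothetical morphospecies, two individuals each.
TOY_RECORDS = [
    ("toy_sp_A", "ACGTACGTACGTACGTACGT"),
    ("toy_sp_A", "ACGTACGTACGTACGTACGA"),
    ("toy_sp_B", "ACGAACGAACCTACGTTCGT"),
    ("toy_sp_B", "ACGAACGAACCTACGTTCGA"),
]

if __name__ == "__main__":
    within, between = within_and_between(TOY_RECORDS)
    print(f"largest within-species distance:   {max(within):.2f}")
    print(f"smallest between-species distance: {min(between):.2f}")
```

When the largest within-species distance is well below the smallest between-species distance, as in this contrived example, sequences cluster cleanly by species.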

[Image: Fontaneto et al, PLoS Biol, April 2007]

In April 2007 PLoS Biology, researchers from the University of Milan, Italy, Imperial College London, the University of Cambridge, and the Royal Botanic Gardens, Kew, analyzed morphometric and molecular data of a comprehensive international sampling of Rotaria sp. bdelloid rotifers. All 9 morphologically defined taxonomic species form monophyletic clades in genetic analysis. Multiple clusters in several morphospecies show distinct morphometric measurements of mouth parts, suggesting these represent cryptic species with ecological specialization. Fontaneto et al observe “bdelloids display the same qualitative pattern of genetic and morphological clusters, indicative of diversification into independently evolving and distinct entities, as found in sexual clades” and conclude “this refutes the idea that sex is necessary for diversification into evolutionary species.”

In these studies, COI sequences accurately identify bdelloid rotifer species, further demonstrating the robustness of DNA barcoding. What is scientifically exciting is how broad application of standardized, minimalist genetic analysis (aka DNA barcoding), combined with traditional morphologic and ecologic study, is providing radical new insights into biology.

Fungal database a Tower of Babel, needs rebuilding

Early in Michael Crichton’s 1990 novel Jurassic Park, Dr. Henry Wu, chief scientist at Jurassic Park Research Institute, showing visitors around his facility, displays “the actual structure of a small fragment of dinosaur DNA”. Astute readers pointed out that Dr. Wu’s dinosaur genetic resuscitation project was unlikely to succeed, as the sequence in Crichton’s novel was a fragment of the bacterial plasmid pBR322. They discovered this by feeding the “dinosaur sequence” into the online BLAST software engine, which searches the billions of base pairs of nucleotide sequences deposited in the amazing public resource of GenBank and the other international genetic databases, EMBL and DDBJ.

The power of genetic databases as identification tools rests on the quality of sequences and their annotations.  Just as we need regularly updated maps for safe navigation, we need regularly updated genetic databases for accurate identifications.

[Image: The Tower of Babel, Pieter Bruegel the Elder, 1563]

One of the strengths of GenBank is that it serves as a permanent repository for genetic sequence data. As a result, GenBank is sometimes a permanent repository for faulty data. In a recent PLoS One paper, researchers from Göteborg University and Chalmers University of Technology, Sweden, and the University of Tartu, Estonia, examined the taxonomic reliability of the 51,534 fungal internal transcribed spacer (ITS) sequences in the International Nucleotide Sequence Database (INSD, i.e. GenBank, EMBL, DDBJ). ITS is the most widely used locus for species identification in fungi. The results show a “variegated picture of the taxonomic status of publicly indexed fungal sequences”. Taxonomic coverage is sparse: of the estimated 1.5 million fungi, fewer than 1% (9,684 species) are represented. Taxonomic data are lacking for many sequences (27% are not identified to species level), and most of the species-level identifications are unverifiable (82% are not linked to voucher specimens, 63% are not tagged with specimen country of origin, and 42% are marked as unpublished). Sequence comparisons suggest mislabeling is common (11% show best matches to congeneric but heterospecific sequences, and another 7% match among species of a different genus). Overall, 10-21% of the INSD sequences have incorrect or unsatisfactory annotations.

It seems better to start over than to try to revise this Tower of Babel. Nilsson et al conclude “the large body of insufficiently identified fungi in INSD constitutes a silent plea for a wide and generalized sequencing effort of well-identified and -annotated [type] specimens residing in herbaria worldwide.” Toward this end, an All-Fungi Barcoding Initiative Workshop will be held 14-15 May 2007 at the Smithsonian Center for Research and Conservation, Front Royal, Virginia. An international group of researchers aims to hammer out how to build a reliable database, including which gene(s) should be adopted as standard barcode targets.

So far, DNA-based fungal identifications have primarily used ITS. Other nuclear genes have been used in some studies, including the nuclear large ribosomal subunit, beta-tubulin, and elongation factor 1-alpha. It would be excellent if the fungal barcode database could link directly with those being built around the mitochondrial gene COI, which is effective for resolving most protozoan and metazoan (multicellular animal) species examined so far. In this regard it is exciting that a report by Seifert et al in the 6 March 2007 Proc Natl Acad Sci USA shows that COI provides species-level resolution similar to that for ITS, that amplification was generally straightforward, and that introns in the COI gene were found in only 2 of 370 Penicillium strains.

Chauncey Starr

Chauncey Starr, who helped launch the Program for the Human Environment, passed away on 18 April 2007, shortly after his 95th birthday celebration. In fond memory of Chauncey, we post his biography, a list of key papers, and obituary. At Chauncey’s 90th birthday, Jesse offered a brief tribute.

Resolving leaves and branches?

COI barcoding is a standardized approach to identifying species by DNA, helping resolve the “leaves” on the tree of life. Will the growing arrays of COI sequences also help provide insight into evolutionary history, the “branches” of the tree? I am struck that in some cases, simple genetic arithmetic with COI sequences creates trees very similar to modern phylogenies painstakingly created from multiple nuclear and mitochondrial genes, multiple morphologic characters, and exhaustive computerized analysis. Shown at right, a neighbor-joining analysis of COI barcodes of 264 species of North American birds creates a tree that is quite similar to the most recent understanding of relationships among modern birds, with Anseriformes (ducks and geese) next to Galliformes (grouse and quail) at the top, Passeriformes (perching birds) at the bottom, and most of the other established orders appearing as single lineages in between. Most of the families and groupings of families within these orders also match current understanding, including, for example, that flycatchers appear as the basal lineage within Passeriformes, and that a group of New World passerines called nine-primaried oscines appears together at the bottom of the tree. Of course, a systematist would immediately note that the statistical support for these higher-order branches is weak or absent. I should tread lightly here or not at all, since phylogenetics is the province of mathematical experts, but I will plunge ahead anyway and suggest that, if the single-gene neighbor-joining distance tree is “right” say 8 times out of 10, then the apparent lack of statistical support for higher-order branches is misleading. That said, there are cases where the COI gene tree is incorrect (e.g. Hajibabaei et al, Genome 49:851, 2006).
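For readers curious what the “simple genetic arithmetic” looks like, here is a bare-bones Python sketch of neighbor joining that returns only the branching order as nested tuples. It omits branch lengths and any statistical support, and it is not the software used for the bird tree. The five taxon names and their distances are a small made-up matrix, not real COI data; in practice the distances would be computed from aligned barcode sequences.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build a tree topology by neighbor joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b,
           for every pair of labels.
    Returns nested tuples; each tuple is one joined pair of nodes.
    """
    nodes = list(names)
    d = dict(dist)  # copy, so the caller's matrix is not modified
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Join the pair that minimizes Q(a, b) = (n - 2) d(a, b) - r_a - r_b.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c != a and c != b:
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))]
                    + d[frozenset((b, c))]
                    - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)


# Made-up distances among five hypothetical taxa (not real barcode data).
TOY_NAMES = ["a", "b", "c", "d", "e"]
TOY_DIST = {
    frozenset(pair): value
    for pair, value in {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7,
        ("d", "e"): 3,
    }.items()
}

if __name__ == "__main__":
    print(neighbor_joining(TOY_NAMES, TOY_DIST))
```

In this toy matrix, taxa a and b are grouped together, as are d and e. Ties in the joining criterion are broken by input order here, which does not change the unrooted topology for this example but is a simplification a real implementation would handle more carefully.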

I close with a picture inspired by the data. If single-gene trees usually correspond to evolutionary history, this implies that strong barriers to gene flow arise concurrently with differences in the single gene and are continuously maintained.