Early in Michael Crichton’s 1990 novel Jurassic Park, Dr. Henry Wu, chief scientist at Jurassic Park Research Institute, shows visitors around his facility and displays “the actual structure of a small fragment of dinosaur DNA”. Astute readers pointed out that Dr. Wu’s dinosaur genetic resuscitation project was unlikely to succeed, as the sequence in Crichton’s novel was a fragment of the bacterial plasmid pBR322. They discovered this by feeding the “dinosaur sequence” into the online BLAST search tool, which searches the billions of base pairs of nucleotide sequences deposited in GenBank, an amazing public resource, and in the other international genetic databases, EMBL and DDBJ.
The power of genetic databases as identification tools rests on the quality of sequences and their annotations. Just as we need regularly updated maps for safe navigation, we need regularly updated genetic databases for accurate identifications.
One of the strengths of GenBank is that it serves as a permanent repository for genetic sequence data. As a result, GenBank is sometimes a permanent repository for faulty data. In a recent PLoS One paper, researchers from Goteborg University and Chalmers University of Technology, Sweden, and University of Tartu, Estonia, examined the taxonomic reliability of the 51,534 fungal internal transcribed spacer (ITS) sequences in the International Nucleotide Sequence Database (INSD; i.e., GenBank, EMBL, DDBJ). ITS is the most widely used locus for species identification in fungi. The results show a “variegated picture of the taxonomic status of publicly indexed fungal sequences”. Taxonomic coverage is sparse: of the estimated 1.5 million fungal species, fewer than 1% (9,684 species) are represented. Taxonomic data is lacking for many sequences (27% are not identified to species level), and most of the species-level identifications are unverifiable (82% are not linked to voucher specimens, 63% are not tagged with specimen country of origin, and 42% are marked as unpublished). Sequence comparisons suggest mislabeling is common (11% show best matches to congeneric but heterospecific sequences, and another 7% best match species of a different genus). Overall, 10-21% of the INSD sequences have incorrect or unsatisfactory annotations.
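For readers who want to check the coverage arithmetic, here is a minimal Python sketch using only the figures quoted above. The function names are my own, and the count for the 27% figure is only approximate: it assumes the percentage is taken over all 51,534 sequences and it starts from a rounded number.

```python
# Figures quoted in the paragraph above.
TOTAL_SEQUENCES = 51_534        # fungal ITS sequences examined
SPECIES_REPRESENTED = 9_684     # species with at least one ITS sequence
ESTIMATED_SPECIES = 1_500_000   # estimated number of fungal species
PCT_NOT_TO_SPECIES = 27         # reported percent not identified to species level


def coverage_percent(represented: int, estimated: int) -> float:
    """Share of the estimated species that are represented, in percent."""
    return 100 * represented / estimated


def approximate_count(percent: float, total: int) -> int:
    """Rough sequence count implied by a rounded percentage.

    Assumption: the percentage is taken over `total` sequences.
    """
    return round(total * percent / 100)


if __name__ == "__main__":
    coverage = coverage_percent(SPECIES_REPRESENTED, ESTIMATED_SPECIES)
    unidentified = approximate_count(PCT_NOT_TO_SPECIES, TOTAL_SEQUENCES)
    print(f"Taxonomic coverage: {coverage:.2f}% of estimated species")
    print(f"Not identified to species level: roughly {unidentified:,} sequences")
```

The coverage works out to about 0.65%, consistent with the "fewer than 1%" figure.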
It seems better to start over than to try to revise this Tower of Babel. Nilsson et al. conclude “the large body of insufficiently identified fungi in INSD constitutes a silent plea for a wide and generalized sequencing effort of well-identified and -annotated [type] specimens residing in herbaria worldwide.” Toward this end, an All-Fungi Barcoding Initiative Workshop will be held 14-15 May 2007 at the Smithsonian Center for Research and Conservation, Front Royal, Virginia. An international group of researchers aims to hammer out how to build a reliable database, including which gene(s) should be adopted as standard barcode targets.
So far, DNA-based fungal identifications have primarily used ITS. Other nuclear genes have been used in some studies, including the nuclear large ribosomal subunit, beta-tubulin, and elongation factor 1-alpha. It would be excellent if the fungal barcode database could link directly with those being built around the mitochondrial gene COI, which is effective for resolving most protozoan and metazoan (multicellular animal) species examined so far. In this regard it is exciting that a report by Seifert et al. in the 6 March 2007 Proc Natl Acad Sci USA shows that COI provides species-level resolution similar to that of ITS, that amplification was generally straightforward, and that introns in the COI gene were found in only 2 of 370 Penicillium strains.