Yet more reasons to have specimen identifiers: annotating GenBank sequences

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated. For example, every GenBank sequence has a unique accession number, so if you and I are discussing sequence DQ055738, you and I can be sure we are talking about the same thing.

Sequence DQ055738 is interesting because Hua et al. A Revised Phylogeny of Holarctic Treefrogs (Genus Hyla) Based on Nuclear and Mitochondrial DNA Sequences (http://dx.doi.org/10.1655/08-058R1.1 - note the nice identifier we have for this article) have suggested this sequence (published in http://dx.doi.org/10.1554/05-284.1, another nice identifier) is misidentified. Given these identifiers we could construct various statements, such as:


DQ055738 -> published in -> doi:10.1554/05-284.1
DQ055738 -> annotated by -> doi:10.1655/08-058R1.1

(I've omitted the http:// stuff to keep things legible.) Hua et al. state the following:

However, the tissue number of this specimen (LSUMZ H-19067) is similar to that of a specimen of H. versicolor (LSUMZ H-19077), which appears to have been processed at the same time (C. Austin, personal communication). Therefore, we hypothesize that the sequence data for H. gratiosa used by Smith et al. (2005) were actually from H. versicolor.

It would be nice if we had unique, resolvable identifiers for LSUMZ H-19067 and LSUMZ H-19077 so that we could construct statements linking the sequence, the publications, and the specimens. But we don't. Nor is it obvious how to find out anything more about LSUMZ H-19067 and LSUMZ H-19077. By contrast, for the DOI or the sequence accession I know how to get more information, in either human- or machine-readable form.

The acronym LSUMZ in this case refers to the Louisiana State University Museum of Natural Science Herpetology collection (http://biocol.org/urn:lsid:biocol.org:col:34806). Just to confuse matters, LSUMZ specimens in GBIF use LSU as the acronym for Louisiana State University Museum of Natural Science. Given that GBIF's data comes from LSU itself, it's odd (but not surprising) that there's a muddle about which acronym to use (it would be nice to clear this up, but then anybody building identifiers based on those acronyms is in for some heartbreak).

If I look at GBIF LSUMZ records there aren't specimens with the catalogue numbers H-19067 or H-19077. However, after a bit of poking around, and with the help of a file from GBIF's Tim Robertson, I discovered that the LSUMZ herpetology tissue numbers (which is what the H-* codes actually are) are stored in GBIF. The corresponding specimens are http://data.gbif.org/occurrences/45716232 (LSU Herp 84850, LSUMZ HerpNet Tissue 19067) and http://data.gbif.org/occurrences/45710033 (LSU Herp 84862, LSUMZ HerpNet Tissue 19077). (Note that Hua et al. tell the reader that LSU 84850 = LSUMZ H-19067, but don't give the specimen code for LSUMZ H-19077.)
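
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the alias table and the function name are hypothetical, the lookup is hand-built from the two records mentioned here, and nothing in it queries GBIF.

```python
import re

# Hypothetical alias table: both acronyms are in use for the same collection.
ACRONYM_ALIASES = {"LSU": "LSUMZ", "LSUMZ": "LSUMZ"}

# Hand-built from the two records discussed above:
# (canonical acronym, tissue number) -> GBIF occurrence URL.
TISSUE_TO_OCCURRENCE = {
    ("LSUMZ", 19067): "http://data.gbif.org/occurrences/45716232",
    ("LSUMZ", 19077): "http://data.gbif.org/occurrences/45710033",
}

# Matches strings such as "LSUMZ H-19067": acronym, "H", optional hyphen, digits.
TISSUE_CODE = re.compile(r"^\s*([A-Z]+)\s+H-?\s*(\d+)\s*$")


def resolve_tissue_code(code):
    """Return the GBIF occurrence URL for a cited tissue code, or None if unknown."""
    match = TISSUE_CODE.match(code)
    if match is None:
        return None
    acronym = ACRONYM_ALIASES.get(match.group(1))
    if acronym is None:
        return None
    return TISSUE_TO_OCCURRENCE.get((acronym, int(match.group(2))))


print(resolve_tissue_code("LSUMZ H-19067"))
```

Anything the table doesn't know about comes back as None, which is pretty much the situation a reader of the paper is in without the GBIF file.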

Now I have some resolvable identifiers, so I could construct statements like:


DQ055738 -> voucher -> occurrences/45716232
DQ055738 -> voucher -> occurrences/45710033
|
+-> according to -> doi:10.1655/08-058R1.1

Let's skip over whether this is actually the best way to record the annotation. The point is that we can now start to construct statements that can be linked to the wider world. If someone else has made statements about these specimens, and they used the GBIF URL, then we could aggregate those and learn more about these specimens and their associated sequences. Without globally unique, stable, resolvable identifiers we are left to flounder around in the bowels of various databases searching for something that may or may not be the object being discussed. Isn't it time we did something about this?
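
To make the aggregation idea concrete, here is a minimal Python sketch that stores the statements above as (subject, predicate, object) triples and indexes them by identifier. The triple layout and the function name are my assumptions for illustration, and the "according to" qualifier on the voucher statements is left out to keep things short.

```python
from collections import defaultdict

# Statements from the examples above, as (subject, predicate, object) triples.
STATEMENTS = [
    ("DQ055738", "published in", "doi:10.1554/05-284.1"),
    ("DQ055738", "annotated by", "doi:10.1655/08-058R1.1"),
    ("DQ055738", "voucher", "occurrences/45716232"),
    ("DQ055738", "voucher", "occurrences/45710033"),
]


def index_by_identifier(statements):
    """Map each identifier to every statement that uses it as subject or object."""
    index = defaultdict(list)
    for subject, predicate, obj in statements:
        index[subject].append((subject, predicate, obj))
        index[obj].append((subject, predicate, obj))
    return index


index = index_by_identifier(STATEMENTS)
# Everything we know that mentions one of the GBIF occurrences:
for statement in index["occurrences/45716232"]:
    print(" -> ".join(statement))
```

If another source made statements using the same GBIF occurrence identifier, appending them to the list would be enough for them to show up under that key.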

Making biodiversity data sticky: it's all about links

[Photo: Who invented velcro?]

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is that I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. The diagram below, which I published as part of my Elsevier Challenge entry (preprint, published version), summarises some of the links between diverse kinds of biological data:
[Diagram: Model]
While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or, where such identifiers exist, the fact that they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:
[Diagram: Links]
For example, the iPhylo Linkout project creates links between taxon concepts in NCBI and Wikipedia. The iTaxon project is a mapping between taxonomic names and publications. I've briefly explored mapping host-parasite relationships using GenBank, and I'm currently exploring the links between publications and specimens. This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.
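
The "sticky" idea can be illustrated with a small Python sketch: records that share any identifier end up in the same cluster. The record names below are made up for the example, and the union-find approach is just one convenient way to do the grouping, not a description of how any of the projects above work.

```python
def cluster_by_shared_identifiers(records):
    """Group record names so that records sharing any identifier are clustered together."""
    parent = {name: name for name in records}

    def find(name):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # identifier -> first record that used it
    for name, identifiers in records.items():
        for identifier in identifiers:
            if identifier in first_seen:
                parent[find(name)] = find(first_seen[identifier])
            else:
                first_seen[identifier] = name

    clusters = {}
    for name in records:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical records, each listing the identifiers it mentions.
records = {
    "sequence record": {"DQ055738", "doi:10.1554/05-284.1"},
    "annotation": {"DQ055738", "doi:10.1655/08-058R1.1"},
    "specimen record": {"occurrences/45716232"},
    "voucher link": {"DQ055738", "occurrences/45716232"},
    "unlinked record": {"LSUMZ H-19077"},
}

for cluster in cluster_by_shared_identifiers(records):
    print(sorted(cluster))
```

The record that only carries a local code (LSUMZ H-19077) stays on its own, because nothing else uses that string as an identifier.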

GBIF specimens in BioStor: who are the top ten museums with citable specimens?

Brief update on yesterday's post about finding specimens in BioStor. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text. Of these 143,000 occurrences, 81,000 have been matched to an occurrence in GBIF.

The top ten collections with specimens in BioStor are:

Dataset | Number of specimens
NMNH Vertebrate Zoology Herpetology Collections (National Museum of Natural History) | 11194
Herpetology Collection (University of Kansas Biodiversity Research Center) | 9619
Herpetology Collection (University of Kansas Biodiversity Research Center) | 9328
NMNH Invertebrate Zoology Collections (National Museum of Natural History) | 9061
CAS Herpetology Collection Catalog (California Academy of Sciences) | 6720
MCZ Herpetology Collection (Museum of Comparative Zoology, Harvard University) | 5818
NMNH Vertebrate Zoology Fishes Collections (National Museum of Natural History) | 4642
MCZ Herpetology Collection - Reptile Database (Museum of Comparative Zoology, Harvard University) | 4380
FMNH Herpetology Collections (Field Museum) | 2110
FMNH Fishes Collections (Field Museum) | 2061


This is pretty much what I expected. Virtually complete runs of publications from The Field Museum at Chicago, the University of Kansas, and the Biological Society of Washington are available in BHL, and many of these have been added to BioStor. These journals have extensive taxonomic treatments of vertebrate taxa, particularly frogs, hence herpetology collections dominate the rankings.

There will inevitably be errors in the mapping between specimen codes and GBIF occurrences. I've tried to minimise these by mapping codes within taxonomic groups, but it's clear that there are duplicate codes even within some collections. There is also all manner of variation in the way people cite museum specimens, and these are often different from the codes that appear in GBIF. There will also be issues with extracting specimen codes, and I'm also discovering a few *cough* duplicates of articles in BioStor, so the numbers I present above are liable to change as I clean things up.
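
To give a feel for the extraction problem, here is a deliberately naive Python sketch. The regular expression is an assumption made for illustration (an upper-case acronym, an optional single-letter prefix such as "H-", then digits) and is not the pattern BioStor uses; the sample string is stitched together from codes mentioned elsewhere in this document.

```python
import re

# Assumed pattern: acronym, optional one-letter prefix with hyphen, then 3+ digits.
# Real citations vary far more than this.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,})\s+(?:([A-Z])-)?(\d{3,})\b")


def extract_specimen_codes(text):
    """Return (acronym, prefix, number) tuples for each code-like string in the text."""
    return [
        (m.group(1), m.group(2) or "", int(m.group(3)))
        for m in SPECIMEN_CODE.finditer(text)
    ]


sample = (
    "the tissue number of this specimen (LSUMZ H-19067) is similar to "
    "that of a specimen of H. versicolor (LSUMZ H-19077); "
    "LSU 84850 = LSUMZ H-19067"
)

for code in extract_specimen_codes(sample):
    print(code)
```

Even on this tiny sample the same specimen turns up under two different codes (LSU 84850 and LSUMZ H-19067), which is exactly the sort of variation that makes the mapping to GBIF error-prone.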

But one could imagine a "league table" of museum collections, where we can measure both the extent to which those collections have been digitised, and the extent to which material from those collections has been cited. We could use this to compute measures of the impact of a collection.
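
As a first rough cut at such a league table, the Python sketch below sums the top-ten counts above by institution. Grouping by institution is my assumption for the example, the counts are the provisional ones listed above (including the two Kansas rows exactly as they appear), and it says nothing about the extent of digitisation, so treat it as a toy rather than a real impact measure.

```python
from collections import Counter

# (dataset, institution, specimens cited), copied from the top-ten list above.
TOP_TEN = [
    ("NMNH Vertebrate Zoology Herpetology Collections", "National Museum of Natural History", 11194),
    ("Herpetology Collection", "University of Kansas Biodiversity Research Center", 9619),
    ("Herpetology Collection", "University of Kansas Biodiversity Research Center", 9328),
    ("NMNH Invertebrate Zoology Collections", "National Museum of Natural History", 9061),
    ("CAS Herpetology Collection Catalog", "California Academy of Sciences", 6720),
    ("MCZ Herpetology Collection", "Museum of Comparative Zoology, Harvard University", 5818),
    ("NMNH Vertebrate Zoology Fishes Collections", "National Museum of Natural History", 4642),
    ("MCZ Herpetology Collection - Reptile Database", "Museum of Comparative Zoology, Harvard University", 4380),
    ("FMNH Herpetology Collections", "Field Museum", 2110),
    ("FMNH Fishes Collections", "Field Museum", 2061),
]


def league_table(rows):
    """Sum cited-specimen counts per institution, highest total first."""
    totals = Counter()
    for _dataset, institution, count in rows:
        totals[institution] += count
    return totals.most_common()


for institution, total in league_table(TOP_TEN):
    print(f"{total:6d}  {institution}")
```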

But for now I'm browsing the results trying to get a sense of how successful the mapping has been. There are some interesting examples. The specimen codes extracted from the article Review Of The Chewing Louse Genus Abrocomophaga (Phthiraptera : Amblycera), With Description Of Two New Species are those for the mammalian hosts of the lice. Hence someone viewing the records for these specimens and following the link to this paper would discover that these mammals had parasitic lice. If we add other sorts of links to the mix, such as between specimens and DNA sequences, then we can start to build a rich network of connections between the basic data of biodiversity.