
DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around over the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
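To make the structure concrete, here is a minimal Python sketch of splitting a triplet into its parts. The names Triplet and parse_triplet are my own for illustration, not part of Darwin Core or any library, and I'm assuming the colon never appears inside a field.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    """Hypothetical container for the parts of a Darwin Core triplet."""
    institution_code: str
    collection_code: Optional[str]  # the collection code is optional
    specimen_id: str


def parse_triplet(text: str) -> Triplet:
    """Split '<institution>:<collection>:<specimen>' into its fields.

    Assumes ':' is only ever used as the separator.
    """
    parts = [part.strip() for part in text.split(":")]
    if len(parts) == 3:
        institution, collection, specimen = parts
    elif len(parts) == 2:
        # No collection code supplied
        institution, specimen = parts
        collection = None
    else:
        raise ValueError(f"Not a Darwin Core triplet: {text!r}")
    if not institution or not specimen:
        raise ValueError(f"Missing institution code or specimen id: {text!r}")
    # Treat an empty middle field ('MVZ::246033') as "no collection code"
    return Triplet(institution, collection or None, specimen)


if __name__ == "__main__":
    print(parse_triplet("MVZ:Herp:246033"))
```

Note that nothing in the string itself tells you where the specimen record lives; the parser can only hand back three pieces of metadata.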

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triplet and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).
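As a rough illustration of what such a resolver boils down to, here is a Python sketch that looks up the institution code in a table of URL templates. The table and the function name are hypothetical; the single entry simply reuses the Arctos URL pattern from the MVZ example above, and a real service would need a far larger, curated registry.

```python
from typing import Optional

# Hypothetical registry mapping institution codes to URL templates.
# The one entry reuses the Arctos URL pattern quoted earlier in this post.
RESOLVER_TEMPLATES = {
    "MVZ": "http://arctos.database.museum/guid/{triplet}",
}


def resolve_specimen(triplet: str) -> Optional[str]:
    """Return a URL for the specimen, or None if the acronym is unknown."""
    triplet = triplet.strip()
    institution_code = triplet.split(":", 1)[0].strip()
    template = RESOLVER_TEMPLATES.get(institution_code)
    if template is None:
        # Unknown, renamed, or misspelled acronym: the link cannot be made
        return None
    return template.format(triplet=triplet)


if __name__ == "__main__":
    print(resolve_specimen("MVZ:Herp:246033"))
    print(resolve_specimen("XYZ:12345"))  # made-up acronym, not in the table
```

Everything hinges on the lookup table staying in step with the metadata it is keyed on.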

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), and journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded and specimens moved to new institutions. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.





Google doesn't like BioStor anymore

According to Google Analytics BioStor has experienced a big drop in traffic since the start of October:

[Figure: Google Analytics chart of BioStor traffic]

At one point I was getting something like 4500 visits a week; now it's just over a thousand a week. I'm guessing this is due to Google's 'Panda' update. I suspect part of the problem is that in terms of text content BioStor is actually pretty thin. For each article there is some metadata and a few links, so it probably looks a little like a link farm. The bulk of the content is in the page images, which, of course, Google can't read.

I'd be interested to know of any other sites in the field that have been affected in the same way (or, indeed, sites which have seen no change in their traffic since October).

These are my species - finding the taxonomic names I published using Mendeley

The latest addition to my mapping of taxonomic names to the literature (http://iphylo.org/~rpage/itaxon/) is the ability for authors with Mendeley accounts to find the names they've published. This is an extension of the "I wrote that" tool I developed earlier.

Let's say I want to show the names that a given author has published. I could search by that author's name, but that raises all sorts of issues (see my earlier posts ReaderMeter: what's in a name? and Equivalent author names), especially for this database where I have incomplete citations and in many cases lack author names beyond surname.

Another way to tackle the problem is if I have a list of publications for an author, then all I need to do is match that list to the publications in my taxonomic database. If both lists have identifiers for the publications, such as DOIs, then the task is trivial. But, where do I get these lists?
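As a rough sketch of that matching step (not the code behind the site), here is how two DOI-keyed lists could be joined in Python. The function names and the sample data are hypothetical; the only assumptions are that DOIs compare case-insensitively and that some sources store them with a resolver prefix.

```python
from typing import Dict, Iterable, List

# Prefixes that may wrap a bare DOI, in lower case
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi: str) -> str:
    """Lower-case a DOI and strip any resolver prefix so DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def names_published_by(
    author_dois: Iterable[str],
    names_by_doi: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return the taxonomic names for each of the author's DOIs that
    also appears in the taxonomic database (names_by_doi)."""
    index = {normalise_doi(doi): names for doi, names in names_by_doi.items()}
    matches = {}
    for doi in author_dois:
        key = normalise_doi(doi)
        if key in index:
            matches[key] = index[key]
    return matches


if __name__ == "__main__":
    # Placeholder data: the names are stand-ins, not real database records,
    # and the second DOI is made up.
    database = {"10.1046/j.1365-3113.1996.d01-13.x": ["name A", "name B"]}
    my_papers = [
        "http://dx.doi.org/10.1046/j.1365-3113.1996.d01-13.x",
        "10.9999/not-in-database",
    ]
    print(names_published_by(my_papers, database))
```

Publications without an identifier on either side simply never match, which is why this approach only works as well as the identifier coverage of both lists.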

An obvious source is Mendeley, where people are building lists of their own publications (as well as other publications that they are interested in). For example, my publications are listed at http://www.mendeley.com/profiles/roderic-page/.

But I don't want to have to get these lists myself. I'd much rather a Mendeley user could go to my taxonomic database and say "I have this Mendeley account, show me the names I've published". One reason I'd like to do this is that if I want people to engage with this project it would be nice to be able to offer an immediate reward, in this case, a place where you can show your contribution to the task of cataloguing life on this planet.

Finding my taxonomic names

If you have a Mendeley account here's what you do:

Go to http://iphylo.org/~rpage/itaxon/. At the top right you will see a "Sign in using Mendeley" link.

[Screenshot: the "Sign in using Mendeley" link]
Click this and you will be taken to Mendeley where you will be asked if you'd like to allow http://iphylo.org/~rpage/itaxon/ to connect to your account (if you're already logged in to Mendeley then you'll see an Accept button, otherwise Mendeley will ask you to log in).

[Screenshot: Mendeley's authorisation page]
If you click on Accept then you will be taken back to my site and you should now see your profile name and picture on the top right:

[Screenshot: profile name and picture at the top right]

If you click on the Profile link then my site will talk to Mendeley, get a list of your papers, and look for them in my database. If it finds a paper it outputs the taxonomic names published in that paper. For example, here is my profile:

[Screenshot: profile page listing taxonomic names]

Listed are the species of bird lice in the genus Dennyus described in a paper on which I was a coauthor (http://dx.doi.org/10.1046/j.1365-3113.1996.d01-13.x).

This list is incomplete: my earlier papers on crab and isopod taxonomy aren't listed because they lack identifiers. This is something I need to work on, but for now this seems like a simple way to enable someone to go to the http://iphylo.org/~rpage/itaxon/ mapping between taxonomic names and literature and find the names they've authored.

If you have a Mendeley account, and your list of publications in Mendeley includes papers describing new animal species, go to http://iphylo.org/~rpage/itaxon/ and try it out.