
Exporting data from Australian Faunal Directory on CouchDB

Quick note to self about exporting data from my Australian Faunal Directory on CouchDB project. To export data from a CouchDB view you can use a list function (see Formatting with Show and List). Following the example on the Kanapes IDE blog, I created the following list function:

{
  "_id": "_design/publication",
  "_rev": "14-467dee8248e97d874f1141411f536848",
  "language": "javascript",
  "lists": {
    "tsv": "function(head, req) {
      var row;
      start({
        'headers': {
          'Content-Type': 'text/tsv'
        }
      });
      while (row = getRow()) {
        send(row.value + '\\t' + row.key + '\\n');
      }
    }"
  },
  "views": {
    ...
  }
}


I can use this function with the view below, which lists Australian Faunal Directory publications by UUID ("value"), indexed by DOI ("key").
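For illustration, the map function behind such a view might look something like the sketch below. The field name (doc.doi) and the sample documents are assumptions, not taken from the actual design document; the little emit() harness just simulates what CouchDB's view engine does so the example is self-contained.

```javascript
// Hypothetical map function for the "doi" view: emit the DOI as key
// and the document UUID as value, so each row is (key = DOI, value = UUID).
function doiMap(doc) {
  if (doc.doi) {
    emit(doc.doi, doc._id);
  }
}

// Minimal stand-in for CouchDB's view engine, for illustration only.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Invented sample documents: one with a DOI, one without.
var sampleDocs = [
  { _id: "uuid-1", doi: "10.1071/ZO9780001" },
  { _id: "uuid-2" } // no DOI, so the view skips it
];
sampleDocs.forEach(doiMap);

// Reproduce what the tsv list function sends for each row: value<TAB>key.
var tsv = rows.map(function (r) { return r.value + "\t" + r.key; }).join("\n");
console.log(tsv);
```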

[Screenshot of the view in CouchDB]

I can get the tab-delimited dump from http://localhost:5984/afd/_design/publication/_list/tsv/doi. Note that instead of /afd/_design/publication/_view/doi, which returns the raw view, we request /afd/_design/publication/_list/tsv/doi to get the tab-delimited dump.

I've created files listing DOIs and BioStor ids for publications in the Australian Faunal Directory. I'll play with lists a bit more, especially as I would like to extract the mapping from the Australian Faunal Directory on CouchDB project and add it to the iTaxon project.

DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
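To make the structure (and fragility) of these strings concrete, here is an illustrative parser for the triplet form. This is a sketch of my own, not part of any standard library; the two-part example (FMNH:15468) simply shows the optional collection code being absent.

```javascript
// Parse a Darwin Core triplet of the form
// <institution-code>:<collection-code>:<specimen-id>,
// where the collection code may be omitted (two-part form).
function parseTriplet(s) {
  var parts = s.split(":");
  if (parts.length === 3) {
    return { institution: parts[0], collection: parts[1], specimen: parts[2] };
  }
  if (parts.length === 2) {
    return { institution: parts[0], collection: null, specimen: parts[1] };
  }
  return null; // not a recognisable triplet
}

console.log(parseTriplet("MVZ:Herp:246033"));
console.log(parseTriplet("FMNH:15468"));
```

Note that nothing in the string itself tells you how to resolve it: everything hangs on correctly interpreting the institution code, which is exactly the metadata-dependence the rest of this post worries about.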

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triplet and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded, specimens moved to new institutions, etc. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.





Google doesn't like BioStor anymore

According to Google Analytics BioStor has experienced a big drop in traffic since the start of October:

[Google Analytics traffic graph for BioStor]

At one point I was getting something like 4500 visits a week; now it's just over a thousand. I'm guessing this is due to Google's 'Panda' update. I suspect part of the problem is that, in terms of text content, BioStor is actually pretty thin. For each article there is some metadata and a few links, so it probably looks a little like a link farm. The bulk of the content is in the page images, which, of course, Google can't read.

I'd be interested to know of any other sites in the field that have been affected in the same way (or, indeed, sites which have seen no change in their traffic since October).