
EOL challenge draft proposal

In the spirit of the Would you give me a grant experiment? [1] here's the draft of a proposal I'm working on for the Computable Data Challenge. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to link names to the primary literature. Comments welcome (the proposal deadline is next week).

The proposal is embedded below using Google's PDF viewer. If you can't see it, try logging into your Google account, or click here.



1. The answer from NERC was a resounding "no".

Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this:

Obsolete

So the sequence is hidden by default, although you can still retrieve it by clicking on the obsolete version link.

It's an extraordinary state of affairs that a huge slice of fundamental biodiversity data has been effectively "pulled" from view.

Update: Sujeevan Ratnasingham from iBOL has pointed out that the sequence I'd used above (GU679674) was not one of the ones hidden by NCBI; rather, it was suppressed at the request of the investigator (which I'd have realised if I'd paid more attention to the screenshot). HQ918317 is an example of a BOLD record that was suppressed.

Quick thoughts on specimen identifiers

Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two.

Cheap and resolvable means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do.

Cheap and persistent means things like Darwin Core triplets or URNs. You can write things on paper and they will persist (the Biodiversity Heritage Library shows us that), but how in the digital era do we do anything with this? If it's not resolvable what, exactly, is the point? We tried URNs, even ones that were resolvable (LSIDs), and that was a disaster (we learnt a lot, but what a mess).

Resolvable and persistent. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by CrossRef and DataCite, and will also play ball with linked data folks.

In practical terms, what if we had a convention that each collection gets its own DOI prefix "10.nnnnn", after which it appends whatever specimen identifier makes sense (and is unique within that collection)?

The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in BHL and GBIF as biomedical databases).

If we wrote this as a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:

10.nnnnn/MAL/ANSP332467

where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding (so if the specimen was moved from its current collection to a new institution, people wouldn't get too upset by the presence of "ANSP"). It would also help make the links between specimen codes and DOIs. We couldn't rely on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences' malacological collection, but it would be a good place to start looking.
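To make the convention concrete, here is a minimal Python sketch of how such identifiers might be assembled and taken apart again. Everything in it is an assumption for illustration: the function names are made up, "10.12345" stands in for the placeholder prefix "10.nnnnn", and the parser assumes the institution code is purely alphabetic and the catalogue number starts with a digit (true for "ANSP 332467", but not something we could rely on in general).

```python
import re


def specimen_doi(prefix: str, collection: str, institution: str, catalogue_number: str) -> str:
    """Build a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number.

    Hypothetical helper: the layout follows the convention sketched in the text,
    not any existing GBIF, CrossRef, or DataCite rule.
    """
    if not re.fullmatch(r"10\.\d{4,}", prefix):
        raise ValueError(f"not a DOI prefix: {prefix!r}")
    # "ANSP 332467" is usually written with a space; the DOI form drops it.
    code = f"{institution}{catalogue_number}".replace(" ", "")
    return f"{prefix}/{collection}/{code}"


def split_specimen_doi(doi: str) -> tuple[str, str, str, str]:
    """Recover (prefix, collection, institution, catalogue number) from such a DOI.

    Assumes an alphabetic institution code followed by a catalogue number that
    begins with a digit. This is only a good first guess, not a guarantee.
    """
    prefix, collection, code = doi.split("/", 2)
    match = re.fullmatch(r"([A-Za-z]+)(\d\S*)", code)
    if match is None:
        raise ValueError(f"cannot split specimen code: {code!r}")
    return prefix, collection, match.group(1), match.group(2)


# "10.12345" is a made-up prefix used only for this example.
doi = specimen_doi("10.12345", "MAL", "ANSP", "332467")
print(doi)                      # 10.12345/MAL/ANSP332467
print(split_specimen_doi(doi))  # ('10.12345', 'MAL', 'ANSP', '332467')
```

The parser reflects the caveat above: the embedded "ANSP" tells us where to start looking, not where the specimen necessarily lives today.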

As I've argued before, we could centralise the minting of these identifiers using GBIF, but do it in such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to do this). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged.
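The hand-over idea can be illustrated with a toy registry. This is a sketch under stated assumptions, not how the DOI system is actually implemented: the class and method names are hypothetical, the prefix "10.12345" is made up, and the URLs use example.org as stand-ins. The point it shows is that the manager of a prefix and the target URL can both change while the DOI that users cite stays the same.

```python
from dataclasses import dataclass, field


@dataclass
class DoiRegistry:
    """Toy model of a resolver: who manages each prefix, and where each DOI points."""

    managers: dict = field(default_factory=dict)  # DOI prefix -> managing organisation
    targets: dict = field(default_factory=dict)   # full DOI -> current URL

    def point(self, doi: str, url: str, manager: str) -> None:
        """Create or update the URL for a DOI; only the prefix's manager may do so."""
        prefix = doi.split("/", 1)[0]
        # The first organisation to use a prefix becomes its manager.
        current = self.managers.setdefault(prefix, manager)
        if current != manager:
            raise PermissionError(f"{manager} does not manage prefix {prefix}")
        self.targets[doi] = url

    def transfer(self, prefix: str, new_manager: str) -> None:
        """Hand responsibility for a prefix to another organisation."""
        self.managers[prefix] = new_manager

    def resolve(self, doi: str) -> str:
        """Return the URL the DOI currently points to."""
        return self.targets[doi]


registry = DoiRegistry()
doi = "10.12345/MAL/ANSP332467"  # made-up prefix, for illustration only

# Initially GBIF manages the prefix on the institution's behalf.
registry.point(doi, "https://example.org/gbif/ANSP332467", manager="GBIF")

# Later the institution takes over and repoints the same DOI.
registry.transfer("10.12345", "ANSP")
registry.point(doi, "https://example.org/ansp/332467", manager="ANSP")

print(registry.resolve(doi))  # https://example.org/ansp/332467
```

Anyone who recorded the DOI before the hand-over resolves it in exactly the same way afterwards.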

I'm hoping we'll make some progress on this in the coming months...