Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two.
Cheap and resolvable means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do.
Cheap and persistent means things like Darwin Core triplets or URNs. You can write things on paper and they will persist (the Biodiversity Heritage Library shows us that), but how in the digital era do we do anything with this? If it's not resolvable, what exactly is the point? We tried URNs, even ones that were resolvable (LSIDs), and that was a disaster (we learnt a lot, but what a mess).
Resolvable and persistent. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by CrossRef and DataCite, and will also play ball with linked data folks.
In practical terms, what if we had a convention that each collection gets its own DOI prefix "10.nnnnn", to which it appends whatever specimen identifier makes sense (and is unique within that collection)?
The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in BHL and GBIF as biomedical databases).
If we wrote this as a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:
10.nnnnn/MAL/ANSP332467
where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding, so if the specimen was moved from its current collection to a new institution, people wouldn't get too upset by the presence of "ANSP". It would also help make the links between specimen codes and DOIs. We couldn't rely on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences' malacological collection, but it would be a good place to start looking.
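To make the convention concrete, here is a minimal sketch in Python of how such a DOI could be assembled from a specimen code. The function names, the placeholder prefix 10.99999 (standing in for "10.nnnnn"), and the prefix check are all assumptions for illustration; nothing here is an agreed scheme, and the prefix pattern is a simplification of real DOI syntax.

```python
import re


def specimen_doi(prefix: str, collection: str, specimen_code: str) -> str:
    """Build '<doi prefix>/Collection/InstitutionCatalogue number'.

    `specimen_code` is the familiar "Institution Catalogue number"
    string, e.g. "ANSP 332467". Whitespace is dropped so the suffix
    reads "ANSP332467", as in the example in the text.
    """
    # Simplified sanity check (assumption): "10." followed by digits.
    if not re.fullmatch(r"10\.\d{4,}", prefix):
        raise ValueError(f"not a plausible DOI prefix: {prefix!r}")
    suffix = re.sub(r"\s+", "", specimen_code)
    if not suffix:
        raise ValueError("empty specimen code")
    return f"{prefix}/{collection.upper()}/{suffix}"


def doi_url(doi: str) -> str:
    """Turn a DOI into a link via the public doi.org resolver."""
    return f"https://doi.org/{doi}"


# 10.99999 is a made-up placeholder prefix, not a registered one.
doi = specimen_doi("10.99999", "MAL", "ANSP 332467")
print(doi)
print(doi_url(doi))
```

The only real design choice here is that the human-readable specimen code survives more or less intact inside the DOI suffix, which is what makes the identifier look familiar.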
As I've argued before, we could centralise the minting of these identifiers using GBIF, but do it in such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to take this over). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged.
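The indirection that makes this hand-over invisible to users can be sketched as a toy in-memory registry. This is an illustration under stated assumptions only: `DoiRegistry`, its methods, and the example.org URLs are hypothetical, and real DOI registration goes through agencies such as CrossRef or DataCite rather than anything like this class.

```python
from dataclasses import dataclass


@dataclass
class Record:
    target_url: str  # where the DOI currently resolves
    manager: str     # who maintains the record right now


class DoiRegistry:
    """Toy stand-in for a DOI resolver (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def mint(self, doi: str, target_url: str, manager: str = "GBIF") -> None:
        if doi in self._records:
            raise ValueError(f"{doi} already minted")
        self._records[doi] = Record(target_url, manager)

    def transfer(self, doi: str, new_manager: str, new_target_url: str) -> None:
        # Management and target change; the DOI string itself never does.
        record = self._records[doi]
        record.manager = new_manager
        record.target_url = new_target_url

    def resolve(self, doi: str) -> str:
        return self._records[doi].target_url

    def manager_of(self, doi: str) -> str:
        return self._records[doi].manager


registry = DoiRegistry()
doi = "10.99999/MAL/ANSP332467"  # placeholder prefix, not a real one
registry.mint(doi, "https://gbif.example.org/specimen/ANSP332467")
registry.transfer(doi, "ANSP", "https://museum.example.org/mal/332467")
print(doi, "->", registry.resolve(doi))
```

Anyone who cited the DOI before the transfer still holds a working identifier afterwards; only the record behind it has changed.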
I'm hoping we'll make some progress on this in the coming months...
Quick thoughts on specimen identifiers
Labels: CrossRef, DataCite, DOI, identifiers, specimen codes, specimens
EOL Computable Data Challenge community
Now we are awash in challenges! EOL has announced its Computable Data Challenge:
We invite ideas for scientific research projects that use EOL, including the Biodiversity Heritage Library (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project. We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.
Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a Computable Data Challenge community to discuss the challenge.
It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long term future.
BHL and GBIF as biomedical databases
When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research. Again, great stuff, but aren't museums simply full of dead stuff that people have collected and forgotten about?
But BHL has a lot more post-1923 content than I suspect most people realise (several museum or society journals have 21st century issues in BHL's archives, for example). Continuing the theme of linking BHL and GBIF content, as part of a forthcoming project on taxonomic names (to be made available "real soon now") I stumbled across this 1976 paper in BHL (now in BioStor):
Monograph on "Lithoglyphopsis" aperta, the snail host of Mekong River Schistosomiasis by Davis et al.

This paper has been indexed in PubMed (PMID:948206), but as far as I'm aware, BHL (and BioStor) has the only digital copy of this paper. (As a side note, wouldn't it be great if PubMed could link to BHL content?)
The article page in BioStor shows a map derived from the OCR text, showing two localities:

Below the map are the specimen codes I've automatically extracted from the OCR text, linked to the corresponding records in GBIF, which are georeferenced (e.g., ANSP Malacology 330925).
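As a rough illustration of this kind of extraction (not the extractor BioStor actually uses, which isn't described here), a simple regular expression can pull "Institution Catalogue number" codes out of OCR text. The pattern, the `KNOWN_ACRONYMS` set, and the sample sentence are all assumptions made for the sketch.

```python
import re

# Hypothetical whitelist of institution acronyms; a real list would be
# far longer and would come from a curated registry.
KNOWN_ACRONYMS = {"ANSP"}

# Two to six capitals, whitespace, then at least three digits.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})\s+(\d{3,})\b")


def extract_specimen_codes(text: str) -> list[tuple[str, str]]:
    """Return (institution, catalogue number) pairs found in `text`."""
    return [
        (institution, number)
        for institution, number in SPECIMEN_CODE.findall(text)
        if institution in KNOWN_ACRONYMS
    ]


# Invented sample text, loosely modelled on a specimen listing.
sample = "Holotype ANSP 330925; paratypes ANSP 332467 and USNM 12345."
print(extract_specimen_codes(sample))
```

Filtering on known acronyms keeps the pattern from picking up any capitalised abbreviation that happens to be followed by a number; OCR errors inside the codes themselves would need extra handling that this sketch leaves out.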
If we joined these things up just a little more, we could do some useful things. For example, what if a researcher searching in PubMed for schistosomiasis in South East Asia could find the Davis et al. paper, and then go to BHL or BioStor to read it? What if a researcher looking at gastropod distributions in the Mekong River in the GBIF portal could see that BHL had publications on diseases associated with these organisms (as well as their taxonomy and biology)? We could also traverse the link from GBIF to BHL to PubMed and provide a direct route from distribution maps to biomedical literature.
It seems there's scope for trying to connect BHL, GBIF, and PubMed, and that BHL and GBIF may have important roles to play in providing access to basic information about organisms that have a serious impact on human populations.
Labels: BHL, biomedical, GBIF, linking, Mekong River Schistosomiasis, PubMed, schistosomiasis