
Duplicate DOIs for the same article: alias DOIs, who knew?

As part of a project to map taxonomic citations to bibliographic identifiers I'm tackling strings like this (from the ION record urn:lsid:organismnames.com:name:1405511 for Pseudomyrmex crudelis):

<tdwg_co:PublishedIn>
Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, 126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]
</tdwg_co:PublishedIn>

I parse the string into its components (e.g., journal, volume, issue, pagination) and use scripts to locate identifiers such as DOIs. I regard DOIs as the gold standard for bibliographic identifiers. They are (usually) unique, and CrossRef provides some really useful services to support them (DOIs now also support linked data, if you are into that sort of thing). Occasionally there are problems, such as duplicate DOIs when material moves from a publisher's site to, say, JSTOR. And some publishers have a really bad habit of releasing DOIs that don't resolve. For example, Taylor & Francis Online have at least 18,000 DOIs for the Annals and Magazine of Natural History that don't resolve (e.g., doi:10.1080/00222933809512318 for this paper).
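To make the parsing step concrete, here is a minimal sketch in Python. It is not the script I actually use: the regular expression and the function name `parse_citation` are assumptions made up for this example, and the pattern only handles strings laid out like the one above (title, journal, volume(issue), date, page range).

```python
import re

# Pattern for "PublishedIn" strings of the form
# "<title>. <journal>, <volume>(<issue>), <date>: <start>-<end>. ..."
# The layout is an assumption based on the single example in this post.
CITATION = re.compile(
    r"^(?P<title>.+)\.\s+"
    r"(?P<journal>[^,.]+),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>[^)]+)\))?,\s+"
    r"(?P<date>[^:]+):\s+"
    r"(?P<spage>\d+)-(?P<epage>\d+)"
)


def parse_citation(text):
    """Return a dict of citation components, or None if the string does not match."""
    match = CITATION.search(text.strip())
    return match.groupdict() if match else None


example = (
    "Systematics, biogeography and host plant associations of the "
    "Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and "
    "Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, "
    "126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]"
)
parts = parse_citation(example)
print(parts["journal"], parts["volume"], parts["issue"], parts["spage"])
```

The journal, volume, and starting page are the kind of fields a DOI lookup needs. Real citation strings are far messier than this one, so a single pattern like this would need plenty of fallbacks.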

Sometimes my automated scripts for finding DOIs fail and I have to resort to Googling. To my surprise, I found two versions of the paper "Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants", each with a different DOI:

doi:10.1006/zjls.1998.0158 (Elsevier)
doi:10.1111/j.1096-3642.1999.tb00157.x (Wiley)


Now, this isn't supposed to happen. Interestingly, if you resolve doi:10.1006/zjls.1998.0158, either on the web or using CrossRef's OpenURL resolver, you get the page/metadata for doi:10.1111/j.1096-3642.1999.tb00157.x.

To see what was going on I fired up my local installation of Tony Hammond's OpenHandle tool (see http://bioguid.info/openhandle/), entered the Elsevier DOI (10.1006/zjls.1998.0158), and got this:


{
  "comment" : "OpenHandle (JSON) - see http://code.google.com/p/openhandle/" ,
  "handle" : "hdl:10.1006/zjls.1998.0158" ,
  "handleStatus" : {
    "code" : "1" ,
    "message" : "SUCCESS"
  } ,
  "handleValues" : [
    {
      "index" : "100" ,
      "type" : "HS_ADMIN" ,
      "data" : {
        "adminRef" : "hdl:10.1006/zjls.1998.0158?index=100" ,
        "adminPermission" : "111111110111"
      } ,
      "permission" : "1110" ,
      "ttl" : "+86400" ,
      "timestamp" : "Thu Apr 13 19:09:03 BST 2000" ,
      "reference" : []
    } ,
    {
      "index" : "1" ,
      "type" : "URL" ,
      "data" : "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583" ,
      "permission" : "1110" ,
      "ttl" : "+86400" ,
      "timestamp" : "Tue Aug 12 16:43:12 BST 2003" ,
      "reference" : []
    } ,
    {
      "index" : "700050" ,
      "type" : "700050" ,
      "data" : "20030811104844000" ,
      "permission" : "1110" ,
      "ttl" : "+86400" ,
      "timestamp" : "Tue Aug 12 16:43:16 BST 2003" ,
      "reference" : []
    } ,
    {
      "index" : "1970" ,
      "type" : "HS_ALIAS" ,
      "data" : "10.1111/j.1096-3642.1999.tb00157.x" ,
      "permission" : "1110" ,
      "ttl" : "+86400" ,
      "timestamp" : "Mon Aug 25 21:06:50 BST 2008" ,
      "reference" : []
    }
  ]
}

The interesting bit is the "HS_ALIAS" at the bottom. I'd not come across this before, although it's in the spec (RFC 3651) for all to see (yeah, but who reads those?). The handle system that underlies DOIs has a mechanism to support aliases, so that a DOI that originally pointed to a web page (say, for an article) can be redirected to point to another DOI. In this case, the Elsevier DOI redirects to the Wiley DOI ("10.1111/j.1096-3642.1999.tb00157.x" in the HS_ALIAS section), so the user ends up at Wiley's page for this article, not Elsevier's. This provides a way to accommodate changes in article ownership without requiring the new publisher to reuse the previous publisher's DOI.
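Spotting the alias in a record like the one above is easy to script. This is a minimal sketch that assumes the handle record has already been fetched and has the same JSON shape as the OpenHandle output; the function name `find_alias` is invented for the example.

```python
import json


def find_alias(record):
    """Return the aliased handle from an OpenHandle-style record, or None."""
    for value in record.get("handleValues", []):
        if value.get("type") == "HS_ALIAS":
            return value.get("data")
    return None


# Trimmed copy of the record shown above (only the fields used here).
record = json.loads("""
{
  "handle": "hdl:10.1006/zjls.1998.0158",
  "handleValues": [
    {"index": "1", "type": "URL",
     "data": "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583"},
    {"index": "1970", "type": "HS_ALIAS",
     "data": "10.1111/j.1096-3642.1999.tb00157.x"}
  ]
}
""")
alias = find_alias(record)
print(alias)
```

A record with no HS_ALIAS value simply returns None, which is the normal case for most DOIs.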

In one sense this seems to defeat the point of DOIs, namely that they are effectively opaque identifiers that any publisher should be able to host. Perhaps in this case the issue is that the DOI prefix ("10.1006" and "10.1111" for Elsevier and Wiley, respectively) corresponds to a publisher, and when something goes wrong with a DOI it's easier to identify who is responsible based on this prefix, rather than the individual DOI.

In any event, next time I come across a duplicate DOI I'll need to check whether it is an alias of another DOI before launching into another rant about the (occasional) failings of DOIs.
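That check could look something like the sketch below. It assumes a caller-supplied `lookup_alias` function (hypothetical; in practice it would query a handle resolver and look for an HS_ALIAS value) and follows alias links until it reaches a DOI with no alias. The hop limit and loop detection are my own precautions, not something the handle spec requires of clients.

```python
def canonical_doi(doi, lookup_alias, max_hops=10):
    """Follow alias links until reaching a DOI that has no alias.

    lookup_alias is a caller-supplied (hypothetical) function returning the
    alias target for a DOI, or None if the DOI has no HS_ALIAS value.
    """
    seen = {doi}
    for _ in range(max_hops):
        target = lookup_alias(doi)
        if target is None:
            return doi
        if target in seen:
            raise ValueError(f"alias loop detected at {target}")
        seen.add(target)
        doi = target
    raise ValueError("too many alias hops")


def same_article(doi_a, doi_b, lookup_alias):
    """True if both DOIs end up at the same canonical DOI."""
    return canonical_doi(doi_a, lookup_alias) == canonical_doi(doi_b, lookup_alias)


# Stand-in lookup table built from the example in this post.
ALIASES = {"10.1006/zjls.1998.0158": "10.1111/j.1096-3642.1999.tb00157.x"}
is_duplicate = same_article(
    "10.1006/zjls.1998.0158",
    "10.1111/j.1096-3642.1999.tb00157.x",
    ALIASES.get,
)
print(is_duplicate)
```

Passing the lookup in as a function keeps the alias-chasing logic testable without touching the network.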

Suggested apps for BHL's Life and Literature Code Challenge


Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17), so having ideas now isn't terribly helpful, I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by its absence is a prize for winning the challenge. PLoS and Mendeley have an "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the latter, and the images were tagged with what they depict, you could create a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine: harvest images, tag them with PageID and taxon names, and upload them to Flickr).
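As a rough sketch of the caption-indexing idea, the Python below scans OCR text line by line for caption-like lines and records which pages mention which names. Everything here is an assumption for illustration: the caption pattern, the function name `index_captions`, the page IDs, and the OCR snippets are all invented, and the taxon names are supplied by hand rather than found by a name-finding service.

```python
import re
from collections import defaultdict

# Matches the start of a caption such as "Fig. 1", "Figure 12" or "Plate IV".
CAPTION = re.compile(
    r"^\s*(?:Fig\.|FIG\.|Figure|FIGURE|Plate|PLATE|Pl\.)\s*(?:\d+|[IVXLC]+)\b"
)


def index_captions(pages, taxon_names):
    """Map each taxon name to the page IDs whose captions mention it.

    pages: iterable of (page_id, ocr_text) pairs.
    taxon_names: names to look for, supplied by the caller.
    """
    index = defaultdict(list)
    for page_id, text in pages:
        for line in text.splitlines():
            if not CAPTION.match(line):
                continue  # not a caption line
            for name in taxon_names:
                if name.lower() in line.lower() and page_id not in index[name]:
                    index[name].append(page_id)
    return dict(index)


# Invented OCR snippets and page IDs, purely for illustration.
pages = [
    (101, "Pseudomyrmex crudelis occurs in Triplaris.\n"
          "Fig. 1. Pseudomyrmex crudelis, worker, lateral view."),
    (102, "Plate IV. Pseudomyrmex viduus, queen."),
    (103, "No captions here, though Pseudomyrmex viduus is mentioned."),
]
names = ["Pseudomyrmex crudelis", "Pseudomyrmex viduus"]
caption_index = index_captions(pages, names)
print(caption_index)
```

For plates, the search results would want to show the pages either side of each hit as well, since the caption and image may sit on facing pages.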

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites; for example, Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content poses some challenges, but viewing it is fundamentally the same as viewing PDFs: you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple user interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned-up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would be great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent of the Google Books Ngram Viewer.
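The counting behind such a timeline is simple. The sketch below assumes we already have (name, year) pairs, one per page on which a name was found; how those pairs would be extracted from BHL is left open. The function name `name_timeline`, the placeholder names, and the years are all invented for the example.

```python
from collections import Counter


def name_timeline(occurrences, names):
    """Count pages per year for each name of interest.

    occurrences: iterable of (name, year) pairs, one per page on which the
    name was found.
    """
    timeline = {name: Counter() for name in names}
    for name, year in occurrences:
        if name in timeline:
            timeline[name][year] += 1
    return timeline


# Invented data: two placeholder names (say, a name and its synonym).
occurrences = [
    ("Aus bus", 1900), ("Aus bus", 1900), ("Aus bus", 1950),
    ("Xus bus", 1950), ("Cus dus", 1950),
]
timeline = name_timeline(occurrences, ["Aus bus", "Xus bus"])
for name, counts in timeline.items():
    print(name, sorted(counts.items()))
```

Plotting the per-year counts for two synonyms side by side would show when one name started to replace the other.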

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.


Tree of Life 0.1 - annotating the NCBI taxonomy

Last week I was at the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab, run by KnowInnovation.com. It was an interesting experience, essentially a structured week of brainstorming ideas.

One thing I came away with is the feeling that our notions of the "tree of life" are fuzzy, contradictory, and often probably unobtainable. It's tempting to imagine all sorts of wonderful visualisations, and lose sight of building something that is useful. Perhaps it's time instead to think of "Tree of Life version 0.1".

Imagine taking the NCBI taxonomy as a starting point. Yes it's incomplete, and has almost no fossils, but it's freely available and linked to a lot of data. Let's use a Google Maps-like viewer along the lines I explored earlier this year.

Then add annotation "tracks" to the tips. As a first pass these could be taken from the NCBI LinkOut service, such as the NCBI-Wikipedia mapping http://iphylo.org/linkout.


The NCBI tree is a classification rather than a phylogeny, so we could add greater phylogenetic content by linking to phylogenetic databases, such as TreeBASE and PhyLoTA. Imagine clicking on a node in the NCBI taxonomy and seeing a display of all the phylogenies centred on that node.

Now we have a way to navigate a large tree, view annotations, and display phylogenetic trees. All of this could be done fairly easily. The key is to have services keyed by the NCBI tax_id used to identify nodes on the tree.
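To show what "keyed by tax_id" might mean in practice, here is a minimal in-memory sketch. The class `TrackStore`, the track names, and the phylogeny identifier are hypothetical; the only real detail is that 9606 is the NCBI tax_id for Homo sapiens. An actual service would sit behind a web API rather than a Python dict.

```python
from collections import defaultdict


class TrackStore:
    """Annotation tracks keyed by NCBI tax_id."""

    def __init__(self):
        # track name -> tax_id -> list of annotations
        self._tracks = defaultdict(lambda: defaultdict(list))

    def add(self, track, tax_id, annotation):
        """Attach one annotation to a node on the named track."""
        self._tracks[track][tax_id].append(annotation)

    def for_node(self, tax_id):
        """Return {track name: annotations} for one node of the tree."""
        return {
            track: list(by_id[tax_id])
            for track, by_id in self._tracks.items()
            if tax_id in by_id
        }


store = TrackStore()
store.add("wikipedia", 9606, "Human")
# Placeholder identifier, not a real TreeBASE or PhyLoTA record.
store.add("phylogenies", 9606, "hypothetical-study-1")
node_annotations = store.for_node(9606)
print(node_annotations)
```

Because every track uses the same key, adding a new kind of annotation (habitat, geography, hosts) is just another track name, and the viewer only ever needs to ask "what do we have for this tax_id?"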

Among the next steps would be to add additional "tracks", perhaps based on curated links analogous to the wiki-based NCBI-Wikipedia mapping. For example, very basic habitat data (marine or terrestrial) could be added, or geography, or host relationships (could be based in part on the data already in GenBank).

Given that the NCBI tree continues to grow, subsequent versions could be released as the tree changes. Or we could "fork" the NCBI tree and start to refine it based on phylogenetic information, and add taxa that aren't in the genome databases (these taxa will need consistent identifiers so we can map annotations on to them as well). Perhaps we could use something like Git to manage this tree, and to handle the necessary merging of updated versions of the NCBI tree. People could edit the tree, or indeed fork it and come up with their own.

There are lots of ways to visualise trees (see TreeVis.net for some great examples), but what I'm after is a tool that is useful, that gives us a sense of what we know and what we don't. I suspect that one of the reasons we've struggled with visualising the tree of life is that there are lots of different notions about what it's for. In this case, I want a tool to navigate data about organisms, one that we can easily add annotations to.