
Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, and there might be errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.
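To make the regular expression approach concrete, here is a minimal sketch in Python. The citation style ("Authors (Year) Title. Journal Volume: Start-End"), the pattern, and the sample strings are all made up for illustration; this is not the code used in BioStor, and a real parser would need one pattern per style.

```python
import re

# Hypothetical pattern for ONE citation style:
#   Authors (Year) Title. Journal Volume: Start-End
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+):\s*"
    r"(?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)\.?$"
)


def parse_citation(text):
    """Return a dict of citation parts, or None if the rules don't fit."""
    match = CITATION_RE.match(text.strip())
    return match.groupdict() if match else None


# A made-up citation that follows the style exactly.
good = parse_citation("Smith J (1999) An example title. Journal of Examples 12: 34-56")

# The same made-up citation in a slightly different style: the pattern fails.
bad = parse_citation("Smith J 1999. An example title. J. Examples 12, 34-56")
```

The second call returns `None`, which is the fragility described above: any departure from the expected style and the pattern simply fails to match.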

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
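The three steps can be sketched in Python as follows. This is a toy illustration under stated assumptions: the style templates, the 0.85 threshold, and the second reference are invented, and `difflib.SequenceMatcher` from the standard library stands in for whatever approximate string matching you prefer.

```python
from difflib import SequenceMatcher

# Step 1: a (tiny) database of known references. The first record uses the
# details of the paper discussed elsewhere on this page; the second is made up.
REFERENCES = {
    "pseudomyrmex": {
        "title": "Systematics, biogeography and host plant associations of the "
                 "Pseudomyrmex viduus group (Hymenoptera: Formicidae), "
                 "Triplaris- and Tachigali-inhabiting ants",
        "journal": "Zoological Journal of the Linnean Society",
        "volume": "126", "issue": "4", "year": "1999", "pages": "451-540",
    },
    "example": {
        "title": "An entirely made-up article title",
        "journal": "Journal of Examples",
        "volume": "12", "issue": "3", "year": "2001", "pages": "34-56",
    },
}

# Step 2: hypothetical templates standing in for "the most common styles".
STYLES = [
    "{title}. {journal}, {volume}({issue}), {year}: {pages}",
    "{title}. {journal} {volume}: {pages} ({year})",
    "{journal} {volume}: {pages}",  # a short "microcitation" form
]


def candidate_strings(ref):
    """Generate one citation string per style for a reference."""
    return [style.format(**ref) for style in STYLES]


def normalise(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def best_match(query, references, threshold=0.85):
    """Step 3: return (reference id, score), with id None below threshold."""
    best_id, best_score = None, 0.0
    q = normalise(query)
    for ref_id, ref in references.items():
        for candidate in candidate_strings(ref):
            matcher = SequenceMatcher(None, q, normalise(candidate), autojunk=False)
            score = matcher.ratio()
            if score > best_score:
                best_id, best_score = ref_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# A microcitation with an OCR-style error ("1" for "l") still finds its reference.
ref_id, score = best_match("Zoologica1 Journal of the Linnean Society 126: 451-540",
                           REFERENCES)
```

Comparing the query against every candidate like this is a brute-force scan, so for a large universe of strings some kind of index (for example on character n-grams) would probably be needed to narrow the candidates first.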

Might not be elegant, but I suspect it would be fast.

More BHL app ideas

Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that this is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

[Image: N2 w1150]

which gives you no idea that it contains images like this:

[Image: n24_w1150]

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter, @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in MediaWiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
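As a rough sketch of the second option: the linking site builds an OpenURL that carries the address of the page the link sits on, and the receiving side pulls that address back out of the request. OpenURL 1.0 has a referrer key, `rfr_id`, but using a plain page URL as its value is an assumption here, as are the resolver address and the Wikispecies page in the example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed resolver address, for illustration only.
BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"


def make_openurl(journal, volume, spage, year, referring_page):
    """Build an OpenURL link that records which page it came from."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft.jtitle": journal,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": year,
        "rfr_id": referring_page,  # assumption: the originating page's URL
    }
    return BHL_OPENURL + "?" + urlencode(params)


def inbound_link(request_url):
    """On the receiving side, recover the originating page (or None)."""
    query = parse_qs(urlparse(request_url).query)
    return query.get("rfr_id", [None])[0]


# Hypothetical Wikispecies page linking to an article in BHL.
wiki_page = "http://species.wikimedia.org/wiki/Pseudomyrmex"
link = make_openurl("Zoological Journal of the Linnean Society",
                    "126", "451", "1999", wiki_page)
```

A MediaWiki template could generate the same query string, filling in the referring page from the wiki's own page URL.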





Duplicate DOIs for the same article: alias DOIs, who knew?

As part of a project to map taxonomic citations to bibliographic identifiers I'm tackling strings like this (from the ION record urn:lsid:organismnames.com:name:1405511 for Pseudomyrmex crudelis):

<tdwg_co:PublishedIn>
Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, 126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]
</tdwg_co:PublishedIn>

I parse the string into its components (e.g., journal, volume, issue, pagination) and use scripts to locate identifiers such as DOIs. I regard DOIs as the gold standard for bibliographic identifiers. They are (usually) unique, and CrossRef provides some really useful services to support them (DOIs now also support linked data if you are into that sort of thing). Occasionally there are problems, such as duplicate DOIs when material moves from a publisher's site to, say, JSTOR. And some publishers are really, really bad about releasing DOIs that don't resolve. For example, Taylor & Francis Online have at least 18,000 DOIs for the Annals and Magazine of Natural History that don't resolve (e.g., doi:10.1080/00222933809512318 for this paper).
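For illustration, here is one way the string above could be split into its components in Python. The pattern is a sketch written to fit this one example ("Title. Journal, Volume(Issue), Month Year: Start-End. Page [...]"); it is not the script actually used, and other ION records may well follow different layouts.

```python
import re

PUBLISHED_IN = (
    "Systematics, biogeography and host plant associations of the "
    "Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and "
    "Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, "
    "126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]"
)

# Hypothetical pattern for this layout only.
ION_RE = re.compile(
    r"^(?P<title>.+?)\.\s+"                             # title up to first full stop
    r"(?P<journal>[^.]+?),\s+"                          # journal name
    r"(?P<volume>\d+)\((?P<issue>[^)]+)\),\s+"          # volume(issue)
    r"(?:(?P<month>[A-Za-z]+)\s+)?(?P<year>\d{4}):\s+"  # optional month, year
    r"(?P<spage>\d+)-(?P<epage>\d+)\."                  # page range
    r"(?:\s+(?P<page>\d+))?"                            # page the name appears on
)


def parse_published_in(text):
    """Return the citation components as a dict, or None if no match."""
    match = ION_RE.match(text.strip())
    return match.groupdict() if match else None


parts = parse_published_in(PUBLISHED_IN)
```

The journal, volume, and starting page are the pieces that a DOI lookup would typically be built from.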

Sometimes my automated scripts for finding DOIs fail and I have to resort to Googling. To my surprise, I found two versions of the paper "Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants", each with a different DOI:
  1. doi:10.1006/zjls.1998.0158 (Elsevier)
  2. doi:10.1111/j.1096-3642.1999.tb00157.x (Wiley)

Now, this isn't supposed to happen. Interestingly, if you resolve doi:10.1006/zjls.1998.0158, either on the web or using CrossRef's OpenURL resolver, you get the page/metadata for doi:10.1111/j.1096-3642.1999.tb00157.x.

To see what was going on I fired up my local installation of Tony Hammond's OpenHandle tool (see http://bioguid.info/openhandle/) and entered the Elsevier DOI (10.1006/zjls.1998.0158) and got this:


{
  "comment": "OpenHandle (JSON) - see http://code.google.com/p/openhandle/",
  "handle": "hdl:10.1006/zjls.1998.0158",
  "handleStatus": {
    "code": "1",
    "message": "SUCCESS"
  },
  "handleValues": [
    {
      "index": "100",
      "type": "HS_ADMIN",
      "data": {
        "adminRef": "hdl:10.1006/zjls.1998.0158?index=100",
        "adminPermission": "111111110111"
      },
      "permission": "1110",
      "ttl": "+86400",
      "timestamp": "Thu Apr 13 19:09:03 BST 2000",
      "reference": []
    },
    {
      "index": "1",
      "type": "URL",
      "data": "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583",
      "permission": "1110",
      "ttl": "+86400",
      "timestamp": "Tue Aug 12 16:43:12 BST 2003",
      "reference": []
    },
    {
      "index": "700050",
      "type": "700050",
      "data": "20030811104844000",
      "permission": "1110",
      "ttl": "+86400",
      "timestamp": "Tue Aug 12 16:43:16 BST 2003",
      "reference": []
    },
    {
      "index": "1970",
      "type": "HS_ALIAS",
      "data": "10.1111/j.1096-3642.1999.tb00157.x",
      "permission": "1110",
      "ttl": "+86400",
      "timestamp": "Mon Aug 25 21:06:50 BST 2008",
      "reference": []
    }
  ]
}

The interesting bit is the "HS_ALIAS" at the bottom. I'd not come across this before, although it's in the spec (RFC 3651) for all to see (yeah, but who reads those?). The handle system that underlies DOIs has a mechanism to support aliases, so that a DOI that originally pointed to a web page (say, for an article) can be redirected to point to another DOI. In this case, the Elsevier DOI redirects to the Wiley DOI ("10.1111/j.1096-3642.1999.tb00157.x" in the HS_ALIAS section), so the user ends up at Wiley's page for this article, not Elsevier's. This provides a way to accommodate changes in article ownership, without requiring an existing publisher to reuse the previous publisher's DOI.

In one sense this seems to defeat the point of DOIs, namely that they are effectively opaque identifiers that any publisher should be able to host. Perhaps in this case the issue is that the DOI prefix ("10.1006" and "10.1111" for Elsevier and Wiley, respectively) corresponds to a publisher, and when something goes wrong with a DOI it's easier to identify who is responsible based on this prefix, rather than the individual DOI.
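As a small illustration of the prefix idea, the part of a DOI before the first slash can be pulled out and looked up. The two-entry table below contains only the prefixes mentioned in this post; the function name and the table are hypothetical, and a real lookup would need a much fuller list.

```python
# Only the two prefixes discussed in this post; purely illustrative.
PREFIX_OWNERS = {"10.1006": "Elsevier", "10.1111": "Wiley"}


def doi_prefix(doi):
    """Return the prefix (the part before the first '/') of a DOI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError("not a DOI: %r" % doi)
    return prefix


old_owner = PREFIX_OWNERS.get(doi_prefix("doi:10.1006/zjls.1998.0158"))
new_owner = PREFIX_OWNERS.get(doi_prefix("10.1111/j.1096-3642.1999.tb00157.x"))
```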

In any event, next time I come across a duplicate DOI I'll need to check whether it is an alias of another DOI before launching into another rant about the (occasional) failings of DOIs.
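That check is easy to automate once you have a handle record in the JSON layout shown above. The sketch below assumes that layout (a "handleValues" list whose entries have "type" and "data" keys); the function name is made up, and the embedded record is a trimmed copy of the OpenHandle output rather than a live lookup.

```python
import json

# Trimmed copy of the OpenHandle output for the Elsevier DOI.
RECORD = json.loads("""
{
  "handle": "hdl:10.1006/zjls.1998.0158",
  "handleValues": [
    {"index": "1", "type": "URL",
     "data": "http://linkinghub.elsevier.com/retrieve/pii/S0024408298901583"},
    {"index": "1970", "type": "HS_ALIAS",
     "data": "10.1111/j.1096-3642.1999.tb00157.x"}
  ]
}
""")


def alias_target(record):
    """Return the DOI this handle is an alias of, or None if it has no alias."""
    for value in record.get("handleValues", []):
        if value.get("type") == "HS_ALIAS":
            return value.get("data")
    return None


target = alias_target(RECORD)
if target:
    print("%s is an alias of %s" % (RECORD["handle"], target))
```

If two DOIs turn up for the same article and one of them has an alias target equal to the other, they can be treated as the same identifier rather than as a genuine duplicate.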