Web Hooks and OpenURL: making databases editable

For me, one of the most frustrating things about online databases is that they often can't be edited. For example, I've recently created a version of the Australian Faunal Directory on CouchDB, which contains a list of all animals in Australia, and a fairly comprehensive bibliography of taxonomic publications on those animals. What I'd like to do is locate those publications online. Using various scripts I've found DOIs for some 2,500 articles and located nearly 4,900 articles in BHL, and added these to the database, but browsing the database (using, say, the quantum treemap interface) makes it clear there are lots of publications that I've missed.

It would be great if I could go to the Australian Faunal Directory on CouchDB and edit these on that site, but that would require making the data editable, and that means adding a user interface. And that's potentially a lot of work. Then, if I go to another database (say, my CouchDB version of the Catalogue of Life) and want to make that editable then I have to add an interface to that database as well. I could switch to using a wiki, which I've done for some projects (such as the NCBI to Wikipedia mapping), but wikis have their own issues (in particular, they don't easily support the kinds of queries I want to do).

There is, as they say, a third way: web hooks. I first came across web hooks when I discovered Post-Commit Web Hooks in Google Code. The idea is that you can create a web service that gets called every time you commit code to the Google Code repository. For example, each time you commit code you can call a web hook that uses the Twitter API to tweet details of what you just committed (I tried this for a while, until some of my Twitter followers got seriously pissed off by the volume of tweets this was generating).

What has this to do with making databases editable? Well, imagine the following scenario. A web page displays a publication, but no DOI. However, the web page embeds an OpenURL in the form of a COinS (in other words, a URL with key-value pairs describing the publication). If you use a tool such as the OpenURL Referrer in Firefox you can use an OpenURL resolver to find that publication. Examples of OpenURL resolvers include bioGUID and BioStor. Let's say you find the publication, and it has a DOI. How do you tell the database about this? Well, you can try and find an email address of someone running the database so you can send them the information, but this is a hassle. What if the OpenURL resolver that you used to find the DOI could automatically tell the source database that it's found the DOI? That's the idea behind web hooks.
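To make this concrete, here's a rough sketch (in Python, with invented bibliographic details) of the key-value pairs a COinS carries once you pull them out of the span's title attribute:

```python
from urllib.parse import parse_qs

# A hypothetical COinS payload: key-value pairs following the
# Z39.88-2004 KEV convention for a journal article. The article
# details here are made up for illustration.
coins = ("ctx_ver=Z39.88-2004"
         "&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal"
         "&rft.genre=article"
         "&rft.atitle=On+some+new+lizards"
         "&rft.jtitle=Annals+and+Magazine+of+Natural+History"
         "&rft.volume=16&rft.spage=386&rft.date=1885")

# parse_qs unescapes the values and maps each key to a list
fields = {key: values[0] for key, values in parse_qs(coins).items()}
print(fields["rft.atitle"])   # -> On some new lizards
```

An OpenURL resolver receives exactly this sort of string appended to its base URL, parses out the fields, and uses them to hunt for the article.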

I've started to experiment with this, and have most of the pieces working. Publication pages in Australian Faunal Directory on CouchDB have COinS that include two additional pieces of information: (1) the database identifier for the publication (in this case a UUID; in the hideously complex jargon of OpenURL this is the "Referring Entity Identifier"), and (2) the URL of the web hook. The idea is that an OpenURL resolver can take the OpenURL and try to locate the article. If it succeeds it will call the web hook URL supplied by the database, telling it "hey, I've found this DOI for the publication with this database identifier". The database can then update its data, so the next time a user visits the page for that publication, they will see the DOI. This has one huge advantage over tools that just modify the web page on the fly, such as David Shorthouse's reference parser: persistence. The database itself is updated, not just the web page.
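The resolver's side of this conversation is just an HTTP POST. Here's a minimal sketch in Python; the field names ("id" and "doi"), the hook URL, and the DOI are my own placeholders, not part of any agreed protocol:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_notification(hook_url, record_id, doi):
    """Build the POST request a resolver would send to a database's
    web hook once it has found a DOI for one of its records."""
    body = urlencode({"id": record_id, "doi": doi}).encode()
    return Request(hook_url, data=body, method="POST")

# Placeholder values: the hook URL and record UUID would come from
# the COinS, the DOI from a successful CrossRef lookup.
req = build_notification("http://example.org/afd/hook",
                         "a1b2c3d4-0000-0000-0000-000000000000",
                         "10.9999/example.doi")
print(req.get_method(), req.full_url)
# In a real resolver you'd now send it: urllib.request.urlopen(req)
```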

In order to make this work, all the database needs is a web hook, namely a URL that accepts POST requests. The heavy lifting of searching for the publication, or enabling users to correct and edit the data, can be devolved to a single place, namely the OpenURL resolver. As a first step I'm building an OpenURL resolver that displays a form in which the user can edit bibliographic details and launch searches in CrossRef (and soon BioStor). When the user is done they can close the form, which is when it calls the web hook with the edited data. The database can then choose to accept or reject the update.
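For what it's worth, the database's end really is tiny. Here's a sketch of a web hook in Python using only the standard library; the payload field names ("id" and "doi") are placeholders of my own, and a real implementation would write to the database rather than a dict:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# Updates received from the resolver, keyed by record UUID. A real
# database would queue these for review rather than trust them blindly.
updates = {}

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_qs(self.rfile.read(length).decode())
        record_id = fields.get("id", [""])[0]
        doi = fields.get("doi", [""])[0]
        if record_id and doi:
            updates[record_id] = doi      # accept the suggested DOI
            self.send_response(200)
        else:
            self.send_response(400)       # reject malformed notifications
        self.end_headers()

    def log_message(self, *args):
        pass   # silence per-request logging for this sketch

# To run the hook: HTTPServer(("", 8080), HookHandler).serve_forever()
```

The only contract is "accept a POST carrying the record identifier and the new data"; everything else (validation, moderation, actually updating CouchDB) is up to the database.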

Given that it's easy to create the web hook, and trivial to get a database to output an OpenURL with its internal identifier and the URL of the web hook, this seems like a light-weight way of making databases editable.

Quantum treemaps meet BHL and the Australian Faunal Directory

One of the things I'm enjoying about the Australian Faunal Directory on CouchDB is the chance to play with some ideas without worrying about breaking lots of code or, indeed, upsetting any users ('cos, let's face it, there aren't any). As a result, I can start to play with ideas that may one day find their way into other projects.

One of these ideas is to use quantum treemaps to display an author's publications. For example, below is a treemap showing publications by G A Boulenger in my Australian Faunal Directory on CouchDB project. The publications are clustered by journal. If a publication has been found in BioStor the treemap displays a thumbnail of that publication, otherwise it shows a white rectangle. At a glance we can see where the gaps are. You can view a publication's details simply by clicking on it.

boulenger.png

The entomologist W L Distant has a more impressive treemap, and clearly I need to find quite a few of his publications.
distant.png
I quite like the look of these, so I may think about adding this display to BioStor. I may also think about using treemaps in my ongoing iPad projects. If you want to see where I'm going with this then take a look at Good et al. A fluid treemap interface for personal digital libraries.

Notes
The quantum treemap is computed using some rather ugly PHP I wrote, based on this Java code. I've not implemented all the refinements of the original Java code, so the quantum treemaps I create are sometimes suboptimal. To avoid too much visual clutter I haven't drawn a border around each cell; instead I use CSS gradients to indicate the area of the cell (if you're using Internet Explorer the gradient will be vertical rather than going from top left to bottom right). The journal name is overlain on the cell contents, but if you are using a decent browser (i.e., not Internet Explorer) you can still click through this text to the underlying thumbnail because the text uses the CSS property
.overlay { pointer-events: none; }
I learnt this trick from the Stack Overflow question Click through div with an alpha channel.
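For the curious, the core "quantum" constraint is easy to state: every cell's dimensions must be whole multiples of a fixed thumbnail (quantum) size, so the thumbnails are never rescaled. Here's a toy illustration in Python; the full algorithm (from Bederson et al.'s ordered and quantum treemaps work) also packs the per-group cells into an overall rectangle, which this sketch skips:

```python
import math

def quantum_cell(n_items, quantum=(64, 80)):
    """Return (width, height) in pixels for a cell holding n_items
    thumbnails of the fixed quantum size, laid out on a near-square
    grid. The cell is always a whole multiple of the quantum."""
    cols = max(1, math.ceil(math.sqrt(n_items)))
    rows = math.ceil(n_items / cols)
    return cols * quantum[0], rows * quantum[1]

print(quantum_cell(10))  # 4x3 grid of 64x80 thumbnails -> (256, 240)
```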

The demise of phthiraptera.org and the perils of using Internet domain names as identifiers

When otherwise sensible technorati refer to "owning" a domain name, it makes me want to stick forks in my eyeballs. We do not "own" domain names. At best, we only lease them and there are manifold ways in which we could lose control of a domain name - through litigation, through forgetfulness, through poverty, through voluntary transfer, etc. Once you don't control a domain name anymore, then you can't control your domain-name-based persistent identifiers either. - Geoffrey Bilder interviewed by Martin Fenner
Geoffrey Bilder's comments about the unsuitability of URLs as long-term identifiers (as opposed, say, to DOIs) came to mind when I discovered that the domain phthiraptera.org is up for sale:

Snapshot 2011-01-14 07-47-39.png

This domain used to be home to a wealth of resources on lice (order Phthiraptera). I discovered that ownership of the domain had expired when a bunch of links to PDFs returned by an iSpecies search for Collodennyus all bounced to the holding page above. Phthiraptera.org was owned by the late Bob Dalgleish. After his death, ownership of the domain lapsed, and it's now up for sale. Although much of the content of Phthiraptera.org has been moved to phthiraptera.info, URLs containing phthiraptera.org still turn up in search results, especially ones that have been cached (for example, in iSpecies). Given that much of the content is still available the loss isn't total, but anyone relying on links containing phthiraptera.org to point to content (such as a PDF), or to identify that content (such as a publication) will find themselves in trouble. Although ideally Cool URIs don't change, in practice they do, and with alarming frequency. Furthermore, in this case, because ownership of phthiraptera.org has lapsed, there's no opportunity to create redirects from URLs with phthiraptera.org to the equivalent content in phthiraptera.info (leaving aside the issue that phthiraptera.info is not a mirror of phthiraptera.org, so exactly what the redirects would point to is unclear).

Identifiers based on domain names, such as URLs and LSIDs, are attractive because the DNS helps ensure global uniqueness and HTTP provides a way to resolve the identifier, but all this is contingent on the domain itself persisting. For more on this topic I recommend reading Martin Fenner's interview with CrossRef's Geoffrey Bilder, from which I took the opening quote.