Search this keyword

The Mendeley API Binary Battle - win $US 10,001

Now we'll bring the awesome. Mendeley have announced The Mendeley API Binary Battle, with a first prize of $US 10,0001, and some very high-profile judges (Juan Enriquez, Tim O'Reilly, James Powell, Werner Vogels, and John Wilbanks). Deadline for submission is August 31st 2011, with the results announced in October.

The criterion for judging are:
  1. How active is your application? We’ll look at your API key usage.

  2. How viral is the app? We’ll look at the number of sign ups on Mendeley and/or your application, and we’ll also have an eye on Twitter.

  3. Does the application increase collaboration and/or transparency? We’ll look at how much your application contributes to making science more open.

  4. How cool is your app? Does it make our jaws drop? Is it the most fun that you can have with your pants on? Is it making use of Facebook, Twitter, etc.?

  5. The Binary Battle is open to apps built previous to this announcement.


Start your engines...

Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature

Following on from my previous post on microcitations I've blasted all the citations in Nomenclator Zoologicus through my microcitation service and created a simple web site where these results can be browsed.

The web site is here: http://iphylo.org/~rpage/nz/.

To create it I've taken a file dump of Nomenclator Zoologicus provided by Dave Remsen and run all the citations through the microcitation service, storing the results in a simple database. You can search by genus name, author and year, or publication. The search is pretty crude, and in the case of publications can be a bit hit and miss. Citations in Nomenclator Zoologicus are stored as strings, so I've used some crude rules to try and extract the publication name from the rest of the details (such as page numbering).

To get started, you can look at names published by published by Distant in 1910, which you can see below:

Nz1

If the citation has been found you can click on the icon to view the page in a popup, like this:

Nz2

You can also click on the page number to be taken to that page in BHL.


I've also added some other links, such as to the name in the Index to Organism Names, as well as bibliographic identifiers such as DOIs, Handles, and links to JSTOR and CiNii.

So far only 10% of Nomenclator Zoologicus records have a match in BHL, which is slightly depressing. Browsing through there are some obvious gaps where my parser clearly failed, typically where multiple pages are included in the citation, or the citation has some additional comments. These could be fixed. There are also cases where the OCR text is so mangled that a match has been rejected because the genus name and text were too different.

This has been hastily assembled, but it's one vision of a simple service where we can go from genus name to being able to see the original publication of that name. There are other things we could do with this mapping, such as enabling BHL to tell users that the reference they are looking at is the original source of a particular name, and enabling services that use BHL content (such as EOL and Atlas of Living Australia to flag which reference in BHL is the one that matters in terms of nomenclature.

Microcitations: linking nomenclators to BHL

One of the challenges of linking databases of taxonomic names to the primary literature is the minimal citation style used by nomenclators (see my earlier post Nomenclators + digitised literature = fail).

For example, consider Nomenclator Zoologicus. Volumes 1-10 of this list of generic names in zoology were digitised in 2004 and put online by uBio (for more details of this project see Taxonomic informatics tools for the electronic Nomenclator Zoologicus, pmid:16501061). In Nomenclator Zoologicus the citation for the genus Abana is:

Ann. Mag. nat. Hist., (8) 2, 72.

The challenge is to link this short citation to the digital version of the corresponding article. I've been sitting on a copy of the digitised Nomenclator Zoologicus kindly provided by Dave Remsen, and I've finally started to look at the problem of mining it for links to databases such as BHL.

You can see the first attempt at http://biostor.org/microcitation.php. This form takes a genus name and the short citation and attempts to locate the corresponding page in BHL. It then checks whether the name is present on that page. Locating a page in a journal can be a challenge given the often rather ropey metadata in BHL, but BioStor uses a combination of fuzzy string matching and crude kludges to find the best match. But a further complication is that OCR errors may mean the taxonomic name we are looking for might not be detected on the page.

For example, if we search for the citation for the genus Aethriscus, Ann. Mag. nat. Hist., (7) 10, 329. we find two candidate pages in the journal Ann. Mag. nat. Hist, but neither contains the string "Aethriscus". However, if we use approximate string matching we find the OCR text for one page has the string "thriscus". This differs by only two characters from "Aethriscus", and so is a possible match (shown in orange).

2.png

Looking at the scanned page we can see the likely source of the problem:
3.png

In the original publication the name Aethriscus was written as Æthriscus. The ligature Æ has been corrupted by the OCR engine, and in Nomenclator Zoologicus the name is written without the ligature, hence the failure to exactly match the name with the text. These are some of the challenges faced when trying to close the circle and link names to literature.

The microcitation parser is still pretty crude, but usable. You can get results in either HTML or JSON, so the task of mapping microcitations to BHL pages can be automated. At present the name matching assumes you are looking at a single word (e.g., a genus), I need to extend it to handle binomials.