Search this keyword

Mendeley, BHL, and the "Bibliography of Life"

One of my hobby horses is the disservice taxonomic databases do their users by not linking to original scientific literature. Typically, taxonomic databases either don't cite primary literature, or regurgitate citations as cryptic text strings, leaving the user to try and find item being referred to. With the growing number of publishers that are digitising legacy literature and issuing DOIs, together with the Biodiversity Heritage Library's (BHL) enormous archive, there's really no excuse for this.

Taxonomic databases often cite references in abbreviated forms, or refer to individual pages, rather than citable units such as articles (see my Nomenclators + digitised literature = fail post for details). One way to translate these into links to articles would be to have a tool that could find a page within an article, or could match an abbreviated citation to a full one. This task would be relatively straightforward if we had the "bibliography of life," a freely accessible bibliography of every taxonomic paper ever published. Sadly, we don't...yet.

Bibliography of life

Mendeley is rapidly building a very large bibliography (although exactly how large is a matter of some dispute, see Duncan Hull's How many unique papers are there in Mendeley?), and I'm starting to explore using it as a way to store bibliographic details on a large scale. For example, an increasing number of smaller museum or society journals are putting lists of all their published articles on the web. Typically these are HTML pages rather than bibliographic data, but with a bit of scraping we can convert them to something useful, such as RIS format and import them in to Mendeley. I've started to do this, creating Mendeley groups for individual journals, e.g.:

These lists aren't necessarily complete nor error-free, but they contain the metadata for several thousand articles. If individual societies and museums made their list of publications freely available we would make significant progress towards building a bibliography of life. And with the social networking features of Mendeley, we could have groups of collaborators clean up any errors in the metadata.

Of course, this isn't the only way to do this. I suspect I'm rather atypical in building Mendeley groups containing articles from only one journal, as opposed to groups based on specific topics, and of course we could also tackle the problem by creating groups with a taxonomic focus (such as all taxonomic papers on amphibians). Furthermore, if and when more taxonomists join Mendeley and share their personal bibliographies, we will get a lot more relevant articles "for free." This is Mendeley's real strength in my opinion: it provides rich tools for users to do what they most want to do (manage their PDFs and cite them when they write papers), but off the back of that Mendeley can support larger tasks (in the same way that Flickr's ability to store geotagged photos has lead to some very interesting visualisations of aggregated data).

BioStor
cover.png
For some of the journals I've added to Mendeley I just have bibliographic data, the actual content isn't freely available on line, and in some cases isn't event digitised. But for some journals the content exists in BHL, it's "just" a matter of finding it. This is where my BioStor project comes in. For example, BHL has scanned most of the journal Spixiana. While BHL recognises individual volumes (see http://www.biodiversitylibrary.org/bibliography/40214) it has no notion of articles. To find these I scraped the tables of contents on the Spixiana web site and ran them through BioStor's OpenURL resolver. If you visit the BioStor page for the journal (http://biostor.org/issn/0341-8391) you will see that most of the articles have been identified in BHL, although there are a few holes that will need to be filled.
spixiana.png

These articles are listed in a Mendeley group for Spixiana, with the articles linked to BioStor wherever possible.

CiteBank and on not reinventing the wheel
If we were to use Mendeley as the primary platform for aggregating taxonomic publications, then I see this as the best way to implement "CiteBank". BHL have created CiteBank as an "an open access repository for biodiversity publications" using Drupal. Whatever one thinks of Drupal, bibliographic management is not an area where it shines. I think the taxonomic community should take a good look at Mendeley and ask themselves whether this is the platform around which they could build the bibliography of life.

Towards an interactive DjVu file viewer for the BHL

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself. Because it's just a bitmap image there's no obvious way to do simple things such as select and copy some text, click on some text and correct the OCR, or highlight some text as a taxonomic name or bibliographic citation. This is frustrating, and greatly limits what we can do with BHL's content.

In March I wrote a short post DjVu XML to HTML showing how to pull out and display the text boxes for a DjVu file. I've put this example, together with links to the XSLT file I use to do the transformation online at Display text boxes in a DjVu page. Here's an example, where each box (a DIV element) corresponds to a fragment of text extracted by OCR software.

boxes.png

The next step is to make this interactive. Inspired by Google's Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?), I've revisited this problem. One thing the Google PDF viewer does nicely is enable you to select a block of text from a PDF page, in the same way that you can in a native PDF viewer such as Adobe Acrobat or Mac OS X Preview. It's quite a trick, because Google is displaying a bitmap image of the PDF page. So, can we do something similar for DjVu?

The thing I'd like to do is something what is shown below — drag a "rubber band" on the page and select all the text that falls within that rectangle:

textselection.png

This boils down to knowing for each text box whether it is inside or outside the selection rectangle:

selection.png


Implementation

We could try and solve this by brute force, that is, query each text box on the page to see whether it overlaps with the selection or not, but we can make use of a data structure called an R-tree to speed things up. I stumbled across Jon-Carlos Rivera's R-Tree Library for Javascript, and so was inspired to try and implement DjVu text selection in a web browser using this technique.

The basic approach is as follows:

  1. Extract text boxes from DjVu XML file and lay these over the scanned page image.

  2. Add each text box to a R-tree index, together with the "id" attribute of the corresponding DIV on the web page, and the OCR text string from that text box.

  3. Track mouse events on the page, when the user clicks with the mouse we create a selection rectangle ("rubber band"), and as the mouse moves we query the R-tree to discover which text boxes have any portion of their extent within the selection rectangle.

  4. Text boxes in the selection have their background colour set to an semi-transparent shade of blue, so that the user can see the extent of the selected text. Boxes outside the selection are hidden.

  5. When the user releases the mouse we get the list of text boxes from the R-tree, and concatenate the text corresponding to each box, and finally display the resulting selection to the user.



Copying text

So far so good, but what can we do with the selected text? One obvious thing would be to copy and paste it (for example, we could select a species distribution and paste it into a text editor). Since all we've done is highlight some DIVs on a web page, how can we get the browser to realise that it has some text it can copy to the clipboard? After browsing Stack Overflow I came across this question, which gives us some clues. It's a bit of a hack, but behind the page image I've hidden a TEXTAREA element, and when the user has selected some text I populate the TEXTAREA with the corresponding text, then set the browser's selection range to that text. As a consequence, the browser's Copy command (⌘C on a Mac) will copy the text to the clipboard.

Demo

You can view the demo here. It only works in Safari and Chrome, I've not had a chance to address cross-browser compatibility. It also works in the iPad, which seems a natural device to support interactive editing and annotation of BHL text, but you need to click on the button On iPad click here to select text before selecting text. This is an ugly hack, so I need to give a bit more thought to how to support the iPad touch screen, while still enabling users to pan and zoom the page image.

Next steps

This is all very crude, but I think it shows what can be done. There are some obvious next steps:

  • Enable selected text to be edited so that we can correct the underlying OCR text.

  • Add tools that operate on the selected text, such as check whether it is a taxonomic name, or if it is a bibliographic citation we could attempt to parse it and locate it online (such as David Shorthouse's reference parser).

  • Select parts of the page image itself, so that we could extract a figure or map.

  • Add "post it note" style annotations.

  • Add services that store the edits and annotations, and display annotations made by others.


Lots to do. I foresee a lot of Javascript hacking over the coming weeks.

Scripting Life

Not really a blog post, more a note to self. If I ever did get around to writing a book again, I think Scripting Life would be a great title.