Search this keyword

Biodiversity Heritage Library viewer experiments

In between the chaos that is term-time I've been playing with ways to view Biodiversity Heritage Library content. The viewer is crude, and likely to go off-line at any moment while I fuss with it, the you can view an example here. This link takes you to a display of BHL item 19513, which is volume 12 of the Bulletin of the British Museum (Natural History) Entomology, which includes some striking insects, such as the species of Lopheuthymia displayed below:

bhlviewer.png


The viewer attempts to do several things:
  1. Display a BHL document in a way that I can quickly scan pages looking for article boundaries by showing thumbnails
  2. Provide a simple way for me to edit metadata (if you find the start of an article you can click on the is the first page in an article and edit the article details)
  3. Provide a RIS dump of the articles which can then be uploaded into other bibliographic tools

Note that on the left hand side I'm displaying the articles that have already been found in the volume. The editing interface is crude, and I'll need to look at user authentication and versioning to do this seriously, but it seems a quick way to annotate BHL content. Much of this needs only be done once, as once we have article boundaries then searching BHL journal content using, say, OpenURL becomes easy, and we can link bibliographic records in nomenclators to BHL. Improving the metadata will also improve visualisation such as my BHL timeline.

I hope to play a bit more with the view over the next days and weeks. It's pretty simple (Javascript with PHP back end). The key to creating the viewer was a complete dump of BHL's page metadata kindly provided by Mike Lichtenberg. I use this to locate individual page images stored by the Internet Archive, which I then store locally (and generate 100 pixel wide thumbnails).

Potentially there's a lot more one could add to a tool like this. I'm playing with displaying the taxon names found by uBio so that I can flag instances where the page is where the name is first published. One could imagine adding other flags, such as when a taxon is depicted so that we could easily extract images of taxa from BHL content. It would also be nice to be able to add taxon names that uBio's algorithms have missed. Utlimately, one could even display the OCR text and correct/annotate that. Much to do...

n-gram fulltext indexing in MySQL

Continuing with my exploration of the Biodiversity Heritage Library one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as:

Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl. Mus., 288: 177. Holotype: FMNH 69701, by original designation. Type locality: "in [Departamento] Huila, Colombia, at San Antonio, a small village 25 kilometers west of San Agustín, at 2,300 meters".


The journal that ASW abbreviates as "Bull. U.S. Natl. Mus." is in the BHL, which gives its title as "Bulletin - United States National Museum.". How do I link these two records? In my bioGUID OpenURL project I've been doing things like using SQL LIKE statements with periods (.) replaced by wildcards ('%') to find journal titles that match abbreviations (as well as building a database of these abbreviations). But this is error prone, and won't work for abbreviations such as "Bull. U.S. Natl. Mus." because the word "National" has been abbreviated to "Natl", which isn't a substring of "National".

After exploring various methods (including longest common subsequences, and sequence alignment algorithms) I came across a MySQL plugin for n-grams. The plugin tokenises strings into bi-grams (tokens with just two characters, see the Wikipedia page on N-grams for more information). This means that even though as words "National" and "Natl" are different, they will have some similarity due to the shared bi-grams "Na" and "at".

So, I grabbed the source for the plugin and the ICU dependency, compiled the plugin and added it to MySQL (I'm running MySQL 5.1.34 on Mac OS X 10.5.8). The plugin can be added while the MySQL server is running using this SQL command:

INSTALL PLUGIN bigram SONAME 'libftbigram.so';

Initial experiments seem promising. For the bhl_title table I created a bi-gram index:

ALTER TABLE `bhl_title` ADD FULLTEXT (`ShortTitle`) WITH PARSER bigram;

If I then take the abbreviation "Bull. U.S. Natl. Mus.", strip out the punctuation, and search for the resulting string ("Bull US Natl Mus")

SELECT TitleID, ShortTitle, MATCH(ShortTitle) AGAINST('Bull U S Natl Museum')
AS score FROM bhl_title
WHERE MATCH(ShortTitle) AGAINST('Bull U S Natl Museum') LIMIT 5;

I get this:
TitleIDShortTitle score
7548Bulletin - United States National Museum. 19.4019603729248
13855Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
14964Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
5943Bulletin du Muséum national d'histoire naturelle. 17.6493873596191
12908Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191


The journal we want is the top hit (if only just). I'll probably have to do some post-processing to check that the top hit makes sense (e.g., is it a supersequence of the search term?) but this looks like a promising way to match abbreviated journal names and book titles to records in BHL (and other databases).



Memcached, Mac OS X, and PHP

Thinking about ways to improve the performance of some of my web servers I've begun to toy with Memcached. These notes are to remind me how to set it up (I'm using Mac OS X 10.5, Apache 2 and PHP 5.2.10, as provided by Apple). Erik's blog post Memcached with PHP on Mac OS X has a step-by-step guide, based on the post Setup a Memcached-Enabled MAMP Sandbox Environment by Nate Haug, and I've basically followed the steps they outline.
  1. Install the Memcached service on Mac OS X: Follow the instructions in Nate Haug's post.

  2. Install Memcache PHP Extension: Apple's PHP doesn't come with the PECL package for memcache so download it. To compile it go:

    phpize
    ./configure
    make
    sudo make install

    One important point. If you are running 64-bit Mac OS X (as I am), ./configure by itself won't build a usable extension. However, a comment by Matt on Erik's original post provides the solution. Instead of just ./configure, type this at the command prompt:

    MACOSX_DEPLOYMENT_TARGET=10.5 CFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe -no-cpp-precomp" CCFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe" CXXFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe" LDFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -bind_at_load" ./configure


    Then follow the rest of Erik's instructions for adding the extension to your copy of PHP.

  3. Restart Apache: You can do this by restarting Web sharing in System preferences. Use the phpinfo(); command to check that the extension is working. You should see something like this:
    memcache.png

    If you don't see this, something's gone wrong. The Apache web log may help (for example, that's where I discovered that I had the problem reported by several people who commented on Erik's post.

  4. You can start the memcached daemon like this:

    # /bin/sh
    memcached -m 1 -l 127.0.0.1 -p 11211 -d

Now, I just need to explore how to actually use this...