Search this keyword

n-gram fulltext indexing in MySQL

Continuing with my exploration of the Biodiversity Heritage Library one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as:

Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl. Mus., 288: 177. Holotype: FMNH 69701, by original designation. Type locality: "in [Departamento] Huila, Colombia, at San Antonio, a small village 25 kilometers west of San Agustín, at 2,300 meters".


The journal that ASW abbreviates as "Bull. U.S. Natl. Mus." is in the BHL, which gives its title as "Bulletin - United States National Museum.". How do I link these two records? In my bioGUID OpenURL project I've been doing things like using SQL LIKE statements with periods (.) replaced by wildcards ('%') to find journal titles that match abbreviations (as well as building a database of these abbreviations). But this is error prone, and won't work for abbreviations such as "Bull. U.S. Natl. Mus." because the word "National" has been abbreviated to "Natl", which isn't a substring of "National".

After exploring various methods (including longest common subsequences, and sequence alignment algorithms) I came across a MySQL plugin for n-grams. The plugin tokenises strings into bi-grams (tokens with just two characters, see the Wikipedia page on N-grams for more information). This means that even though as words "National" and "Natl" are different, they will have some similarity due to the shared bi-grams "Na" and "at".

So, I grabbed the source for the plugin and the ICU dependency, compiled the plugin and added it to MySQL (I'm running MySQL 5.1.34 on Mac OS X 10.5.8). The plugin can be added while the MySQL server is running using this SQL command:

INSTALL PLUGIN bigram SONAME 'libftbigram.so';

Initial experiments seem promising. For the bhl_title table I created a bi-gram index:

ALTER TABLE `bhl_title` ADD FULLTEXT (`ShortTitle`) WITH PARSER bigram;

If I then take the abbreviation "Bull. U.S. Natl. Mus.", strip out the punctuation, and search for the resulting string ("Bull US Natl Mus")

SELECT TitleID, ShortTitle, MATCH(ShortTitle) AGAINST('Bull U S Natl Museum')
AS score FROM bhl_title
WHERE MATCH(ShortTitle) AGAINST('Bull U S Natl Museum') LIMIT 5;

I get this:
TitleIDShortTitle score
7548Bulletin - United States National Museum. 19.4019603729248
13855Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
14964Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
5943Bulletin du Muséum national d'histoire naturelle. 17.6493873596191
12908Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191


The journal we want is the top hit (if only just). I'll probably have to do some post-processing to check that the top hit makes sense (e.g., is it a supersequence of the search term?) but this looks like a promising way to match abbreviated journal names and book titles to records in BHL (and other databases).



Memcached, Mac OS X, and PHP

Thinking about ways to improve the performance of some of my web servers I've begun to toy with Memcached. These notes are to remind me how to set it up (I'm using Mac OS X 10.5, Apache 2 and PHP 5.2.10, as provided by Apple). Erik's blog post Memcached with PHP on Mac OS X has a step-by-step guide, based on the post Setup a Memcached-Enabled MAMP Sandbox Environment by Nate Haug, and I've basically followed the steps they outline.
  1. Install the Memcached service on Mac OS X: Follow the instructions in Nate Haug's post.

  2. Install Memcache PHP Extension: Apple's PHP doesn't come with the PECL package for memcache so download it. To compile it go:

    phpize
    ./configure
    make
    sudo make install

    One important point. If you are running 64-bit Mac OS X (as I am), ./configure by itself won't build a usable extension. However, a comment by Matt on Erik's original post provides the solution. Instead of just ./configure, type this at the command prompt:

    MACOSX_DEPLOYMENT_TARGET=10.5 CFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe -no-cpp-precomp" CCFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe" CXXFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -g -Os -pipe" LDFLAGS="-arch ppc -arch ppc64 -arch i386 -arch x86_64 -bind_at_load" ./configure


    Then follow the rest of Erik's instructions for adding the extension to your copy of PHP.

  3. Restart Apache: You can do this by restarting Web sharing in System preferences. Use the phpinfo(); command to check that the extension is working. You should see something like this:
    memcache.png

    If you don't see this, something's gone wrong. The Apache web log may help (for example, that's where I discovered that I had the problem reported by several people who commented on Erik's post.

  4. You can start the memcached daemon like this:

    # /bin/sh
    memcached -m 1 -l 127.0.0.1 -p 11211 -d

Now, I just need to explore how to actually use this...

Linking Bulletin of Zoological Nomenclature to BHL

5839_200px.1254992426.png

After some fussing and hair pulling I've constructed a demo of linking a journal to the Biodiversity Heritage Library and displaying the results in Zotero (see my earlier post for rationale).

After some searching I managed to retrieve metadata for several hundred article from the Bulletin of Zoological Nomenclature. Using a local copy of the BHL metadata, I wrote a script that looked up each article in BHL and found the URL of the first page in the article. I then created a Zotero group for this journal and uploaded the linked references.

You can browse the group, and if you belong to Zotero you can join the group and get a local copy of the references (which you can edit and correct, the harvesting won't be perfect).

I've also added these references to my
bioGUID OpenURL resolver, making it easy to find a given article. For example, the OpenURL link http://bioguid.info/openurl/?genre=article&issn=0007-5167&volume=51&spage=7 displays a page for the article "Doris grandiflora Rapp, 1827 (currently Dendrodoris grandiflora; Mollusca, Gastropoda): proposed conservation of the specific name", together with a link to the article in BHL.