
Showing posts with label matching. Show all posts

BioNames update - matching taxon names to classifications

One of the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and the taxon authority. In many cases this is straightforward, especially if there is only one taxon with a given name. But often the same name has been used more than once for different taxa. For example, here is what ION has for the name "Nystactes":
Nystactes Bohlke (ION 2735131)
Nystactes (ION 2787598)
Nystactes Gloger 1827 (ION 4888093)
Nystactes Kaup 1829 (ION 4888094)


If I want to map these names to GBIF, then these are the corresponding taxa with the name "Nystactes":
Nystactes Böhlke, 1957 (GBIF 2403398)
Nystactes Gloger, 1827 (GBIF 2475109)
Nystactes Kaup, 1829 (GBIF 3239722)


Clearly the names are almost identical, but there are enough little differences (presence or absence of a comma, "o" versus "ö") to make things interesting. To make the mapping I construct a bipartite graph where the nodes are taxon names, divided into two sets based on which database they came from. I then connect the nodes of the graph by edges, weighted by how similar the names are. For example, here is the graph for "Nystactes" (displayed using Google images):


I then compute the maximum weighted bipartite matching using a C++ program I wrote. This matching corresponds to the solid lines in the graph above.

In this way we can make a sensible guess as to how names in the two databases relate to one another.
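The matching step can be sketched in a few lines of JavaScript. This brute-force version is purely illustrative (it is not the C++ program mentioned above), and it assumes normalised edit distance as the edge weight:

```javascript
// Illustrative brute-force maximum weighted bipartite matching.
// Edge weights here are a simple normalised edit distance; a real
// implementation would use something like the Hungarian algorithm.

// Levenshtein edit distance between two strings.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                  // deletion
        d[i][j - 1] + 1,                                  // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Edge weight: similarity in [0, 1].
function similarity(a, b) {
  return 1 - editDistance(a, b) / Math.max(a.length, b.length);
}

// Try every assignment of left names to right names, keep the best.
// Fine for the handful of names in a cluster like "Nystactes".
function bestMatching(left, right) {
  let best = { score: -Infinity, pairs: [] };
  const permute = (remaining, chosen) => {
    if (chosen.length === left.length || remaining.length === 0) {
      const score = chosen.reduce((s, [l, r]) => s + similarity(l, r), 0);
      if (score > best.score) best = { score, pairs: chosen };
      return;
    }
    const l = left[chosen.length];
    for (let i = 0; i < remaining.length; i++) {
      permute(
        remaining.filter((_, j) => j !== i),
        chosen.concat([[l, remaining[i]]])
      );
    }
  };
  permute(right, []);
  return best.pairs;
}

const ion = ['Nystactes Bohlke', 'Nystactes Gloger 1827', 'Nystactes Kaup 1829'];
const gbif = ['Nystactes Böhlke, 1957', 'Nystactes Gloger, 1827', 'Nystactes Kaup, 1829'];
console.log(bestMatching(ion, gbif));
```

On the "Nystactes" example this pairs each ION name with the GBIF name sharing the same authority, despite the comma and diacritic differences.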

Fuzzy matching taxonomic names using ngrams

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching, so that if I type in a name and misspell it there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users. There are various ways to tackle this problem, and Tony Rees' TAXAMATCH is perhaps the best-known solution.

Cloudant supports Lucene for full-text searching, but while this allows some approximate matching (by appending "~" to the search string), initial experiments suggested it wasn't going to be terribly useful. What does seem to work is to use n-grams. As a crude example, here is a CouchDB view that converts a string (in this case a taxon name) into a series of trigrams (three-letter strings), then indexes their concatenation.


{
"_id": "_design/taxonname",
"language": "javascript",
"indexes": {
"all": {
"index": "function(doc) { if (doc.docType == 'taxonName') { var n = doc.nameComplete.length; var ngrams = []; for (var i=0; i < n-2;i++) { var ngram = doc.nameComplete.charAt(i) + doc.nameComplete.charAt(i+1) + doc.nameComplete.charAt(i+2); ngrams.push(ngram); } if (n > 2) { ngrams.push('$' + doc.nameComplete.charAt(0) + doc.nameComplete.charAt(1)); ngrams.push(doc.nameComplete.charAt(n-2) + doc.nameComplete.charAt(n-1) + '$'); } ngrams.sort(); index(\"default\", ngrams.join(' '), {\"store\": \"yes\"}); } }"
}
}
}

To search this view for a name I then generate trigrams for the query string (e.g., "Pomatomix" becomes "$Po Pom oma mat ato tom omi mix ix$" where "$" signals the start or end of the string) and search on that. For example, append this string to the URL of the CouchDB database to search for "Pomatomix":


_design/taxonname/_search/all?q=$Po%20Pom%20oma%20mat%20ato%20tom%20omi%20mix%20ix$&include_docs=true&limit=10
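The query-side trigram generation mirrors the logic of the view. A sketch (the "$" padding marks the start and end of the name; token order doesn't matter to the Lucene query):

```javascript
// Generate the trigram tokens for a query string: interior trigrams
// plus '$'-padded start and end markers, as in the view above.
function trigrams(name) {
  const grams = [];
  const n = name.length;
  for (let i = 0; i < n - 2; i++) {
    grams.push(name.slice(i, i + 3));
  }
  if (n > 2) {
    grams.push('$' + name.slice(0, 2)); // start marker
    grams.push(name.slice(n - 2) + '$'); // end marker
  }
  return grams;
}

console.log(trigrams('Pomatomix').join(' '));
// Pom oma mat ato tom omi mix $Po ix$
```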


Initial results are promising (searching on bigrams generated an alarming number of dubious matches). I need to do some more work on this, but it might be a simple and quick way to support "did you mean?" for taxonomic names.

Automating the creation of geophylogenies: NEXUS + delimited text = KML

One thing which has always frustrated me about geophylogenies is how tedious they are to create. In theory, they should be pretty straightforward to generate. We take a tree, get point localities for each leaf in the tree, and generate the KML to display on Google Earth. The tedious part is getting the latitude and longitude data in the right format, and linking the leaves in the tree to the locality data.

To help reduce the tedium I've created a tool that tries to automate this as much as possible. The goal is to be able to paste in a NEXUS tree and a table of localities, and get back a KML tree. Some publishers are making it easier to extract data from articles. For example, if you go to a paper such as http://dx.doi.org/10.1016/j.ympev.2009.07.011 you will see a widget on the right labelled Table download.

Elsevier

If you click on the Find tables button you can download the tables in CSV format. In this case, Table 1 has latitude and longitude data for all the taxa in the tree in TreeBASE study S10103. With some regular expressions we can figure out which column has the latitude and longitude data, and parse values like (10°12′N, 84°09′W) to extract the numerical values for latitude and longitude.
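Coordinate parsing along these lines can be sketched as below; the regular expression assumes a degrees-and-minutes format with N/S and E/W suffixes, as in the table from this paper:

```javascript
// Sketch: parse coordinates like "(10°12′N, 84°09′W)" into decimal
// degrees (south and west become negative values).
function parseCoordinates(text) {
  const re = /(\d+)\s*°\s*(\d+)\s*′\s*([NS])\s*,\s*(\d+)\s*°\s*(\d+)\s*′\s*([EW])/;
  const m = text.match(re);
  if (!m) return null;
  let latitude = parseInt(m[1], 10) + parseInt(m[2], 10) / 60;
  let longitude = parseInt(m[4], 10) + parseInt(m[5], 10) / 60;
  if (m[3] === 'S') latitude = -latitude;   // south of the equator
  if (m[6] === 'W') longitude = -longitude; // west of Greenwich
  return { latitude: latitude, longitude: longitude };
}

console.log(parseCoordinates('(10°12′N, 84°09′W)'));
// { latitude: 10.2, longitude: -84.15 }
```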

It is also pretty straightforward to be able to read a tree in NEXUS format and extract the taxon names. At this point we have two sets of names (those from the tree and those from the table) which might not be the same (in this case they aren't, we have "Craugastor cf. podiciferus FMNH 257672" and "FMNH 257672"). Matching these names up by hand would be tedious, but as described in Matching names in phylogeny data files we can use maximum weighted bipartite matching to compute an optimal matching between the two sets of labels.

Create KML tree

You can try the Create KML tree tool at http://iphylo.org/~rpage/phyloinformatics/kml/.

To get started, try it with the data below. In step 1 paste in the NEXUS tree, in step 2 paste in the table from the original paper. If all goes as it should, you will see a table displaying the matching, and the KML which you can save and open in Google Earth. If you have the Google Earth Plug-in installed, then you should see the KML displayed on Google Earth in your web browser.



I've tested the tool on only a few examples, so there will be cases where it fails. It also assumes that every taxon in the tree has latitude and longitude values, and that the first column in the table is the taxon name (you'll need to edit the file if this is not the case).

Here is the tree used in the example...


#NEXUS
BEGIN TREES;
TRANSLATE
Tl254954 'Craugastor cf. podiciferus FMNH 257672',
Tl254956 'Craugastor cf. podiciferus FMNH 257653',
Tl254965 'Craugastor cf. podiciferus UCR 16356',
Tl254960 'Craugastor sp. A USNM 563039',
Tl254938 'Craugastor sp. A USNM 563040',
Tl254945 'Craugastor cf. podiciferus UCR 16360',
Tl254928 'Craugastor cf. podiciferus UCR 17439',
Tl254959 'Craugastor cf. podiciferus UCR 17462',
Tl254951 'Craugastor cf. podiciferus FMNH 257596',
Tl254967 'Craugastor sp. A FMNH 257689',
Tl254934 'Craugastor cf. podiciferus UCR 16355',
Tl254964 'Craugastor cf. podiciferus FMNH 257671',
Tl254963 'Craugastor cf. podiciferus UCR 16358',
Tl254952 'Craugastor cf. podiciferus UCR 18062',
Tl254926 'Craugastor cf. podiciferus UCR 17442',
Tl254968 'Craugastor sp. A FMNH 257562',
Tl254939 'Craugastor cf. podiciferus UCR 17441',
Tl254946 'Craugastor cf. podiciferus FMNH 257757',
Tl254942 'Craugastor cf. podiciferus MVZ 149813',
Tl254961 'Craugastor cf. podiciferus FMNH 257595',
Tl254969 'Craugastor cf. podiciferus UCR 17469',
Tl254932 'Craugastor cf. podiciferus MVZ 164825',
Tl254970 'Craugastor sp. A AJC 0891',
Tl254943 'Craugastor cf. podiciferus UCR 16357',
Tl254929 'Craugastor cf. podiciferus FMNH 257673',
Tl254950 'Craugastor cf. podiciferus FMNH 257756',
Tl254944 'Craugastor cf. podiciferus FMNH 257652',
Tl254953 'Craugastor cf. podiciferus UCR 16359',
Tl254931 'Craugastor cf. podiciferus UCR 17443',
Tl254940 'Craugastor stejnegerianus UCR 16332',
Tl254935 'Craugastor underwoodi UCR 16315',
Tl254958 'Craugastor cf. podiciferus UCR 16354',
Tl254966 'Craugastor sp. A AJC 0890',
Tl254949 'Craugastor cf. podiciferus FMNH 257758',
Tl254933 'Craugastor cf. podiciferus UCR 16361',
Tl254962 'Craugastor cf. podiciferus FMNH 257651',
Tl254948 'Craugastor cf. podiciferus FMNH 257670',
Tl254971 'Craugastor cf. podiciferus FMNH 257669',
Tl254936 'Craugastor cf. podiciferus FMNH 257550',
Tl254957 'Craugastor underwoodi USNM 561403',
Tl254947 'Craugastor cf. podiciferus FMNH 257755',
Tl254927 'Craugastor cf. podiciferus UCR 16353',
Tl254925 'Craugastor bransfordii MVUP 1875',
Tl254930 'Craugastor cf. podiciferus UTA A 52449',
Tl254955 'Craugastor tabasarae MVUP 1720',
Tl254941 'Craugastor cf. longirostris FMNH 257678',
Tl254937 'Craugastor cf. longirostris FMNH 257561' ;
TREE 'Fig. 2' = ((Tl254955,(Tl254941,Tl254937)),(((((Tl254954,Tl254942,Tl254933,Tl254948,Tl254971),((Tl254934,Tl254958,Tl254927),((Tl254964,Tl254929),Tl254930))),(((Tl254965,(Tl254963,Tl254943)),(Tl254959,Tl254969),(Tl254951,Tl254961)),((Tl254928,Tl254926,Tl254939,Tl254931),(Tl254952,Tl254932)))),((((Tl254956,Tl254936),Tl254946,Tl254950,(Tl254944,Tl254962),Tl254947),Tl254949),(Tl254945,Tl254953))),((((Tl254960,Tl254938),(Tl254970,Tl254966)),(Tl254967,Tl254968)),((Tl254940,Tl254925),(Tl254935,Tl254957)))));
END;


...and here is the table:


Taxon and institutional voucher[a],Locality ID,Collection locality[b],Geographic coordinates/approximate location,Elevation (m),GenBank accession number 12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
3. FMNH 257669,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562320,EF562372,EF562380,EF562432
4. FMNH 257670,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562317,EF562336,EF562376,EF562421
5. FMNH 257671,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562314,EF562374,EF562409,None
6. FMNH 257672,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562318,None,EF562382,None
7. FMNH 257673,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562311,EF562343,EF562392,None
8. UCR 16361,3,"Alejuela, CR","(10°13′ N, 84°22′W)",1930,EF562321,EF562371,EF562375,EF562431
9. UCR 16353,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562313,EF562349,None,EF562420
10. UCR 16354,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562315,EF562363,None,EF562418
11. UCR 16355,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562316,EF562366,None,EF562419
12. UCR 18062,6,"Heredia, CR","(10°10′N, 84°06′W)",1900,EF562302,EF562342,EF562395,None
13. UCR 17439,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562298,EF562341,EF562387,EF562427
14. UCR 17441,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562299,EF562345,EF562388,EF562429
15. UCR 17442,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562300,EF562337,EF562385,EF562422
16. UCR 17443,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562301,EF562340,EF562384,EF562428
17. UCR 17462,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562309,EF562355,EF562406,EF562440
18. UCR 17469,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562310,EF562334,EF562405,EF562414
19. MVZ 164825,7,"Heredia, CR","(10° 05′N, 84° 04′W)",2100,EF562303,EF562346,EF562381,EF562423
20. UCR 16357,8,"San José, CR","(10°02′N, 83°57′W)",1600,EF562306,EF562339,EF562400,EF562433
21. UCR 16358,8,"San José, CR","(10°02′N, 83°57′W)",1600,EF562307,EF562370,EF562412,EF562415
22. UCR 16356,8,"San José, CR","(10°01′N, 83°56′W)",1940,EF562308,EF562329,None,None
23. UCR 16359,10,"San José, CR","(9°26′N, 83°41′W)",1313,EF562297,EF562369,EF562396,None
24. UCR 16360,10,"San José, CR","(9°26′N, 83°41′W)",1313,EF562296,EF562368,None,EF562434
25. FMNH 257595,9,"Cartago, CR","(9°44′N, 83°46′W)",1600,EF562304,EF562338,EF562408,None
26. FMNH 257596,9,"Cartago, CR","(9°44′N, 83°46′W)",1600,EF562305,EF562335,None,EF562416
27. FMNH 257550,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562294,EF562330,EF562393,EF562443
28. FMNH 257651,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562291,EF562367,EF562402,EF562435
29. FMNH 257652,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562288,EF562364,EF562390,None
30. FMNH 257653,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562292,EF562354,EF562392,EF562438
31. FMNH 257755,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562289,EF562344,EF562379,None
32. FMNH 257756,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562290,EF562347,EF562377,EF562413
33. FMNH 257757,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562293,EF562352,EF562383,EF562437
34. FMNH 257758,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562295,EF562348,EF562397,EF562436
35. USNM 563039,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562284,EF562356,EF562389,EF562445
36. USNM 563040,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562285,EF562350,EF562391,EF562439
37. AJC 0890,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562282,EF562351,EF562398,EF562444
38. MVUP 1880,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562283,EF562358,EF562399,EF562442
39. FMNH 257689,12,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562287,EF562353,EF562407,EF562446
40. FMNH 257562,12,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562286,EF562357,EF562410,EF562441
41. USNM 561403,N/A,"Heredia, CR","(10°24′N, 84°03′W)",800,EF562323,EF562361,EF562378,None
42. UCR 16315,N/A,"Alejuela, CR","(10°13′N, 84°35′W)",960,EF562322,EF562362,EF562394,None
43. UCR 16332,N/A,"San José, CR","(9°18′N, 83°46′W)",900,EF562325,EF562360,EF562411,AY211320
44. MVUP 1875 fitzingeri group,N/A,"BDT, PA","(9°24′N, 82°17′W)",50,EF562324,EF562359,None,AY211304
45. MVUP 1720,N/A,"Coclé, PA","(8°40′N, 80°35′W)",800,EF562326,EF562332,EF562401,EF562424
46. FMNH 257561,N/A,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562327,EF562331,None,EF562426
47. FMNH 257678,N/A,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562328,EF562333,EF562404,EF562425

Phantom articles: why Mendeley needs to make duplication transparent

Browsing Mendeley I found the following record: http://www.mendeley.com/research/description-larva/. This URL is for a paper
Costa, J. M., & Santos, T. C. (2008). Description of the larva of. Zootaxa, 99(2), 129-131
which apparently has the DOI doi:10.1645/GE-2580.1. This is strange because Zootaxa doesn't have DOIs. The DOI given resolves to a paper in the Journal of Parasitology:
Harriman, V. B., Galloway, T. D., Alisauskas, R. T., & Wobeser, G. A. (2011). Description of the larva of Ceratophyllus vagabundus vagabundus (Siphonaptera: Ceratophyllidae) from nests of Ross's and lesser snow geese in Nunavut, Canada. The Journal of Parasitology, 93(2), 197-200
Now, this paper has its own record in Mendeley.

OK, so this is weird..., but it gets weirder. If you look at the Mendeley page for this chimeric article there is a PDF preview of yet another article:
LOPES, Maria José Nascimento; FROEHLICH, Claudio Gilberto and DOMINGUEZ, Eduardo (2003). Description of the larva of Thraulodes schlingeri (Ephemeroptera, Leptophlebiidae). Iheringia, Sér. Zool. 92(2), 197-200 2003 doi:10.1590/S0073-47212003000200011
Mendeley duplicate

But it gets even more interesting. The abstract for the phantom Zootaxa article belongs to yet another paper:
Marques, K. I. D. S., & Xerez, R. D.Description of the larva of Popanomyia kerteszi James & Woodley (Diptera: Stratiomyidae) and identification key to immature stages of Pachygastrinae. Neotropical Entomology, 38(5), 643-648.
which also exists in Mendeley.

To investigate further I used Mendeley's API to retrieve this record. To do this I had to look at the source of the web page to find the internal identifier used by Mendeley, namely 010c48d0-edb5-11df-99a6-0024e8453de6 (why does Mendeley hide these?). Here's the abbreviated JSON for this record.

{
  ...
  "website": "http:\/\/www.ncbi.nlm.nih.gov\/pubmed\/21506868",
  "identifiers": {
    "pmid": "21506868",
    "issn": "19372345",
    "doi": "10.1645\/GE-2580.1"
  },
  ...
  "issue": "2",
  "pages": "129-131",
  "public_file_hash": "fe7eed3f6c43a3be1480a0937229b9ad33666df4",
  "publication_outlet": "Zootaxa",
  "type": "Journal Article",
  "mendeley_url": "http:\/\/www.mendeley.com\/research\/description-larva\/",
  "uuid": "010c48d0-edb5-11df-99a6-0024e8453de6",
  "authors": [
    {
      "forename": "J M",
      "surname": "Costa"
    },
    {
      "forename": "T C",
      "surname": "Santos"
    }
  ],
  "title": "Description of the larva of",
  "volume": "99",
  "year": 2008,
  "categories": [39, 203, 37, 52, 43, 40, 210],
  "oa_journal": false
}

This doesn't add much to the story, but it does give us the SHA-1 hash of the PDF for the chimeric article (fe7eed3f6c43a3be1480a0937229b9ad33666df4). If I download the PDF for the article in Iheringia, Sér. Zool. it has the same SHA-1:


openssl sha1 a11v93n2.pdf
SHA1(a11v93n2.pdf)= fe7eed3f6c43a3be1480a0937229b9ad33666df4

This article doesn't exist
So, to summarise, this paper doesn't exist. It is credited to a journal that doesn't have DOIs, the DOI resolves to an article in a different journal, the abstract comes from another article in another journal, and the PDF is from a third article. OMG!

This is just weird
So, something about the way Mendeley merges references is broken. Merging references is a tough problem, so there will always be cases where things go wrong. But it would be really, really helpful if Mendeley could display the set of articles that it has merged to create each canonical reference (say, by listing the UUIDs for each article). Users could then see if badness had happened, and provide feedback, for example by highlighting references that are clearly the same, and those that are clearly different. Until this happens I'm a bit nervous about trusting Mendeley with my bibliographic data: I don't want it mangled into chimeric papers that don't exist.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, and there might be errors (e.g., OCR mistakes). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.

Might not be elegant, but I suspect it would be fast.
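Steps 2 and 3 above can be sketched as follows. The citation styles, reference fields, and threshold here are all illustrative assumptions, and the distance measure is plain Levenshtein:

```javascript
// Render each known reference in a few common citation styles, then
// pick the reference whose rendering is closest to the query string.

// Levenshtein edit distance between two strings.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
  }
  return d[a.length][b.length];
}

// A handful of made-up citation styles (step 2).
const styles = [
  r => `${r.authors} (${r.year}). ${r.title}. ${r.journal} ${r.volume}: ${r.pages}.`,
  r => `${r.authors}. ${r.title}. ${r.journal}, ${r.volume}, ${r.pages} (${r.year}).`,
  r => `${r.authors}, ${r.year}, ${r.journal} ${r.volume}: ${r.pages}.` // "microcitation"
];

// Approximate matching against the generated universe (step 3).
function matchCitation(query, references, threshold = 0.7) {
  let best = { score: 0, ref: null };
  for (const ref of references) {
    for (const style of styles) {
      const candidate = style(ref);
      const score = 1 - editDistance(query, candidate) /
                        Math.max(query.length, candidate.length);
      if (score > best.score) best = { score, ref };
    }
  }
  return best.score >= threshold ? best.ref : null; // below threshold: no match
}

// A hypothetical reference for illustration.
const references = [{
  authors: 'Smith, J.', year: 1995, title: 'A new species of Aus',
  journal: 'Zootaxa', volume: '12', pages: '1-10'
}];

console.log(matchCitation('Smith, J. (1995). A new species of Aus. Zootaxa 12: 1-10.', references));
console.log(matchCitation('completely unrelated string', references));
```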

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as from lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of these strings is associated with an identifier (such as a DOI), or if one of the strings in the cluster has been successfully parsed into its component metadata so we could find it using an OpenURL resolver, then we've identified the reference the strings correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...
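The canopy idea can be sketched like this. It is a simplification (the real method uses a loose threshold T1 and a tight threshold T2; here a single threshold does both jobs), with word overlap standing in for the cheap similarity measure:

```javascript
// Sketch of canopy clustering: a cheap similarity puts strings into
// canopies; an expensive comparison would then run only within each
// canopy rather than across the whole set.
function canopies(strings, cheapSim, threshold = 0.5) {
  const remaining = new Set(strings);
  const result = [];
  while (remaining.size > 0) {
    const [center] = remaining; // arbitrary canopy centre
    const canopy = strings.filter(s => cheapSim(center, s) >= threshold);
    result.push(canopy);
    canopy.forEach(s => remaining.delete(s));
    remaining.delete(center);
  }
  return result;
}

// Cheap similarity: Jaccard overlap of word sets.
function wordJaccard(a, b) {
  const A = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const shared = [...A].filter(w => B.has(w)).length;
  return shared / (A.size + B.size - shared);
}

// Hypothetical citation strings: two versions of one paper, one other.
const citations = [
  'Smith 1995. A new species of Aus. Zootaxa 12: 1-10',
  'Smith (1995) A new species of Aus, Zootaxa 12, 1-10',
  'Jones 2001. Climate and birds. Nature 410: 200-205'
];
console.log(canopies(citations, wordJaccard));
```

The two variants of the same citation land in one canopy, the unrelated citation in another, so an expensive string comparison only ever runs inside each small group.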

n-gram fulltext indexing in MySQL

Continuing with my exploration of the Biodiversity Heritage Library one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as:

Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl. Mus., 288: 177. Holotype: FMNH 69701, by original designation. Type locality: "in [Departamento] Huila, Colombia, at San Antonio, a small village 25 kilometers west of San Agustín, at 2,300 meters".


The journal that ASW abbreviates as "Bull. U.S. Natl. Mus." is in the BHL, which gives its title as "Bulletin - United States National Museum.". How do I link these two records? In my bioGUID OpenURL project I've been doing things like using SQL LIKE statements with periods (.) replaced by wildcards ('%') to find journal titles that match abbreviations (as well as building a database of these abbreviations). But this is error prone, and won't work for abbreviations such as "Bull. U.S. Natl. Mus." because the word "National" has been abbreviated to "Natl", which isn't a substring of "National".
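The wildcard trick can be sketched in one line, which also shows exactly where it breaks down: the resulting pattern still requires "Natl" to appear as a contiguous substring, and it doesn't occur in "National":

```javascript
// Turn an abbreviated journal title into a SQL LIKE pattern by
// replacing each '.' with the wildcard '%'. Fails for truncated
// words like "Natl", which is not a substring of "National".
function abbreviationToLikePattern(abbrev) {
  return abbrev.replace(/\./g, '%');
}

console.log(abbreviationToLikePattern('Bull. U.S. Natl. Mus.'));
// Bull% U%S% Natl% Mus%
```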

After exploring various methods (including longest common subsequences, and sequence alignment algorithms) I came across a MySQL plugin for n-grams. The plugin tokenises strings into bi-grams (tokens with just two characters, see the Wikipedia page on N-grams for more information). This means that even though as words "National" and "Natl" are different, they will have some similarity due to the shared bi-grams "Na" and "at".

So, I grabbed the source for the plugin and the ICU dependency, compiled the plugin and added it to MySQL (I'm running MySQL 5.1.34 on Mac OS X 10.5.8). The plugin can be added while the MySQL server is running using this SQL command:

INSTALL PLUGIN bigram SONAME 'libftbigram.so';

Initial experiments seem promising. For the bhl_title table I created a bi-gram index:

ALTER TABLE `bhl_title` ADD FULLTEXT (`ShortTitle`) WITH PARSER bigram;

If I then take the abbreviation "Bull. U.S. Natl. Mus.", strip out the punctuation, and search for the resulting string ("Bull US Natl Mus")

SELECT TitleID, ShortTitle, MATCH(ShortTitle) AGAINST('Bull U S Natl Museum')
AS score FROM bhl_title
WHERE MATCH(ShortTitle) AGAINST('Bull U S Natl Museum') LIMIT 5;

I get this:
TitleID  ShortTitle                                          score
7548     Bulletin - United States National Museum.           19.4019603729248
13855    Bulletin du Muséum National d'Histoire Naturelle.   17.6493873596191
14964    Bulletin du Muséum National d'Histoire Naturelle.   17.6493873596191
5943     Bulletin du Muséum national d'histoire naturelle.   17.6493873596191
12908    Bulletin du Muséum National d'Histoire Naturelle.   17.6493873596191


The journal we want is the top hit (if only just). I'll probably have to do some post-processing to check that the top hit makes sense (e.g., is it a supersequence of the search term?) but this looks like a promising way to match abbreviated journal names and book titles to records in BHL (and other databases).