
Showing posts with label taxonomic name. Show all posts

BioNames update - matching taxon names to classifications

One of the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and taxon authority. In many cases this is straightforward, especially if there is only one taxon with a name. But often we have cases where the same name has been used more than once for different taxa. For example, here is what ION has for the name "Nystactes":
Name                     ION ID
Nystactes Bohlke         2735131
Nystactes                2787598
Nystactes Gloger 1827    4888093
Nystactes Kaup 1829      4888094


If I want to map these names to GBIF then these are the corresponding taxa with the name "Nystactes":
Name                      GBIF ID
Nystactes Böhlke, 1957    2403398
Nystactes Gloger, 1827    2475109
Nystactes Kaup, 1829      3239722


Clearly the names are almost identical, but there are enough little differences (presence or absence of comma, "o" versus "ö") to make things interesting. To make the mapping I construct a bipartite graph where the nodes are taxon names, divided into two sets based on which database they came from. I then connect the nodes of the graph by edges, weighted by how similar the names are. For example, here is the graph for "Nystactes" (displayed using Google images):


I then compute the maximum weighted bipartite matching using a C++ program I wrote. This matching corresponds to the solid lines in the graph above.

In this way we can make a sensible guess as to how names in the two databases relate to one another.
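The matching step can be sketched in a few lines of Python. This is not the C++ program mentioned above: it is a brute-force illustration that tries every assignment, which is only practical for a handful of names. The edge weight (the ratio from difflib's SequenceMatcher) is an assumption, since the similarity measure actually used isn't stated here.

```python
from difflib import SequenceMatcher
from itertools import permutations

ion = ["Nystactes Bohlke", "Nystactes",
       "Nystactes Gloger 1827", "Nystactes Kaup 1829"]
gbif = ["Nystactes Böhlke, 1957", "Nystactes Gloger, 1827",
        "Nystactes Kaup, 1829"]


def similarity(a, b):
    """Edge weight between two name strings (assumed measure), 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matching(left, right):
    """Return the maximum-weight matching as (left, right, weight) triples."""
    # Permute the longer side so every node on the shorter side gets a partner.
    swap = len(left) > len(right)
    short, long_ = (right, left) if swap else (left, right)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(s, long_[j], similarity(s, long_[j]))
                 for s, j in zip(short, perm)]
        total = sum(w for _, _, w in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if swap:
        # Restore the (left, right, weight) orientation.
        best_pairs = [(l, r, w) for r, l, w in best_pairs]
    return best_pairs


for ion_name, gbif_name, weight in best_matching(ion, gbif):
    print(f"{ion_name!r} <-> {gbif_name!r} ({weight:.2f})")
```

With these seven strings the bare "Nystactes" is left unmatched, because the GBIF side has only three nodes. For larger sets a proper assignment algorithm (for example the Hungarian method) would replace the brute-force loop.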

ZooBank data model

I'm trying to get my head around the data model used by ZooBank to store taxonomic names. To do this, I've built a graph for the species Belonoperca pylei, described by Baldwin & Smith in:
Baldwin, C. C., & Smith, W. L. (1998). Belonoperca pylei, a new species of seabass (Teleostei: Serranidae: Epinephelinae: Diploprionini) from the Cook Islands with comments on relationships among diploprionins. Ichthyological Research, 45(4), 325–339. doi:10.1007/BF02725185

After extracting some data from the ZooBank API I created a DOT file connecting the various "taxon name usages" associated with Belonoperca pylei and constructed a graph using GraphViz:
Zoobank
You can grab the DOT file here, and a bigger version of the image is on Flickr. I've labelled taxon names and references with plain text as well as the UUIDs that serve as identifiers in ZooBank. (Update: the original diagram had Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer [9F53EF10-30EE-4445-A071-6112D998B09B] in the wrong place, which I've now fixed.)

This is a fairly simple case of a single species, but it's already starting to look a tad complicated. We have Belonoperca pylei Baldwin & Smith, 1998 linked to its original description (doi:10.1007/BF02725185) and to the genus Belonoperca Fowler & Bean, 1930 (linked to its original publication http://biostor.org/reference/105997) as interpreted by ("sensu") Baldwin & Smith, 1998. Belonoperca Fowler & Bean 1930 sensu Baldwin & Smith 1998 is linked to the original use of that genus (i.e., Belonoperca Fowler & Bean, 1930). Then we have the species Belonoperca pylei Baldwin & Smith, 1998 as understood in Eschmeyer's 2004 checklist.

Notice that each usage of a taxon name gets linked back to a previous usage, and names are linked to higher names in a taxonomic hierarchy. When the species Belonoperca pylei was described it was placed in the genus Belonoperca, when Belonoperca was described it was placed in the family Serranidae, and so on.
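As a rough illustration of the two kinds of link described above, here is a sketch that writes a small DOT graph from a handful of usages. The record layout, the short keys, and the choice of dashed versus solid edges are all assumptions made for this example. Real ZooBank usages are identified by UUIDs, and this is not the DOT file linked above.

```python
# Hypothetical records: "prior" points to the earlier usage of the same name,
# "parent" points to the higher name the usage was placed in.
usages = {
    "serranidae": {
        "label": "Serranidae", "prior": None, "parent": None},
    "belonoperca_1930": {
        "label": "Belonoperca Fowler & Bean, 1930",
        "prior": None, "parent": "serranidae"},
    "belonoperca_sensu_1998": {
        "label": "Belonoperca Fowler & Bean, 1930 sensu Baldwin & Smith, 1998",
        "prior": "belonoperca_1930", "parent": None},
    "pylei_1998": {
        "label": "Belonoperca pylei Baldwin & Smith, 1998",
        "prior": None, "parent": "belonoperca_sensu_1998"},
    "pylei_sensu_eschmeyer": {
        "label": "Belonoperca pylei Baldwin & Smith, 1998 sensu Eschmeyer",
        "prior": "pylei_1998", "parent": None},
}


def to_dot(usages):
    """Serialise the usages as a GraphViz digraph."""
    lines = ["digraph usages {"]
    for key, usage in usages.items():
        label = usage["label"]
        lines.append(f'  "{key}" [label="{label}"];')
    for key, usage in usages.items():
        if usage["parent"]:
            # Solid edge: placement in a higher name.
            lines.append(f'  "{key}" -> "{usage["parent"]}";')
        if usage["prior"]:
            # Dashed edge: link back to a previous usage of the name.
            lines.append(f'  "{key}" -> "{usage["prior"]}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(usages)
print(dot)
```

The output can be rendered with GraphViz (for example `dot -Tpng`).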

Fuzzy matching taxonomic names using ngrams

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users. There are various ways to tackle this problem, and Tony Rees' TAXAMATCH is perhaps the best known solution.

Cloudant supports Lucene for full text searching, but while this allows some possibility for approximate matching (by appending "~" to the search string) initial experiments suggested it wasn't going to be terribly useful. What does seem to work is to use ngrams. As a crude example, here is a CouchDB view that converts a string (in this case a taxon name) to a series of trigrams (three letter strings), then indexes their concatenation.


{
"_id": "_design/taxonname",
"language": "javascript",
"indexes": {
"all": {
"index": "function(doc) { if (doc.docType == 'taxonName') { var n = doc.nameComplete.length; var ngrams = []; for (var i=0; i < n-2;i++) { var ngram = doc.nameComplete.charAt(i) + doc.nameComplete.charAt(i+1) + doc.nameComplete.charAt(i+2); ngrams.push(ngram); } if (n > 2) { ngrams.push('$' + doc.nameComplete.charAt(0) + doc.nameComplete.charAt(1)); ngrams.push(doc.nameComplete.charAt(n-2) + doc.nameComplete.charAt(n-1) + '$'); } ngrams.sort(); index(\"default\", ngrams.join(' '), {\"store\": \"yes\"}); } }"
}
}
}

To search this view for a name I then generate trigrams for the query string (e.g., "Pomatomix" becomes "$Po Pom oma mat ato tom omi mix ix$" where "$" signals the start or end of the string) and search on that. For example, append this string to the URL of the CouchDB database to search for "Pomatomix":


_design/taxonname/_search/all?q=$Po%20Pom%20oma%20mat%20ato%20tom%20omi%20mix%20ix$&include_docs=true&limit=10
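Generating the query side can be sketched as follows. The function names `trigrams` and `search_path` are made up for this example. The logic mirrors the index function above, except that the trigrams are kept in reading order (as in the "Pomatomix" example) rather than sorted.

```python
from urllib.parse import quote


def trigrams(name):
    """Split a name into trigrams, with '$' marking the start and end."""
    n = len(name)
    grams = []
    if n > 2:
        grams.append("$" + name[:2])
    grams.extend(name[i:i + 3] for i in range(n - 2))
    if n > 2:
        grams.append(name[-2:] + "$")
    return grams


def search_path(name, limit=10):
    """Build the path and query string to append to the database URL."""
    query = quote(" ".join(trigrams(name)), safe="$")
    return (f"_design/taxonname/_search/all?q={query}"
            f"&include_docs=true&limit={limit}")


print(" ".join(trigrams("Pomatomix")))
print(search_path("Pomatomix"))
```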


Initial results are promising (searching on bigrams generated an alarming number of matches that seemed rather dubious). I need to do some more work on this, but it might be a simple and quick way to support "did you mean?" for taxonomic names.

Taxonomy and the nine billion names of God

In Arthur C. Clarke's short story The Nine Billion Names of God Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names, and that once that goal is achieved the Universe will end. As the understandably skeptical programmers leave having completed their task, they look up into the sky and notice that "overhead, without any fuss, the stars were going out."

Leaving aside the delicious irony that arises if we recast this story with the monks replaced by taxonomists, much of our work with taxonomic names seems to be enumerating endless permutations of the same names. Part of the problem is the way some databases store and provide access to names.


The simplest way to represent a taxonomic name is to just have the name (the "canonical name"), without additional bits such as the taxonomic authority. In my view, any taxonomic database that serves names should provide the canonical name. I'm not arguing that they shouldn't provide taxonomic authority information (ideally separately, but could also be as part of a canonical name + authority string), I just want them to also provide just the canonical name. For some reason this seems to upset people (e.g., this thread on the TDWG mailing lists), so let me explain why I think this matters.

Most people use taxonomic names without the authority (just Google a taxonomic name with and without its authority and compare the number of hits). So, if your goal is to be of service to your users, make sure you provide the canonical name.

Then there is the issue of integrating data from different sources. The more parts to the name, the more scope there is for ambiguity. For example, my first ever publication was a description of a new species of pea crab, Pinnotheres atrinicola, published in:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

If we look for this name in ION we discover three records:

Pinnotheres atrinacola              urn:lsid:organismnames.com:name:1192320
Pinnotheres atrinicola              urn:lsid:organismnames.com:name:371872
Pinnotheres atrinicola Page 1983    urn:lsid:organismnames.com:name:371873


Two are duplicates of "Pinnotheres atrinicola", with and without the authority, one is a misspelling ("Pinnotheres atrinacola"). Given just the name we already see that it's easy for people to get the spelling wrong and generate lexical variants.

If we now add the authority we get more potential for variation. ION writes the authority as "Page 1983" (no comma), but other databases such as WoRMS write it as "Page, 1983" (with comma). So we now have two variations of the name, and two for the authority, so 4 possible strings if we include both name and authority. This combinatorial explosion means that we can rapidly generate lots of strings that are fundamentally the same.
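The arithmetic is easy to check with a toy enumeration, using only the two spellings and two authority styles mentioned above:

```python
from itertools import product

# The correct spelling plus the misspelling found in ION.
names = ["Pinnotheres atrinicola", "Pinnotheres atrinacola"]
# The same authority written without and with a comma.
authorities = ["Page 1983", "Page, 1983"]

variants = [f"{name} {authority}"
            for name, authority in product(names, authorities)]

for variant in variants:
    print(variant)
print(len(names), "name strings without authority,",
      len(variants), "with authority")
```

Each additional source of variation (parentheses, abbreviated author names, disputed dates) multiplies the count again.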

I'm not arguing that taxonomic authorities aren't useful, and I want them wherever they are known, but insisting that databases serve name + authority to the exclusion of just the canonical name is a recipe for disaster. One could argue that users can parse the string into name and authority components, but that's a headache (just take a look at taxon-name-processing for details). Why make users jump through hoops to get basic information?

Another reason I'm wary of taxonomic authority strings is that people don't always understand the conventions. For example, in my previous post I used the following example for names that differed in authority string:

  • Demansia torquata Günther 1862
  • Demansia torquata (Günther, 1862)

The use of parentheses seems a small difference, but (a) it means the strings are different, and (b) the presence or absence of parentheses changes the meaning of the authority. In this example, Demansia torquata Günther 1862 means that Günther is the original author of the name Demansia torquata, and so if I search Günther's publications from 1862 for "Demansia torquata" I will find that name. Demansia torquata (Günther, 1862), on the other hand, means that Günther originally described this species in 1862, but he placed it in a different genus, so my search for "Demansia torquata" in 1862 is likely to be fruitless. So, if the authority is actually (Günther, 1862) but a database tells me it's Günther, 1862 I'd be wasting my time looking for the name in 1862.
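To make the distinction concrete, here is a minimal sketch of reading an authority string of the simple "Author, year" kind. The function name and the regular expression are illustrative assumptions. Real authority strings are far messier than this pattern allows, which is rather the point.

```python
import re

# Author text, an optional comma, then a four-digit year.
AUTHOR_YEAR = re.compile(r"^(?P<author>.+?),?\s+(?P<year>\d{4})$")


def parse_authority(authority):
    """Return author, year, and whether the name is in its original genus."""
    text = authority.strip()
    in_parentheses = text.startswith("(") and text.endswith(")")
    if in_parentheses:
        text = text[1:-1].strip()
    match = AUTHOR_YEAR.match(text)
    if match is None:
        return None  # not a simple "Author, year" string
    return {
        "author": match.group("author"),
        "year": int(match.group("year")),
        # Parentheses signal that the species was first described
        # in a different genus.
        "original_combination": not in_parentheses,
    }


print(parse_authority("Günther 1862"))
print(parse_authority("(Günther, 1862)"))
```

Both strings yield the same author and year, and only the `original_combination` flag differs, so a comparison that ignores the parentheses loses exactly the information discussed above.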

As it turns out, this snake was originally described as Diemansia torquata (see "On new species of snakes in the collection of the British Museum" http://biostor.org/reference/50221). The genus name Diemansia differs from Demansia, hence (Günther, 1862) should be correct, but it looks like Diemansia and Demansia are just some of the variations of the same snake genus (see for example http://biodiversitylibrary.org/page/22393791). *Sigh*

Variation in taxonomic authority extends beyond parentheses. In a post on clustering strings I used examples of taxonomic authorities for the genus Helicella:

Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821

There are six different strings here which correspond to three different authorities. In this example the name Helicella is a homonym (same name used for different taxa) so having the taxonomic authority can help decide which name is actually meant, but people can't seem to agree on how to spell the authority names, and in other cases they might not agree on dates of publication, hence we get variations such as those above. Even when authorities are useful, they come at a cost. And that's not even considering chresonyms where the authority isn't the original author, but instead is a form of citation of the use of a name.

All of this variation is a cause of ambiguity, and when we combine permutations of taxonomic names and taxonomic authorities, things start to get messy. Indeed, I'd argue that projects such as the Global Names Index (GNI) are essentially doing what Arthur C. Clarke's monks were doing, trying to capture near endless permutations of the same names. Given this, it seems crazy not to try and keep things as simple as possible. In the vast majority of cases I want the name, I don't want the rest of the cruft attached to it. Taxonomic authorities are really just proxies for citation, so let's focus on getting that information linked to names, and stop making life difficult for users.

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.

The services I've implemented so far are:
  • EOL http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
  • NCBI taxonomy http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php
  • uBio FindIT http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php
  • WORMS http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php
  • GBIF http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php
  • Global Names Index http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php


To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS services do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".

How to use reconciliation services

Start a Google Refine session. Save the names below to a text file and open it as a new project.

Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae


You should see something like this:
Refine1

Click on the column header Names and choose Reconcile → Start reconciling.

Refine2

A dialog will pop up asking you to select a service.

Refine3

If you've already added a service it will be in the list on the left. If not, click the Add Standard Services... button at the bottom left and paste in the URL (in this case http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).

Once the service has loaded click on Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Refine4

Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the Names column header and choose Edit column → Add column based on this column.... A dialog box will be displayed:

Refine6

In the box labelled Expression enter cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

Refine7

You could also get the names uBio extracted by creating a column based on the values of cell.recon.match.name. To compare this with the original values, click on the Names column header and choose Reconcile → Actions → Clear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

Refine8

These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would be to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).

Development notes
The code for these services is in Github (undocumented as yet, that's on the to do list). I had a few hiccups getting these services to work. There is detailed documentation at http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. Once these issues were sorted out (turning on the Java console as per David Huynh's tip helped) things work pretty well.
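For anyone writing their own service, the core of "Multiple Query Mode" can be sketched like this. It is not the code behind the services listed above. The lookup table and its identifiers are placeholders, and the field names follow the reconciliation API documentation as I understand it, so treat them as assumptions to check against the current spec. Google Refine POSTs a `queries` parameter containing a JSON object keyed by query id, and expects a JSON object with the same keys in return.

```python
import json

# Placeholder lookup table: the identifiers are invented for this sketch.
NAMES = {
    "toxoplasma gondii": "example:1",
    "pinnotheres": "example:2",
}


def reconcile(queries_json):
    """Answer a batch of queries, e.g. '{"q0": {"query": "Pinnotheres"}}'."""
    queries = json.loads(queries_json)
    response = {}
    for key, query in queries.items():
        name = query["query"].strip()
        identifier = NAMES.get(name.lower())
        results = []
        if identifier is not None:
            results.append({
                "id": identifier,
                "name": name,
                "type": ["taxon name"],  # assumed type label
                "score": 1.0,
                "match": True,           # exact match, no review needed
            })
        response[key] = {"result": results}
    return json.dumps(response)


posted = '{"q0": {"query": "Toxoplasma gondii"}, "q1": {"query": "Hyperiidae"}}'
print(reconcile(posted))
```

A real service would wrap this in a handler that reads the `queries` value from the POST body and replaces the dictionary lookup with a call to the external name database.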

Journals I'd like BHL to scan

I've recently updated my database of links between animal taxonomic names and literature identifiers, which now has over 280,000 names linked to some form of identifier (127,000 of these being DOIs). You can see the current version here:

http://iphylo.org/~rpage/itaxon/

As an experiment I've added a feature to list the number of names for each journal. Based on this list (limited to journals that I've found an ISSN for) here are some journals I'd like to see digitised by the Biodiversity Heritage Library (BHL). Note that by digitised I mean beyond the 1923 cutoff applied to many journals. This will mean negotiating with the journal publishers, but in a number of cases these are scientific societies or institutions, some associated with BHL. Given that major partners in BHL have made post-1923 content available, it would be nice to extend this to other key taxonomic journals.

Revue Suisse de Zoologie

Revue Suisse de Zoologie has published nearly 10,000 taxonomic names but has essentially zero digital presence, which is extraordinary. Another Swiss journal, Entomologica Basiliensia, is also an obvious candidate.

Revue de Zoologie et de Botanique Africaines

Revue de Zoologie et de Botanique Africaines has published over 5,000 names, and given the interest in providing information resources for Africa (e.g., http://www.mendeley.com/groups/1681811/bhl-africa/) this seems an obvious journal to scan completely.

Bulletin of the British Museum (Natural History) journals and books

The Natural History Museum [formerly British Museum (Natural History)] is a member of BHL so I'd expect it to have better coverage of its own publications in BHL. There are gaps in journals such as Bulletin of the British Museum (Natural History) Entomology, which means there is a significant chunk of research published by Museum staff that simply doesn't exist digitally. At one point The Natural History Museum renamed the journals and moved them to Cambridge University Press, resulting in further gaps in digitisation. It's interesting that museums that haven't changed the title of their publications (such as the American Museum of Natural History and the Australian Museum) have better digital coverage than the NHM, which has flirted with various title changes in the last few decades. The Museum also published a series of monographs in the 20th century, and many of these aren't in BHL.

Memoirs of the Queensland Museum

The Memoirs of the Queensland Museum is an important journal (> 3,000 names) but has only early issues scanned in BHL and recent issues as PDFs on the Museum web site (vulnerable to link rot when the site gets redesigned, as I've discovered to my cost).

Russian journals

Russian journals contain large numbers of taxonomic descriptions, but their digital presence is patchy. Springer has started to publish translations online (e.g., http://dx.doi.org/10.1134/S0013873810050155 in Entomological Review, which is a translation of an article in Zoologicheskii Zhurnal), but much of the Russian literature seems unavailable in digital form. BHL has spread from its US-UK origins to BHL-Europe, BHL_China, and BHL_Australia. Maybe it's time for BHL-Russia?

Summary

There are huge holes in the availability of taxonomic literature (where I equate "availability" with being digitised and online, free or otherwise). But on the other hand I've been pleasantly surprised by just how much taxonomic literature is online. It looks quite feasible to link at least 300,000 animal names to digital publications.

The journals I've highlighted are just a few obvious candidates for scanning. I suspect that as one goes down the list of taxonomic journals the rate of return will decline, to the point where scanning entire journals will be less efficient than scanning targeted articles.