Linking GBIF and the Biodiversity Heritage Library

Following on from exploring links between GBIF and GenBank, here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates. Given that many articles in BioStor list museum specimens, I wrote some code to extract these (see Extracting museum specimen codes from text) and applied this to the OCR text for those articles.

Having a list of specimens is nice, but in this digital age I want to be able to find out more about these specimens. An obvious solution is to try and match these specimen codes to the specimen records held by GBIF. Linking to GBIF is complicated by the fact that museum codes are not unique. For example, "FMNH 147942" could refer to a bird, an amphibian, or a mammal. To tackle the non-uniqueness I use the taxonomic names extracted from each page by BHL to work out what taxon an article is mainly "about". To do this I use the Catalogue of Life classification to get "paths" for each name (i.e., the lineage of each taxon down to the root of the classification) and then find the majority-rule path. You can see these paths in the "Taxonomic classification" displayed on a page for a BioStor article. If there are multiple GBIF specimens for the same code I test whether the taxon of rank "class" in the GBIF record is in the majority-rule path for the article. If so, I accept that specimen as the match to the code.
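To make the majority-rule idea concrete, here is a minimal Python sketch. It is not BioStor's actual code: the function names, the root-first list representation of a path, the "strict majority of all names" rule, and the candidate records are all assumptions made for illustration.

```python
from collections import Counter


def majority_rule_path(paths):
    """Return the longest root-first prefix shared by more than half of all paths."""
    total = len(paths)
    remaining = list(paths)
    result = []
    depth = 0
    while True:
        # Count the taxa found at this depth among paths that still agree so far.
        counts = Counter(p[depth] for p in remaining if len(p) > depth)
        if not counts:
            break
        taxon, votes = counts.most_common(1)[0]
        if votes * 2 <= total:  # not a strict majority of all names
            break
        result.append(taxon)
        remaining = [p for p in remaining if len(p) > depth and p[depth] == taxon]
        depth += 1
    return result


def pick_record(candidates, path):
    """Return the single candidate whose class lies on the path, else None."""
    hits = [r for r in candidates if r.get("class") in path]
    return hits[0] if len(hits) == 1 else None


# Illustrative lineages (root first) for the names found in one article.
paths = [
    ["Animalia", "Chordata", "Aves", "Passeriformes"],
    ["Animalia", "Chordata", "Aves", "Piciformes"],
    ["Animalia", "Chordata", "Aves", "Passeriformes"],
    ["Animalia", "Chordata", "Mammalia", "Rodentia"],
]
article_path = majority_rule_path(paths)

# Hypothetical GBIF candidates that share the code "FMNH 147942".
candidates = [
    {"id": "candidate-bird", "class": "Aves"},
    {"id": "candidate-amphibian", "class": "Amphibia"},
    {"id": "candidate-mammal", "class": "Mammalia"},
]
match = pick_record(candidates, article_path)
print(article_path, match)
```

In this sketch the path stops at "Aves" because no order is supported by more than half of the names, and only the bird record is accepted. Requiring exactly one hit is a cautious choice of mine; if two candidates fall in the same class the code declines to pick one.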

There are also issues where the specimen codes in GBIF have been modified during input (e.g., USNM 730715 has become USNM 730715.457409). There are also the inevitable OCR errors that may cause museum codes to be missed or otherwise corrupted. Bearing all this in mind, BioStor now has specimen pages (these are still being generated as I write this). For example, the page for FMNH 147942 lists the three articles in BioStor that cite this specimen code:

[Screenshot: BioStor specimen page for FMNH 147942]

All three specimens have been mapped on to GBIF occurrence http://data.gbif.org/occurrences/61846037/. When BioStor displays the articles it now lists the specimen codes that have been extracted from the article, together with the GBIF logo if the specimen has been matched to a GBIF record. For example, here is a screenshot from Deep-water octopods (Mollusca: Cephalopoda) of the northeastern Pacific:
[Screenshot: BioStor page for the article, showing a map with the extracted specimen codes listed below it]

The map has been extracted from the OCR text (an obvious next step would be to add localities associated with the specimen records). Below the map are the specimen codes. The lack of some USNM specimens is probably due to misinterpreted specimen codes, whereas the CAS specimens don't seem to be online (the California Academy of Sciences has some of its collections in GBIF, but not its molluscs).

Where next?
Once these links between BioStor (and hence, BHL) and GBIF are created then we can do some interesting things. If you visit BioStor and want to learn more about a specimen you can click on the link and view the record in GBIF. We could also envisage doing the reverse. GBIF could augment the information it displays about a specimen by displaying a link to the content in BioStor (e.g., "this specimen is cited by these articles"). Those articles may contain further information about that specimen (for example, the habitat it was collected from, how secure its identification is, and so on).

We could also start to compute the "impact" of different museum collections based on the number of citations of specimens from their collections (this idea is explored further in this paper: http://dx.doi.org/10.1093/bib/bbn022, free preprint available here: hdl:10101/npre.2008.1760.1).

All of this works because we are linking objects (in this case articles and specimens) via their identifiers. Consequently, the links are as stable as their identifiers, which is why I've been pursuing the issue of specimen identifiers recently (see here, here, and here). If GBIF maintains the URLs for the specimens I've linked to, then links I've created could persist. If these URLs are likely to change (e.g., because the metadata from the host institution has changed) then the links (and any associated value we get from them) disappear. This is why I want globally unique, resolvable, persistent identifiers for specimens.




How many specimens does GBIF really have?

Duplicate records are the bane of any project that aggregates data from multiple sources. Mendeley, for example, has numerous copies of the same article, as documented by Duncan Hull (How many unique papers are there in Mendeley?). In their defence, Mendeley is aggregating data from lots of personal reference libraries and hence they will often encounter the same article with slightly differing metadata (we all have our own quirks when we store bibliographic details of papers). It's a challenging problem to identify and merge records which are not identical, but which are clearly the same thing.

What I'm finding rather more alarming is that GBIF has duplicate records for the same specimen from the same data provider. For example, the specimen USNM 547844 is present twice:

As far as I can tell this is the same specimen, but the catalogue numbers differ (547844 versus 547844.6544573). Apart from this the only difference is when the two records were indexed. The source for 547844 was last indexed August 9, 2009, the source for 547844.6544573 was first indexed August 22, 2010. So it would appear that some time between these two dates the US National Museum of Natural History (NMNH) changed the catalogue codes (by appending another number), so GBIF has treated them as two distinct specimens. Browsing other GBIF records from the NMNH shows the same pattern. I've not quantified the extent of this problem, but it's probably a safe bet that every NMNH herp specimen occurs twice in GBIF.
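One way to get a rough handle on the extent of the problem is to group records by institution and by catalogue number with any appended numeric suffix stripped. The Python sketch below assumes that a trailing ".digits" on an all-digit catalogue number is always such a suffix, which will be wrong for collections that use decimal numbers legitimately. The field names and occurrence ids are hypothetical, not GBIF's.

```python
import re
from collections import defaultdict

# Assumption: "<digits>.<digits>" means a base number plus an appended suffix.
APPENDED_SUFFIX = re.compile(r"^(\d+)\.\d+$")


def base_catalogue_number(catalogue_number):
    """Strip an appended numeric suffix, e.g. '547844.6544573' -> '547844'."""
    value = catalogue_number.strip()
    found = APPENDED_SUFFIX.match(value)
    return found.group(1) if found else value


def probable_duplicates(records):
    """Map (institution, base number) to occurrence ids when there are several."""
    groups = defaultdict(list)
    for record in records:
        key = (record["institution"], base_catalogue_number(record["catalogue_number"]))
        groups[key].append(record["occurrence_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records; only the catalogue numbers come from the text above.
records = [
    {"occurrence_id": "a", "institution": "USNM", "catalogue_number": "547844"},
    {"occurrence_id": "b", "institution": "USNM", "catalogue_number": "547844.6544573"},
    {"occurrence_id": "c", "institution": "USNM", "catalogue_number": "730715.457409"},
]
print(probable_duplicates(records))
```

Because the same number can be used by different collections within one institution, anything flagged this way is only a candidate duplicate and would need a further check (for example on taxon or collection) before merging.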

Then there are the records from Harvard's Museum of Comparative Zoology that are duplicates, such as http://data.gbif.org/occurrences/33400333/ and http://data.gbif.org/occurrences/328478233/ (both for specimen MCZ A-4092, in this case the collectionCode is either "Herp" or "HERPAMPH"). These are records that have been loaded at different times, and because the metadata has changed GBIF hasn't recognised that these are the same thing.

At the root of this problem is the lack of globally unique identifiers for specimens, or even identifiers that are unique and stable within a dataset. The Darwin Core wiki lists a field for occurrenceID for which it states:

The occurrenceID is supposed to (globally) uniquely identify an occurrence record, whether it is a specimen-based occurrence, a one-time observation of a species at a location, or one of many occurrences of an individual who is being tracked, monitored, or recaptured. Making it globally unique is quite a trick, one for which we don't really have good solutions in place yet, but one which ontologists insist is essential.

Well, now we see the side effect of not tackling this problem - our flagship aggregator of biodiversity data has duplicate records. Note that this has nothing to do with "ontologists" (whatever they are), it's simple data management. Assign a unique id (a primary key in a database will do fine) that can be used to track the identity of an object even as its metadata changes. Otherwise you are reduced to matching based on metadata, and if that is changeable then you have a problem.
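The difference between the two approaches fits in a few lines of Python. This is only a sketch: `local_id` is a hypothetical provider-assigned key (exactly the thing that is missing in practice), and `build_index` is my own stand-in for whatever an aggregator does when it re-harvests a dataset.

```python
def build_index(harvested, key):
    """Index records by key; a record with a known key updates the existing entry."""
    index = {}
    for record in harvested:
        index[key(record)] = record
    return index


# The same specimen harvested twice, before and after its catalogue number changed.
harvested = [
    {"local_id": "hypothetical-1", "institution": "USNM", "catalogue_number": "547844"},
    {"local_id": "hypothetical-1", "institution": "USNM", "catalogue_number": "547844.6544573"},
]

# Matching on metadata: the changed catalogue number looks like a new specimen.
by_metadata = build_index(harvested, key=lambda r: (r["institution"], r["catalogue_number"]))

# Matching on a stable identifier: the second harvest updates the first record.
by_identifier = build_index(harvested, key=lambda r: r["local_id"])

print(len(by_metadata), len(by_identifier))  # prints: 2 1
```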

Now, just imagine the potential chaos if we start changing institution and collection codes to conform to the Darwin Core triplet. In the absence of unique identifiers (again, these can be local to the data set) GBIF is going to be faced with a massive data reconciliation task to try and match old and new specimen records.

The other problem, of course, is that my plan to use GBIF occurrence URLs as globally unique identifiers for specimens is looking pretty shaky because they are not unique (the same specimen can have more than one) and if GBIF cleans up the duplicates a number of these URLs will disappear. Bugger.



Clustering strings

Revisiting an old idea (Clustering taxonomic names) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site.

This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters. For example, given the names


Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821


the service finds three clusters, displayed here using Google images:



(Note to self, investigate canviz as an alternative for displaying graphviz graphs.)

If you are curious, these strings are taxonomic authorities associated with the name Helicella, and based on this clustering there are three taxonomic names, one of which has three different variations of the author's name.
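For readers who want to experiment offline, here is a minimal Python sketch of one way to cluster similar strings: link any two strings within a small edit distance and take the connected components of the resulting graph. This is an assumed technique, not necessarily what the service does; the threshold of 2 and the function names are mine.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def cluster_strings(strings, max_distance=2):
    """Group strings into connected components of the 'within max_distance' graph."""
    clusters = []
    unvisited = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    while unvisited:
        seed = unvisited.pop(0)
        cluster, queue = [seed], [seed]
        while queue:
            current = queue.pop()
            near = [s for s in unvisited if edit_distance(current, s) <= max_distance]
            for s in near:
                unvisited.remove(s)
            cluster.extend(near)
            queue.extend(near)
        clusters.append(sorted(cluster))
    return clusters


names = ["Ferrusac 1821", "Bonavita 1965", "Ferussa 1821",
         "Fer.", "Lamarck 1812", "Ferussac 1821"]
for cluster in cluster_strings(names):
    print(cluster)
```

With these settings the three spellings of Ferussac end up together, but the abbreviation "Fer." stays in a cluster of its own because it is too short to be within two edits of anything else, so the output of this sketch need not agree with the service.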