Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

Hyla

So, we have a wiki page. How do we answer Rutger's original question: how to get GBIF occurrence records via a web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or by going to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for an NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
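Here is a minimal sketch of that extraction in Python, using only the standard library. It parses an RDF/XML document, picks out the rdfs:seeAlso links that point at data.gbif.org, pulls the numeric id out of each with a regular expression, and builds the density service URL quoted above. The sample document is a stripped-down stand-in for the real export (which is much larger and may be structured differently), and the function names gbif_ids_from_rdf and density_url are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard RDF and RDFS namespace URIs.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

# Matches GBIF species URLs of the form shown in the post.
GBIF_SPECIES = re.compile(r"^http://data\.gbif\.org/species/(\d+)/?$")


def gbif_ids_from_rdf(rdf_xml):
    """Return the GBIF taxon ids found in rdfs:seeAlso statements."""
    root = ET.fromstring(rdf_xml)
    ids = []
    # iter() searches at any depth, so the enclosing elements do not matter.
    for node in root.iter("{%s}seeAlso" % RDFS_NS):
        target = node.get("{%s}resource" % RDF_NS, "")
        match = GBIF_SPECIES.match(target)
        if match:
            ids.append(match.group(1))
    return ids


def density_url(gbif_id):
    """Build the occurrence density URL for a GBIF taxon id."""
    return ("http://data.gbif.org/ws/rest/density/list?taxonconceptkey="
            + gbif_id)


# Assumed, simplified sample; only the two seeAlso targets come from the post.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://iphylo.org/linkout/Ncbi:109175">
    <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Japanese_Tree_Frog"/>
    <rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    for gbif_id in gbif_ids_from_rdf(SAMPLE):
        print(density_url(gbif_id))
```

To run this against the live wiki you would fetch the Special:ExportRDF URL first and pass the response body to gbif_ids_from_rdf; the fetch is left out here so the sketch runs without a network connection.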

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Can you trust EOL?

There's a recent thread on the Encyclopedia of Life concerning erroneous images for the crab Leptograpsus. This is a crab I used to chase around rocks on stormy west-coast beaches near Auckland, so I was a little surprised to see the EOL page for Leptograpsus looks like this:

Leptograpsus

The name and classification are those of the crab, but the image is of a fish (Lethrinus variegatus). Perhaps at some point in aggregating the images the two taxa, which share the abbreviated name "L. variegatus", got mixed up.

Now, errors like this are bound to happen in a project the size of EOL, and EOL has some pretty active efforts to clean up errors (e.g., the Homonym Hunters). But what bothers me about this example is the prominent label Trusted that appears below the image. If I look at all the images for Leptograpsus on EOL, I see "trusted" images for fish. All images of the crab (i.e., the real Leptograpsus) are labelled "unreviewed" and implicitly "untrusted":

Leptograpsus2

If you are going to claim something is "trusted" you need to be very careful. The images of the fish may well come from a trusted source (FishBase), and FishBase's assertion that the image is of Lethrinus variegatus may well be "trusted", but I certainly can't trust the assertion made by EOL that this image depicts a crab.

In this example the error is easy to spot (if you know that crabs and fish are different), but what if the error was more subtle? Or what if you are using EOL's API and explicitly asking for only content you can trust? Then you get the fish images (see https://gist.github.com/2850321).

If I can't trust "trusted" then EOL has a problem. One way forward is to unpack the notion of "trust" and make sure the user knows what "trusted" means. In this case there are at least two assertions being made:
  1. This image is of a fish (made by FishBase)
  2. This image is of a crab (made by EOL)

EOL needs to make clear what assertions are being made, and which ones it is stating can be "trusted". Ideally it also needs to move away from blanket assertions of "trusted" versus not trusted, because that's far too coarse (just because FishBase knows about fish doesn't mean I'm going to place equal trust in the identification of every image it contains). Trust is something that is conferred by users and acquired over time, not something to be simply asserted.
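One way to picture this unpacking is to record each assertion together with who made it, rather than hanging a single "trusted" flag on the image. The sketch below is purely illustrative: the Assertion class, its field names, and the claims_by helper are hypothetical and have nothing to do with EOL's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A single claim about an object, kept together with its source."""
    subject: str      # what the claim is about
    claim: str        # what is being claimed
    asserted_by: str  # who is making the claim


# The two assertions from the Leptograpsus example.
assertions = [
    Assertion("image", "is of a fish", "FishBase"),
    Assertion("image", "is of a crab", "EOL"),
]


def claims_by(source, items):
    """Return the claims made by one source."""
    return [a.claim for a in items if a.asserted_by == source]


if __name__ == "__main__":
    for a in assertions:
        print(f"{a.asserted_by} asserts: this {a.subject} {a.claim}")
```

With the assertions kept separate, a user (or an API client) could decide to trust FishBase's identification of the fish without also accepting EOL's placement of that image on the crab's page.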

The GBIF classification is broken — how do we fix it?

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help link to GBIF occurrence records.

The GBIF backbone classification ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.

On the page for the article "Further additions to the known marine Molluscan fauna of St. Helena" (http://biostor.org/reference/88554, see also http://dx.doi.org/10.1080/00222939208677383) part of the classification looks like this:

└Animalia
 └Annelida
  └Polychaeta
   └Sabellida
    └Serpulidae
     └Hipponyx
Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification appears in the annelid worms.

Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of Hipponyx and Hipponix under the genus Hipponix, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, Hipponix credited to DeFrance 1819 Journ. de Physique, 88, 217, and Hipponyx credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (Mémoire sur un nouveau genre de mollusque) and Bull. Sci. Soc. philom. Paris, 8. (Sur un nouveau genre de coquilles (Hipponix)). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling Hipponix (no 'y'). I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).

OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem to be in use (see Google NGrams chart below).



So, bit of a mess, but this still doesn't deal with Hipponyx being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a specimen record from Te Papa labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:

Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. doi:10.1080/00288306.1971.10426332

with the following abstract:
An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.

The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.

The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.

The author of the paper (Charles Fleming) argues that Hipponyx inexpectata, regarded as a mollusc by its describer (Marjorie K. Mestayer, see Notes on New Zealand Mollusca. No. 4.) is actually a worm, and he moves it to the genus Temporaria.

So it seems that the reason Hipponyx has ended up being a worm in the GBIF classification is due to this synonymy.

Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, and some may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so that the error gets fixed (and so that people don't think Hipponyx is a worm). As has been said several times before, most eloquently by David Shorthouse, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.