BioNames update - API documentation

One of the fun things about developing web sites is learning new tricks, tools, and techniques. Typically I hack away on my MacBook, and when something seems vaguely usable I stick it on a web server. For BioNames things need to be a little more formalised, especially as I'm collaborating with another developer (Ryan Schenk). Ryan is focussing on the front end, I'm working on the data (harvesting, cleaning, storing).

In most projects I've worked on, the code that talks to the database and the code that displays results have been the same; it was ugly but it got things done. For this project these two aspects have to be much more cleanly separated so that Ryan and I can work independently. One way to do this is to have a well-defined API that Ryan can develop against. This means I can hide the sometimes messy details of how to communicate with the data, and Ryan doesn't need to worry about how to get access to the data.

Nice idea, but to be workable it requires that the API is documented (if it's just me then the documentation is in my head). Documentation is a pain, and it is easy for it to get out of sync with the code such that what the docs say an API does and what it actually does are two separate things (sound familiar?). What would be great is a tool that enables you to write the API documentation, and make that documentation "live" so that the actual API output can be tested against it. In other words, a tool like apiary.io.

Apiary.io is free, very slick, and comes with GitHub integration. I've started to document the BioNames API at http://docs.bionames.apiary.io/. These documents are "live" in that you can try out the API and get live results from the BioNames database.

I'm sure this is all old news to real software developers (as opposed to people like me who know just enough to get themselves into trouble), but it's quite liberating to start with the API first before worrying about what the web site will look like.

New look Biodiversity Heritage Library launched


The new look Biodiversity Heritage Library has just launched. It's a complete refresh of the old site, based on the Biodiversity Heritage Library–Australia site. If you want an overview of what's new, BHL have published a guide to the new look site. Congrats to everyone involved in the relaunch.

One of the new features draws on the work I've been doing on BioStor. The new BHL interface adds the notion of "parts" of an item, which you can see under the "Table of Contents" tab. For example, the scanned volume 109 of the Proceedings of the Entomological Society of Washington now displays a list of articles within that volume:

[Screenshot: the new BHL interface listing the articles within the volume]

This means you can now jump to individual articles. Before you had to scroll through the scan, or click through page numbers until you found what you were after. The screenshot above shows the article "Three new species of chewing lice (Phthiraptera: Ischnocera: Philopteridae) from australian parrots (Psittaciformes: Psittacidae)". The details of this article have been extracted from BioStor, where this article appears as http://biostor.org/reference/55323. You can go directly to this article in BHL using the link http://www.biodiversitylibrary.org/part/69723. As an aside, I've chosen this article because it helps demonstrate that BHL has modern content as well as pre-1923 literature, and this article names a louse, Neopsittaconirmus vincesmithi, after a former student of mine, Vince Smith. You're nobody in this field unless you've had a louse named after you ;)

BioStor has over 90,000 articles, but this is a tiny fraction of the articles contained in BHL content, so there's a long way to go until the entire archive is indexed to article level. There will also be errors in the article metadata derived from BioStor. If we invoke Linus's Law ("given enough eyeballs, all bugs are shallow") then having this content in BHL should help expose those errors more rapidly.

As always, I have a few niggles about the site, but I'll save those for another time. For now, I'm happy to celebrate an extraordinary, open access archive of over 40 million pages. BHL represents one of the few truly indispensable biodiversity resources online.

BioNames ideas - automatically finding synonyms from the literature

One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence is that any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.

Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms.

Reading literature extracted from the Biodiversity Heritage Library I'm struck by how often I come across papers such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms?

One quick and dirty way to do this is look for sets of names that have the same species name but different generic names, e.g.

  • Atlantoxerus getulus
  • Sciurus getulus
  • Xerus getulus

If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.
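
As a rough illustration of the idea above (not the actual BioNames code), here is a minimal Python sketch. It assumes we already have the list of names found on a single page; the function name `candidate_synonyms` and the sample page are made up for the example.

```python
from collections import defaultdict


def candidate_synonyms(names_on_page):
    """Group binomials from one page by species name.

    Returns only the groups where the same species name appears
    with more than one genus, i.e. the possible synonyms.
    """
    genera_by_species = defaultdict(set)
    for name in names_on_page:
        parts = name.split()
        if len(parts) != 2:
            continue  # only handle simple "Genus species" names here
        genus, species = parts
        genera_by_species[species.lower()].add(genus)

    return {
        species: sorted(f"{genus} {species}" for genus in genera)
        for species, genera in genera_by_species.items()
        if len(genera) > 1
    }


# Hypothetical page: three names share "getulus", one does not.
page = [
    "Atlantoxerus getulus",
    "Sciurus getulus",
    "Xerus getulus",
    "Sciurus vulgaris",
]
groups = candidate_synonyms(page)
print(groups)
```

Grouping by page is the crude stand-in for "close proximity"; anything smarter (same paragraph, same synonymy list) would need more than a flat list of names.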

There's a lot more that could be done with this sort of approach. For example, this approach only works if the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species).
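
One very crude way to cope with gender endings would be to compare species names after stripping a few common Latin adjective endings. The sketch below is only an assumption about how this might be approached: the list of endings is deliberately short, the function name `species_stem` is invented, and it will both miss cases (e.g. niger/nigra) and occasionally merge names that are not related.

```python
# A few common Latin adjectival endings (masculine, neuter, feminine forms).
# Deliberately incomplete: this is an illustration, not a Latin grammar.
GENDER_ENDINGS = ("us", "um", "is", "a", "e")


def species_stem(species):
    """Strip a common gender ending so that variant spellings compare equal."""
    species = species.lower()
    for ending in GENDER_ENDINGS:
        # Require at least three letters to remain, to avoid mangling short names.
        if species.endswith(ending) and len(species) > len(ending) + 2:
            return species[: -len(ending)]
    return species


# The standard adjective forms albus / alba / album all reduce to one stem.
stems = {species_stem(s) for s in ("albus", "alba", "album")}
print(stems)
```

Names would then be grouped by stem rather than by the exact species name, at the cost of a few more false positives to weed out.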

If we were even cleverer we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump, PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that list taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature. Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.
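
To make the "names that co-occur frequently" idea concrete, here is a small Python sketch that counts how many pages each pair of names shares. Everything in it is an assumption for illustration: the input is a list of per-page name lists, the names are invented placeholders, and the `max_names_per_page` cut-off is just one guess at how to skip index-like pages.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(pages, max_names_per_page=50):
    """Count, for each pair of names, the number of pages on which both occur."""
    counts = Counter()
    for names in pages:
        unique = sorted(set(names))
        if len(unique) > max_names_per_page:
            # Pages with very many names are probably indexes or
            # tables of contents, so treat them as "accidental".
            continue
        counts.update(combinations(unique, 2))
    return counts


# Placeholder names, not real taxa.
pages = [
    ["Hostus alpha", "Parasitus beta"],
    ["Hostus alpha", "Parasitus beta", "Otherus gamma"],
    ["Otherus gamma"],
]
counts = cooccurrence_counts(pages)
print(counts.most_common(1))
```

Pairs at the top of the ranking would be the ones worth showing first; whether a pair is host and parasite, or synonyms, or something else entirely would still need a human (or more context) to decide.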