
Replicating and forking data in 2010: Catalogue of Life and CouchDB

Time (just) for a Friday folly. A couple of days ago the latest edition of the Catalogue of Life (CoL) arrived in my mailbox in the form of a DVD and booklet:

photo.JPG
While in some ways it's wonderful that the Catalogue of Life provides a complete data dump of its contents, this strikes me as a rather old-fashioned way to distribute it. So I began to wonder how this could be done differently, and started to think of CouchDB. In particular, I began to think of being able to upload the data to a service (such as Cloudant) where the data could be stored and replicated at scale. Then I began to think about forking the data. The Catalogue of Life has some good things going for it (some 1.25 million species, and around 2 million names), and is widely used as the backbone of sites such as EOL, GBIF, and iNaturalist.org, but parts of it are broken. Literature citations are often incomplete or mangled, and in places it is horribly out of date.

Rather than wait for the Catalogue of Life to fix this, what if we could share the data, annotate it, correct mistakes, and add links? In particular, what if we link the literature to records in the Biodiversity Heritage Library so that we can finally start to connect names to the primary literature (imagine clicking on a name and being able to see the original species description)? We could have something akin to GitHub, but instead of downloading and forking code, we download and fork data. CouchDB makes replicating data pretty straightforward.

So, I've started to upload some Catalogue of Life records to a CouchDB instance at Cloudant, and write a simple web site to display these records. For example, you can see one such record at http://iphylo.org/~rpage/col/?id=e9fda47629c1102b9a4a00304854f820:

croc.png
The e9fda47629c1102b9a4a00304854f820 in this URL is the UUID of the record in CouchDB, which is also the UUID embedded in the (non-functional) CoL LSIDs. This ensures the records have a unique identifier, but also one that is related to the original record. You can search for names, or browse the immediate hierarchy around a name. I hope to add more records over time as I explore this further — at the moment I've added a few lizards, wasps, and conifers while I explore how to convert the CoL records into a sensible JSON object to upload to CouchDB.
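As a rough illustration of what "a sensible JSON object" might look like, here is a minimal Python sketch that reuses the CoL UUID as the CouchDB document `_id`. The input field names (`uuid`, `name`, `rank`, `parent_uuid`) and the example name are assumptions made up for this sketch, not the actual CoL schema or the structure used on the site.

```python
import json


def col_record_to_doc(record):
    """Turn a flat, CoL-style record (a dict) into a CouchDB-ready document.

    The keys read from `record` are hypothetical; adjust them to whatever
    the real CoL export provides.
    """
    doc = {
        # Reuse the CoL UUID as the CouchDB id so the document stays
        # tied to the original record.
        "_id": record["uuid"],
        "name": record["name"],
        "rank": record.get("rank"),
        # Link to the parent taxon so the hierarchy can be browsed.
        "parent": record.get("parent_uuid"),
    }
    # Leave out empty fields rather than storing nulls.
    return {key: value for key, value in doc.items() if value is not None}


example = col_record_to_doc({
    "uuid": "e9fda47629c1102b9a4a00304854f820",
    "name": "Example name",  # placeholder, not the real name for this UUID
    "rank": "species",
    "parent_uuid": None,
})
print(json.dumps(example, indent=2))
```

Keeping the UUID as `_id` means the same record can be fetched by the identifier that appears in the URL above, without a separate lookup.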

The next step is to think about this as a way to distribute data (want a copy of CoL? Just point your CouchDB at the Cloudant URL and replicate it), and to think about how to build upon the basic records, editing and improving them, and then how to get that information into a future version of the Catalogue.
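To make "point your CouchDB at the Cloudant URL and replicate it" concrete, here is a minimal Python sketch that builds a request to CouchDB's `_replicate` endpoint. The Cloudant URL and the database name `col` are placeholders, not the real location of the data, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_replication_request(local_couch, source_url, target_db):
    """Build a POST to the local CouchDB asking it to pull a remote database."""
    body = {
        "source": source_url,    # remote database to copy from
        "target": target_db,     # local database to copy into
        "create_target": True,   # create the local database if it is missing
    }
    return urllib.request.Request(
        local_couch.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Both URLs are hypothetical placeholders.
req = build_replication_request(
    "http://localhost:5984",
    "https://example.cloudant.com/col",
    "col",
)
print(req.get_method(), req.full_url)
# To actually start the replication: urllib.request.urlopen(req)
```

Swapping the source and target would push local edits back the other way, which is the mechanism a fork-and-merge workflow for data would lean on.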

Mendeley Connect

When I first launched BioStor (an article finding tool built on top of the Biodiversity Heritage Library) I wanted people to be able to edit metadata and add references, but also minimise the chances that junk would get added. As a quick and dirty deterrent I used reCAPTCHA, so anybody adding a reference or editing the metadata had to pass a CAPTCHA before their edits were accepted.

While reCAPTCHA does the trick, it can be tedious for somebody editing a lot of articles to have to pass a CAPTCHA every time they edit an article. Ed Baker of the International Commission on Zoological Nomenclature (ICZN) has a project to identify all the articles in the Bulletin of Zoological Nomenclature, and has been gently bugging me to add a login feature to BioStor. I played for a while with OpenID, but it occurred to me that Mendeley might be a more sensible strategy. Mendeley's API supports OAuth, a protocol where you can grant an application access to another application, but without giving away any passwords. It's used by Twitter and Facebook, among others. Indeed, a growing number of sites on the web are using Twitter and/or Facebook services to enable users to log in, rather than write their own code to support login, usernames, passwords, etc.

In the case of BioStor, I've added a link to sign in via Mendeley. If you click on it you get taken to a page like this:

connect.png
If you're happy for BioStor to connect to Mendeley, you click on Accept and BioStor won't bug you to fill in a CAPTCHA. Once Mendeley's API matures it would be nice to add features such as the ability to add a reference in BioStor straight to your Mendeley library (this is doable now, but the Mendeley API loses some key metadata such as page numbers).

facebook-connect.jpg
But, thinking more broadly, Mendeley has an opportunity here to provide services similar to Facebook Connect. For example, instead of simply having buttons on web pages to bookmark papers, we could have buttons indicating how many people had added a paper to their library, and whether any of those people were in your contacts. We could extend this further and create something like Facebook's Open Graph Protocol, which supports the "Like" button. Or perhaps we could have an app that integrates with Facebook and harvests your "Likes" that are papers.

Food for thought. Meantime, I hope users like Ed will find BioStor less tedious to use now that they can log in via Mendeley.

GeoCouch

@mikeal a little tedious. you can take OSM and then convert it to SHP and then http://github.com/maxogden/shp2geocouch



The tweet above inspired me to take a quick look at GeoCouch, a version of CouchDB that supports spatial queries. This is something I need if I'm going to start playing seriously with CouchDB. So, it was off to "Installing and working with GeoCouch", grabbing a copy of Homebrew (yet another package manager for Mac OS X), in the hope of installing GeoCouch. Things went fairly smoothly, although it took what seemed like an age to build everything. But I now have GeoCouch running. Previously I'd been running CouchDB using http://janl.github.com/couchdbx/, which launches vanilla CouchDB. However, if you launch CouchDBX after starting GeoCouch from the command line, CouchDBX is talking to GeoCouch.

I then grabbed shp2geocouch to try some shape files (I grabbed some shape files from the IUCN to play with). If you're on a Mac, grab GISLook to get Quick Look previews of these files. Since I'm new to Ruby there were a couple of gotchas, such as lacking some prerequisites (httparty and couchrest, both installed by typing `gem install <name of package>`), and there was the small matter of needing to add ~/.gem/ruby/1.8/bin to my path so I could find shp2geocouch (spot the Ruby neophyte). The shape file didn't get processed completely, but at least I managed to get some data into GeoCouch.

gis.png
So far I've been playing with the examples at http://github.com/vmx/couchdb, and things seem to work. At least, the basic bounding box queries work. I'm tempted to play with this some more (and get my head around GeoJSON), perhaps trying to recreate the functionality of my Elsevier Challenge entry, for which I wrote a custom key-value database that was awfully clunky.
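For anyone wondering what a bounding box query involves, here is a minimal Python sketch. It builds a query URL in the `_spatial/<function>?bbox=west,south,east,north` style that I understand the GeoCouch examples to use (treat the exact URL layout as an assumption), and it mimics the test locally for GeoJSON points. The database name `places`, the design document `main`, and the spatial function `points` are placeholders.

```python
from urllib.parse import urlencode


def bbox_query_url(couch, db, design, spatial_fn, bbox):
    """Build a GeoCouch-style bounding box query URL (layout is an assumption)."""
    west, south, east, north = bbox
    # Keep the commas readable instead of percent-encoding them.
    query = urlencode({"bbox": f"{west},{south},{east},{north}"}, safe=",")
    return f"{couch.rstrip('/')}/{db}/_design/{design}/_spatial/{spatial_fn}?{query}"


def in_bbox(geometry, bbox):
    """Return True if a GeoJSON Point lies inside (west, south, east, north)."""
    if geometry.get("type") != "Point":
        return False  # this sketch only handles points
    lon, lat = geometry["coordinates"]  # GeoJSON order is longitude, latitude
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


url = bbox_query_url("http://localhost:5984", "places", "main", "points", (0, 0, 180, 90))
print(url)

glasgow = {"type": "Point", "coordinates": [-4.25, 55.86]}
print(in_bbox(glasgow, (-10, 50, 2, 60)))
```

Note that GeoJSON puts longitude first, which is the opposite of the "lat, long" order many people type by habit.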