Search this keyword

GBIF and Linked Data

At the end of day two of the GBIF LSID-GUID Task Group I put together this crude diagram to summarise some of the possible links between biodiversity data and the larger linked data cloud, which I, among others, have argued is where biodiversity informatics should be heading. Here's my hastily put together diagram (created using the wonderful OmniGraffle):
Links.jpg


I've put GBIF at the centre since we're at GBIF, and it's them we are trying to convince. Yellow circles are biodiversity data sources (which aren't linked data providers (but some can me made so using, for example, my LSID proxy resolver), white circles are linked data sources.

The "sales pitch"is that if we join the linked data cloud we open up the possibility of some very powerful queries, especially once that are outside the relatively narrow scope of what GBIF and TDWG concern themselves with. Imagine being able to query biodiversity data with respect to population and economic data across countries. These are the sort of things we could realistically aim for.

On a practical level, it also means biodiversity database could devolve a lot of their tasks to other databases (via reusing identifiers). Some taxonomists have DBPedia URIs, and more could be added to Wikipedia (and so will find there way into DBPedia). Geonames provides geographic URIs which we could reuse, and so on. Within our own community we could do a better job of reusing our own identifiers, and reusing external ones (such as taxa in Wikipedia).

It's late, this is a rushed diagram, and I don't know if it's going to end up in whatever report we manage to assemble tomorrow (our final day). But I hope it captures some of the scope of what we're looking at. I know there are some problems (as have been pointed out to me on Twitter), I'll try and deal with these tomorrow.

Wikispecies RSS feed

Following on from my previous post about Wikispecies (which generated some discussion on TAXACOM) I've played some more with Wikispecies.

AS a first step I've added a Wikispecies RSS feed to my list of RSS feeds. This feed takes the original Wikispecies RSS feed for new pages (generated by the page Special:NewPages) and tries to extract some details before reformatting it as an ATOM feed. Specifically, I extract GUIDs such as IPNI and Index Fungorum identifiers, bibliographic references (which I will later parse to try and extract identifiers such as DOIs), and latitude and longitude if the Wikispecies page has type locality information. Having the later means that the RSS feed can be displayed as a map (Google Maps can take a RSS feed with geotagged items and display it on a map for you).

The map below is live, so it will show any geotagged items in the current Wikispecies feed.


View Larger Map


Wikispecies is not a database


This post was prompted by Stephen Thorpe's post on TAXACOM about Wikispecies in which he wrote (in a thread discussing Roger Hyam's recent blog post) that
[i]f it [Wikispecies] isn't a true database, then it is BETTER than a database. It can do anything a database can do, and more, if you know how it works properly.
I beg to differ. Wikispecies runs on a database (the Mediawiki software uses a database to store the wiki), and Mediawiki can be thought of as a database of semi-structured text, but it lacks a lot of the functionality database users would expect. For example, in Wikispecies there's no way to perform basic queries such as how many descendants a given taxon has, what names a particular author has published, or to find out in which geographic region most new names are being described from. Much of this information is in Wikispecies, it just isn't in a form that we can usefully use.

These limitations are mostly due to the underlying software (Mediawiki), which fortunately can be extended to address these issues using Semantic Mediawiki. I've explored these ideas earlier. With some restructuring, Wikispecies could become a database, but it would require some serious work.

But this raises the real issue with Wikispecies, namely what is it for? Wikipedia is much more informative for many taxa, and the two wikis are very poorly linked (surely we'd want Wikipedia pages linked to the corresponding Wikispecies pages?). Given that Wikipedia is the basis for some core efforts in linked data (e.g., DBPedia), it seems a no brainer that we would want our information stored in Wikipedia, rather than Wikispecies.

It seems to me that the split between Wikipedia and Wikispecies parallels that between "taxonomic concepts" and "taxonomic names". Wikipedia provides the former, in that it provides one (consensus) view of what a taxon is. Wikispecies would be ideally placed to be a nomenclatural database (and a great place to put all the synonyms that we've accumulated over time, but which would swamp Wikipedia). But Wikispecies seems also to want to provide a classification as well, which strikes me as unnecessary (and raises the issue of how this relates to the classification in Wikipedia).

I don't wish to denigrate the efforts of Wikispecies contributors (they are doing some neat things, such as harvesting new names from Zookeys), and by clever use of templates they avoid some of the serious problems with classification in Wikipedia, but it's not a taxonomic database, at least, not yet.