Wikipedia taxonomy, the good, the bad, and the very ugly

In the previous post I suggested that a productive way to meet EOL's goal of a web page per taxon would be to build upon Wikipedia, rather than go it alone. In a nutshell the arguments were:

  1. Wikipedia has considerable traction and has some richly populated taxon pages

  2. The linked data community uses DBpedia.org as a core source of URIs for entities, and as DBpedia is derived from Wikipedia, the latter will be the core source of identifiers for taxa

To explore this a little further I grabbed two files from the 20090618 Wikipedia dump, namely page.sql and templatelinks.sql, and extracted page ids and titles for Wikipedia pages containing the Taxobox template. I then queried Wikipedia for the source for each of these pages, and tried to extract the taxonomic information from each page (a tedious and error-prone process at best).
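As a rough illustration of that first step, here is a minimal sketch of the join between the two tables, using Python's sqlite3 with a few toy rows rather than the real dump. It assumes the MediaWiki column names page_id, page_namespace and page_title in page, and tl_from, tl_namespace and tl_title in templatelinks, with namespace 10 for templates and 0 for ordinary articles. The function name, the sample rows, and the namespace filter are made up for the example, and this is not necessarily how the extraction was actually done.

```python
import sqlite3

# Toy stand-ins for the two dump tables; only the columns the join needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE templatelinks (tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT);
INSERT INTO page VALUES
    (1, 0, 'Amphibia'), (2, 0, 'Dinosaur'), (3, 0, 'London'), (4, 2, 'Example_user/Sandbox');
INSERT INTO templatelinks VALUES
    (1, 10, 'Taxobox'), (2, 10, 'Taxobox'), (3, 10, 'Infobox_settlement'), (4, 10, 'Taxobox');
""")


def pages_using_template(conn, template, namespace=0):
    """Return (page_id, page_title) for pages in `namespace` that link to `template`."""
    return conn.execute(
        """
        SELECT p.page_id, p.page_title
        FROM page AS p
        JOIN templatelinks AS t ON t.tl_from = p.page_id
        WHERE t.tl_namespace = 10      -- assumed: 10 is the Template namespace
          AND t.tl_title = ?
          AND p.page_namespace = ?     -- assumed: 0 is the main article namespace
        ORDER BY p.page_id
        """,
        (template, namespace),
    ).fetchall()


taxon_pages = pages_using_template(conn, "Taxobox")
for page_id, title in taxon_pages:
    print(page_id, title)
```

Restricting the query to the article namespace is one way to leave out user pages that happen to include the template. It would not catch everything, so any count produced this way is still approximate.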

I've put together a shockingly crude web page where you can browse the results (warning, this page is a 10 minute hack with little error checking).

There is some good news. There are over 120,000 taxon pages (I've not got an exact figure because the Taxobox template occurs on some pages that aren't taxon pages, such as documentation and user pages). Some pages are extensive (the largest page is Dinosaur for which the source text is 128K in size), and there are lots of links to external references (I counted 7205 distinct DOIs to papers and/or books, and 3248 distinct ISBNs). This represents a degree of external linkage that puts EOL to shame.
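Counts like these can be approximated with a couple of regular expressions run over the page source. The sketch below assumes the identifiers appear as doi = ... and isbn = ... parameters inside citation templates, which will miss identifiers written in other ways. The patterns, the function name, and the sample text are illustrative assumptions, not the code that produced the figures above.

```python
import re

# Approximate patterns: a DOI starts with "10.", a registrant code, then a slash.
DOI_RE = re.compile(r"\bdoi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)
# ISBNs are 10 or 13 characters, optionally broken up by hyphens or spaces.
ISBN_RE = re.compile(r"\bisbn\s*=\s*([0-9Xx][0-9Xx\- ]{8,16}[0-9Xx])", re.IGNORECASE)


def distinct_identifiers(wikitext):
    """Return (set of DOIs, set of ISBNs) found in a block of page source."""
    # DOIs are case-insensitive, so lower-case them before de-duplicating.
    dois = {match.lower() for match in DOI_RE.findall(wikitext)}
    # Strip hyphens and spaces so differently formatted ISBNs compare equal.
    isbns = {re.sub(r"[\s-]", "", match).upper() for match in ISBN_RE.findall(wikitext)}
    return dois, isbns


# Made-up citations, purely to exercise the patterns.
sample = """{{cite journal | title = A | doi = 10.1000/abc123 | year = 2001}}
{{cite journal | title=B | doi=10.1000/ABC123}}
{{cite book | title = C | isbn = 0-12-345678-9 }}
{{cite book | title = D | isbn = 0123456789}}"""

dois, isbns = distinct_identifiers(sample)
print(len(dois), "distinct DOIs;", len(isbns), "distinct ISBNs")
```

Normalising before de-duplicating matters here, because the same paper or book is often cited with slightly different formatting on different pages.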

However, there are also some major problems. Firstly, Wikipedia does not have a single, internally consistent classification (i.e., the classification is not a tree). This is not unexpected, given that Wikipedia pages comprise semi-structured text that is (largely) manually entered. It's not a database. If it were, the simplest way to ensure consistency would be to have each child node include a pointer to its parent, and when we want a list of the children of the parent node we simply query the database ("what nodes have this node as their parent?"). Because Wikipedia isn't a database, authors have entered these two relationships ("has parent" and "has child") on different pages, and these often conflict. For a spectacular example of this, take a look at the page for Amphibia. When I scraped Wikipedia I extracted the "has parent" link, as this is the simplest way to create a tree. This results in over 200 child taxa for Amphibia, yet the Wikipedia page for Amphibia lists only four child taxa. What appears to be happening is that many fossil taxa are being added to Wikipedia, and since we are often hazy about where they go in the tree, authors are listing their parent taxon as (in this case) "Amphibia". Given this direct link, they should also be listed as children of Amphibia (although, of course, that would make a mess of the Amphibia page). Perhaps the solution is to add an "incertae sedis" taxon page for each taxon, and make that the parent of all the taxa that we aren't sure where to put. This would ensure consistency, but not make the current taxon pages unreadable.
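To make the parent-pointer idea concrete, here is a minimal sketch that derives child lists from "has parent" links alone and then compares them with the "has child" lists typed on the parent pages. All the taxon names other than Amphibia are placeholders, and the data are toy values, not the real contents of any Wikipedia page. The function names are likewise made up for the example.

```python
from collections import defaultdict

# Toy data with placeholder names, not scraped from Wikipedia.
# "has parent" links, one per child page:
parent_of = {
    "Group_A": "Amphibia",
    "Group_B": "Amphibia",
    "Fossil_genus_X": "Amphibia",
    "Fossil_genus_Y": "Amphibia",
    "Genus_Z": "Group_A",
}
# "has child" lists, as typed by hand on each parent page:
listed_children = {
    "Amphibia": {"Group_A", "Group_B"},
    "Group_A": {"Genus_Z", "Genus_W"},
}


def children_index(parent_of):
    """Answer 'what nodes have this node as their parent?' for every node."""
    index = defaultdict(set)
    for child, parent in parent_of.items():
        index[parent].add(child)
    return index


def find_conflicts(parent_of, listed_children):
    """Report taxa whose two kinds of link disagree."""
    derived = children_index(parent_of)
    conflicts = {}
    for taxon in sorted(set(derived) | set(listed_children)):
        from_parent_links = derived.get(taxon, set())
        from_child_list = listed_children.get(taxon, set())
        # Children that point at this taxon but are not listed on its page.
        unlisted = sorted(from_parent_links - from_child_list)
        # Children listed on the page that do not point back at this taxon.
        unconfirmed = sorted(from_child_list - from_parent_links)
        if unlisted or unconfirmed:
            conflicts[taxon] = {"unlisted": unlisted, "unconfirmed": unconfirmed}
    return conflicts


conflicts = find_conflicts(parent_of, listed_children)
for taxon, problem in conflicts.items():
    print(taxon, problem)
```

If only the parent pointers were stored, the child lists could never disagree with them, because they would always be computed rather than typed. A check along these lines is also the sort of thing a clean-up bot could run.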

Homonymy (the same name for different taxa) also rears its ugly head. For example, the page for the crab family Latreilliidae lists the genus Latreillia, but the link leads to a fly. In this case, the fly genus Latreillia Robineau-Desvoidy is a junior homonym of the crab genus Latreillia Roux (see http://biodiversitylibrary.org/page/12221111).

Finally, the page titles (which become the basis of DBpedia.org URIs) are a muddled mixture of common and scientific names.

So, what to do? Well, the idea of simply using Wikipedia as is isn't going to fly: it's too broken. We will have to contemplate a concerted effort to fix it (which will require using bots to clean up the inconsistencies). Another option (assuming that we like the Wiki-style environment) is to use a semantic wiki (see my earlier post), which constrains some of the possible markup, but retains a lot of the freedom that makes wikis so powerful.

This isn't an argument for not using Wikipedia as such; it's arguably still much more informative than, say, EOL. It's just that it's showing signs of the limitations of free-form text entry. The trick is to find a way to combine the obvious strengths of this approach (ease of creating and editing pages, massive community support) with the more structured approach needed to avoid the internal inconsistencies that currently bedevil Wikipedia.

EOL, Wikipedia, TDWG, LinkedData, and the Vision Thing

Time for more half-baked ideas. There's been a lot of discussion on Twitter about EOL, Linked Data (sometimes abbreviated LOD), and Wikipedia. Pete DeVries (@pjd) is keen on LOD, and has been asking why TDWG isn't playing in this space. I've been muttering dark thoughts about EOL, and singing the praises of Wikipedia. And so it goes on. So, here's one vision of where we could (?should) be going with this.

Let's imagine that we do indeed want to play in the Linked Data space. The concern that tends to be raised most often is that biodiversity informatics uses LSIDs as the standard GUID, and these don't play nicely with Linked Data. This is true, but not life-threatening. There are various hacks (like this and this) that deal with the problem.

But, the real concern (I think) is that we need a way to link our stuff to the rest of the Linked Data cloud. That is, wherever possible we need to reuse existing identifiers. In the LOD diagram below (for the latest version see here) DBpedia.org is key to linking much of this together, and major players (such as the BBC) are now using DBpedia.org to make connections.



DBpedia.org is based on Wikipedia, so I think you can see where this is going. There are some 120,000+ taxon pages in Wikipedia, so that's some 120,000+ identifiers in DBpedia.org that others interested in organisms can (and will) use to refer to taxa. Given the centrality of Wikipedia and DBpedia to LOD, why don't we adopt DBpedia.org URIs as the default GUID for our taxa? At present we have numerous, competing identifiers (e.g., NCBI tax ids, ITIS TSNs, Catalogue of Life LSIDs, uBio NameBankIDs, plus LSIDs from various nomenclators). For users this is a mess: which one do I use? Deciding requires dealing with issues (such as the difference between nomenclatural codes, and between taxonomic names and concepts, etc.) that, frankly, nobody outside our community cares about.

So, if we want to play with LOD, we need to make our identifiers play nice (straightforward), and we should think seriously about adopting DBpedia.org URIs as the default GUID for taxa.
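For a sense of what adopting these URIs would involve mechanically, here is a minimal sketch that turns a Wikipedia page title into a DBpedia.org resource URI. It assumes the usual pattern of http://dbpedia.org/resource/ followed by the page title with spaces replaced by underscores. The exact escaping rules for unusual characters are an assumption here, and the function name is made up for the example.

```python
from urllib.parse import quote

DBPEDIA_BASE = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_title):
    """Build a DBpedia resource URI from a Wikipedia page title (sketch only)."""
    title = wikipedia_title.strip().replace(" ", "_")
    if not title:
        raise ValueError("empty page title")
    # Wikipedia capitalises the first letter of page titles.
    title = title[0].upper() + title[1:]
    # Assumption: percent-encode anything outside letters, digits and "_.-~".
    return DBPEDIA_BASE + quote(title, safe="_")


for page_title in ["Amphibia", "common frog"]:
    print(dbpedia_uri(page_title))
```

Note that the URI inherits whatever the page happens to be called, so a taxon filed under a common name gets a common-name identifier.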

Now, where does this leave EOL? Well, frankly, it should get out of the business of making web pages for taxa, because Wikipedia owns that space already. Their pages are fewer, but often much more detailed than the corresponding EOL page, and Wikipedia reacts faster to new discoveries. Wikipedia supports community editing, versioning, and quite sophisticated tools for handling bibliographic references.

There's plenty of scope for useful tools and services for EOL to develop, but I think the real game is elsewhere. Now, Wikipedia is far from perfect. It's basically semi-structured text with a God-awful template language, and it would benefit greatly from more structure (e.g., as could be provided by Semantic Mediawiki), but I think we should think about building upon it. We could build our own (and my experiments over at itaxon.org explore this), but the big challenge is getting a community around a project, and if David Shorthouse's pronouncement that The Community is Dead is correct, then maybe we should get on board with the community that already exists. Perhaps what EOL should be doing is talking to Wikipedia, improving the existing templates for taxon pages, and creating bots to automatically populate Wikipedia with more taxon pages.

Visualising taxonomic classifications using SpaceTrees

The problem of displaying large taxonomic classifications on a web page continues to be an on-again, off-again obsession. My latest experiment makes use of Nicolas Garcia Belmonte's wonderful JavaScript Infovis Toolkit (JIT), which provides implementations of classic visualisations such as treemaps, hyperbolic trees, and SpaceTrees.

SpaceTrees were developed at Maryland's HCIL lab, and that lab has applied them to biodiversity informatics. The LepTree project has also used them (see LepTaxonTree). I've not been a huge fan, mainly because the existing implementation is a stand-alone Java program, which somewhat limits its utility. But JIT changes all that.

To get a sense of whether SpaceTrees would be useful, I took Belmonte's second SpaceTree example as a starting point. In this example, nodes are created on demand (rather than loading the entire tree into memory). It proved relatively straightforward (after getting my head around making Ajax requests using Mootools) to modify the example to load nodes from a local copy of the Catalogue of Life 2008 classification.
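The server side of loading nodes on demand can be quite small. The sketch below is one possible shape for it, in Python: given a flat table of nodes with parent pointers, it returns a single node plus its immediate children as JSON. The id, name, data and children keys follow the tree structure that, as far as I understand it, the JIT expects. The table layout, the function name, and the sample rows are hypothetical, and this is not the code behind the demo.

```python
import json

# Hypothetical rows from a local copy of a classification: (node_id, name, parent_id).
ROWS = [
    (1, "Animalia", None),
    (2, "Chordata", 1),
    (3, "Arthropoda", 1),
    (4, "Amphibia", 2),
]


def node_with_children(rows, node_id):
    """Return one node and its immediate children only, as a nested dict."""
    by_id = {row[0]: row for row in rows}
    if node_id not in by_id:
        raise KeyError(node_id)

    def as_node(row, children):
        return {"id": str(row[0]), "name": row[1], "data": {}, "children": children}

    # Children are left empty; the browser asks again when one of them is clicked.
    kids = sorted((row for row in rows if row[2] == node_id), key=lambda row: row[1])
    return as_node(by_id[node_id], [as_node(row, []) for row in kids])


# What an Ajax request for node 1 might get back.
response_body = json.dumps(node_with_children(ROWS, 1))
print(response_body)
```

Returning one level at a time keeps each response small regardless of how large the whole classification is, which is the point of creating nodes on demand.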



I've put a live version of the Catalogue of Life SpaceTree up at http://bioguid.info/demos/spacetree. It doesn't do much beyond displaying the tree, together with some basic information about each node. But I think it shows the power of JavaScript to create pleasing visualisations, and the potential of SpaceTrees as a simple tool to browse large taxonomic classifications.