
Linking NCBI to Wikipedia

In an earlier post I discussed linking NCBI taxonomy to Wikipedia. One way to tackle this is to add NCBI Taxonomy IDs to Wikipedia pages. I reopened the case for adding the Taxonomy IDs to the Taxobox on each taxon page, but this met with substantial resistance. A modified proposal to add them elsewhere on the Wikipedia page seems to be gaining more support (or, at least, less vigorous resistance).

Meanwhile, there are other things that need to be done to link NCBI and Wikipedia. One is to add Wikipedia page names to NCBI Linkout, so that when viewing an NCBI taxon page you will see a link to Wikipedia if a page for the corresponding taxon exists. To create this linkout we need a mapping from NCBI to Wikipedia, and that's what I've been working on for the last few days.

The mapping is still in progress, but essentially I've taken a dump of the NCBI taxonomy for June 3, 2010, and matched the names with those in the June 18, 2009 dump of Wikipedia that I've analysed elsewhere on this blog. I'll detail the various steps in the mapping elsewhere (there are issues such as synonyms, homonyms, Wikipedia redirects, etc.), but for now things seem to be working reasonably well.

The mapping is being created in a Semantic Mediawiki at http://iphylo.org/linkout/. When complete you will be able to look up an NCBI taxon by either its name (including synonyms and common names) or its NCBI Taxonomy ID. Where possible I'm mapping the NCBI taxon to Wikipedia, and providing a snippet of text and an image.
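A minimal sketch of this kind of name matching, in Python, might look like the following. It is not the code behind the mapping: it assumes the names come from the names.dmp file in the NCBI dump (fields separated by tab, pipe, tab), that Wikipedia titles and redirects have already been extracted into a list and a dictionary, and it ignores homonyms entirely. The function names and the sample rows are made up for illustration.

```python
def parse_names_dmp(lines):
    """Yield (tax_id, name, name_class) from names.dmp-style lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Fields are separated by "\t|\t" and each row ends with "\t|".
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue
        # Columns: tax_id, name_txt, unique name, name class
        yield int(fields[0]), fields[1], fields[3]


def normalise(title):
    """Treat underscores and spaces in page titles as equivalent."""
    return title.replace("_", " ").strip()


def map_taxa(names, titles, redirects):
    """Return {tax_id: Wikipedia title} for names with a matching page."""
    title_set = {normalise(t) for t in titles}
    redirect_map = {normalise(k): normalise(v) for k, v in redirects.items()}
    mapping = {}
    for tax_id, name, name_class in names:
        name = normalise(name)
        target = redirect_map.get(name, name)  # follow one redirect hop
        if target not in title_set:
            continue
        # A match on the scientific name wins over synonyms and common names.
        if name_class == "scientific name" or tax_id not in mapping:
            mapping[tax_id] = target
    return mapping


# Illustrative rows only, not copied from a real dump.
SAMPLE_ROWS = [
    "174920\t|\tMicrosporus\t|\t\t|\tsynonym\t|",
    "174920\t|\tSphaerius\t|\t\t|\tscientific name\t|",
    "999999999\t|\tNonexistus fictus\t|\t\t|\tscientific name\t|",
]

mapping = map_taxa(
    parse_names_dmp(SAMPLE_ROWS),
    titles=["Sphaerius"],
    redirects={"Microsporus": "Sphaerius"},
)
print(mapping)
```

Taxa with no matching page are simply left out of the result, which is one way of handling the "no Wikipedia page yet" case.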

I've also extracted bibliographic information from the citations.dmp file that comes with the NCBI dump. This contains the comments that you sometimes see on a taxon page. In a few cases I've added some information manually. For example, the beetle genus Sphaerius has a rather complicated nomenclatural history, which the NCBI page summarises as:
Due to a recent ruling (ICZN 2000), the family and generic names Sphaeriusidae Erichson, 1845, and Sphaerius Waltl, 1838, are both available names and have priority over Microsporidae Crotch, 1873 for the family name and Microsporus Kolenati, 1846 for the single included genus, respectively.

By looking through BioStor I've found some of the papers relating to this ICZN ruling, and added them to the wiki page http://iphylo.org/linkout/Ncbi:174920 (aficionados of zoological nomenclature may enjoy the complexity of the case, due to homonymy between the corresponding family name, Sphaeriidae, and a mollusc family of the same name).

Once this mapping is complete, it will be time to think of how to get this into NCBI's Linkout, and also how to automatically update the mapping to reflect the growth of both the NCBI taxonomy and Wikipedia. If you visit http://iphylo.org/linkout/ please be aware that the mapping is still being written to the wiki (this is being done via API calls, and adding some 900,000 pages is going to take a while).

Nexus Data Editor running on Mac OS X

In an earlier post I expressed my amazement that my venerable Nexus Data Editor (NDE) still had users, meaning I had to rebuild the installer so users could install NDE on Windows Vista. Now, Thomas Hauser has gone one better and created an installer for Mac OS X. Given that NDE is a Windows-only program, this is quite a feat. Thomas uses Mike Kronenberg's (@k3erg) WineBottler to create a version of NDE that can be run on a Mac. WineBottler builds on Wine, which enables Windows software to run on Unix-like operating systems.

To run NDE on a Mac, first download WineBottler from http://winebottler.kronenberg.org/ and install it. Then grab the file NDE.dmg that Thomas has created. Install the file NDE in your Applications folder and run it. Note that you will need X11 installed on your Mac. If you don't have this it should be on the installation disk that came with your computer. After a short pause you should see NDE appear in an X11 window. Below is a screen shot showing NDE editing the example file (Bembidion.nex) that comes with the program:
ndemac.png

Many thanks to Thomas for his efforts in packaging NDE with WineBottler, and for making available the NDE.dmg file he created.

TreeBASE II RDF

One of the potentially powerful features of TreeBASE II is the availability of an RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study. For example, if a TreeBASE study has a DOI, then we could link it to bibliographic details for the study, and through them to other information, such as GenBank sequences, specimens, etc. (see my little linked data browser for an example of some of this linking). If we added a phylogeny viewer, then we'd have a great tool for browsing the basic components of a phylogenetic study.

Unfortunately, we're not there yet. I've been trying to make sense of TreeBASE II RDF, and frankly, it's a mess. Here are some of the problems:

TreeBASE URIs aren't linked data compliant
The canonical URI for a study (e.g., http://purl.org/phylo/treebase/phylows/study/TB2:S10423) doesn't conform to the linked data approach. In fact, the URI crashes the linked data validator, so I tried another test.


curl --include \
  --header "Accept: application/rdf+xml" \
  "http://purl.org/phylo/treebase/phylows/study/TB2:S10423"

To be a valid linked data resource this request should return a 303 HTTP status code. Instead we get a 302 and some HTML. Linked data clients won't be able to extract information from this URI.
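The same check can be scripted. The sketch below uses Python's standard http.client, which does not follow redirects, so the first status code and Location header can be inspected directly. The function names fetch_headers and is_see_other are made up for this example, and the test is deliberately narrow: it only asks whether the response is a 303 that names a target, which is the behaviour expected above.

```python
import http.client
from urllib.parse import urlsplit


def fetch_headers(uri, accept="application/rdf+xml"):
    """GET uri once, without following redirects.

    Returns (status, Location header, Content-Type header).
    """
    parts = urlsplit(uri)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=30)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("GET", path, headers={"Accept": accept})
        response = conn.getresponse()
        return (response.status,
                response.getheader("Location"),
                response.getheader("Content-Type"))
    finally:
        conn.close()


def is_see_other(status, location):
    """True if the response is a 303 redirect that names a target."""
    return status == 303 and bool(location)


# Example use (makes a network request, so it is left commented out):
# status, location, content_type = fetch_headers(
#     "http://purl.org/phylo/treebase/phylows/study/TB2:S10423")
# print(status, location, content_type, is_see_other(status, location))
```

A 302 with an HTML body, as described above, would make is_see_other return False.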

SKOS matching
There are some odd things going on in the RDF. It contains statements of the form:

<rdf:Description rdf:ID="otu1789319">
<skos:closeMatch rdf:resource="http://purl.uniprot.org/taxonomy/76066.rdf"/>
</rdf:Description>

(I've tidied this up a little from the original, rather verbose RDF). This asserts that the TreeBASE OTU otu1789319 corresponds to the NCBI taxon with the taxonomy id 76066 (represented by the UniProt URI). Except, it doesn't really. As far as I understand it, SKOS is about matching concepts, not documents. The URI http://purl.uniprot.org/taxonomy/76066.rdf is a document URI (specifically, an RDF document); the URI http://purl.uniprot.org/taxonomy/76066 is the taxon. The match should really be to http://purl.uniprot.org/taxonomy/76066. I've also come across statements that match TreeBASE OTUs to http://purl.uniprot.org/taxonomy/0.rdf. This URI doesn't exist (we get a 404). This seems an odd way to say that we don't have a match -- if we don't have a match, don't include it in the RDF.
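Until the RDF is fixed, a client could apply both corrections itself. The following is a small sketch, assuming the only problems are the trailing ".rdf" and the placeholder id 0; the function name taxon_concept_uri is invented for this example.

```python
UNIPROT_TAXONOMY = "http://purl.uniprot.org/taxonomy/"


def taxon_concept_uri(uri):
    """Map a UniProt taxonomy document URI to the taxon URI.

    Returns None for the placeholder id 0 (no match), and returns
    any URI outside the UniProt taxonomy namespace unchanged.
    """
    if not uri.startswith(UNIPROT_TAXONOMY):
        return uri
    ident = uri[len(UNIPROT_TAXONOMY):]
    if ident.endswith(".rdf"):
        ident = ident[:-len(".rdf")]  # document URI -> concept URI
    if ident == "0":
        return None  # "no match": drop the statement altogether
    return UNIPROT_TAXONOMY + ident


print(taxon_concept_uri("http://purl.uniprot.org/taxonomy/76066.rdf"))
print(taxon_concept_uri("http://purl.uniprot.org/taxonomy/0.rdf"))
```

Returning None rather than a dummy URI lets the caller skip the skos:closeMatch statement instead of pointing it at something that doesn't exist.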

Local URIs for trees don't work
The RDF is full of local URIs such as http://purl.org/phylo/treebase/phylows/#tree1790755, which don't resolve. In fact they generate a rather spectacular Tomcat exception. I don't understand why we need local URIs. Everything in TreeBASE should have a global URI. Then we can avoid unnecessary statements such as:

http://purl.org/phylo/treebase/phylows/#tree1790755 owl:sameAs http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899

which links a local resource to a global one, http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899. Incidentally, this global URI doesn't resolve either, despite claims that this bug has been fixed.

No links between tree and study
But the show stopper for me is that there is no link between a study and a tree! There is no triple in the RDF specifying any relationship between these two entities. To me this is just about the most important thing I need. I want to be able to query TreeBASE RDF using a study identifier (either from TreeBASE itself, or from an external identifier such as a DOI or a PubMed number). As it stands the TreeBASE II RDF is almost useless. I can't get it via a linked data client, it's full of URIs that don't resolve, and it lacks key triples that would glue things together.
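To make the missing piece concrete, here is a toy sketch of the lookup that a single study-to-tree triple would enable. Everything in it is an assumption: the predicate http://example.org/vocab#hasTree is hypothetical (TreeBASE defines no such term that I know of), and pairing this particular study with this particular tree is purely for illustration, not a statement about what TreeBASE contains.

```python
# Hypothetical predicate linking a study to one of its trees.
HAS_TREE = "http://example.org/vocab#hasTree"

STUDY = "http://purl.org/phylo/treebase/phylows/study/TB2:S10423"
TREE = "http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899"

# Triples as (subject, predicate, object) tuples.
# The study/tree pairing is illustrative only.
triples = {
    (STUDY, HAS_TREE, TREE),
}


def trees_for_study(triples, study_uri, predicate=HAS_TREE):
    """Return the sorted tree URIs linked to study_uri by predicate."""
    return sorted(o for s, p, o in triples
                  if s == study_uri and p == predicate)


print(trees_for_study(triples, STUDY))
```

With a triple like this in the RDF, starting from a DOI or PubMed number would only need one more hop, from the external identifier to the study URI.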

RDF != XML

I can't help thinking that the RDF output hasn't been designed with end use in mind. I know from my own experience that it's not until you try to do something with the RDF that you realise how poor some design decisions may have been.

It's not enough to pump out RDF and hope for the best. RDF is not XML, which is just a verbose format for moving data around. RDF brings with it all sorts of expectations about how clients will resolve it, how they will interpret URIs, and the kinds of queries that will be performed. We are achingly close to being able to tie everything together, but not with the RDF that TreeBASE II currently makes available.