Search this keyword

TreeBASE II RDF

One of the potentially powerful features of TreeBASE II is availability of a RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study. For example, if a TreeBASE study has a DOI, then we could link it to bibliographic details for the study, and through them to other information, such as GenBank sequences, specimens, etc. (see my little linked data browser for an example of some of this linking). If we added a phylogeny viewer, then we'd have a great tool for browsing the basic components of a phylogenetic study.

Unfortunately, we're not there yet. I've been trying to make sense of TreeBASE II RDF, and frankly, it's a mess. Here are some of the problems:

TreeBASE URIs aren't linked data compliant
The canonical URI for a study (e.g., http://purl.org/phylo/treebase/phylows/study/TB2:S10423) doesn't conform to the linked data approach. In fact, the URI crashes the linked data validator, so I tried another test.


curl --include
--header "Accept: application/rdf+xml"
http://purl.org/phylo/treebase/phylows/study/TB2:S10423

To be a valid linked data resource this request should return a 303 HTTP status code. Instead we get a 302 and some HTML. Linked data clients won't be able to extract information from this URI.

SKOS matching
There are some odd things going on in the RDF. It contains statements of the form:

<rdf:Description rdf:ID="otu1789319">
<skos:closeMatch rdf:resource="http://purl.uniprot.org/taxonomy/76066.rdf">
</rdf:Description>

(I've tidied this up a little from the original, rather verbose RDF). This asserts that the TreeBASE OTU otu1789319 corresponds to the NCBI taxon with the taxonomy id 76066 (represented by the Uniprot URI). Except, it doesn't really. As far as I understand it, SKOS is about matching concepts, not documents. The URI http://purl.uniprot.org/taxonomy/76066.rdf is a document URI (specifically, a RDF document), the URI http://purl.uniprot.org/taxonomy/76066 is the taxon. The match should really be to http://purl.uniprot.org/taxonomy/76066. Then I've come across statements that match TreeBASE OTUs to http://purl.uniprot.org/taxonomy/0.rdf. This URI doesn't exist (we get a 404). This seems an odd way to say that we don't have a match -- if we don't have a match, don't include it in the RDF.

Local URIs for trees don't work
The RDF is full of local URIs such as http://purl.org/phylo/treebase/phylows/#tree1790755, which don't resolve. In fact they generate a rather spectacular Tomcat exception. I don't understand why we need local URIs. Everything in TreeBASE should have a global URI. Then we can avoid unnecessary statements such as:

http://purl.org/phylo/treebase/phylows/#tree1790755 owl:sameAs http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899

which links a local resource to a global one http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899. Incidentally, this URI doesn't resolve, despite claims that this bug has been fixed.

No links between tree and study
But the show stopper for me is that there is no link between a study and a tree! There is no triple in the RDF specifying any relationship between these two entities. To me this is just about the most important thing I need. I want to be able to query TreeBASE RDF using a study identifier (either from TreeBASE itself, or from an external identifier such as a DOI or a PubMed number). As it stands the TreeBASE II RDF is almost useless. I can't get it via a linked data client, it's full of URIs that don't resolve, and it lacks key triples that would glue things together.

RDF != XML

I can't help thinking that the RDF output hasn't been designed with end use in mind. I know from my own experience that it's not until you try to do something with the RDF that you realise how poor some design decisions may have been.

It's not enough to pump out RDF and hope for the best. RDF is not XML, which is just a verbose format for moving data around. RDF brings with it all sorts of expectations about how clients will resolve it, how they will interpret URIs, and the kinds of queries that will be performed. We are achingly close to being able to tie everything together, but not with RDF TreeBASE II is currently making available.

TreeBASE II makes me pull my hair out

I've been playing a little with TreeBASE II, and the more I do the more I want to pull my hair out.

Broken URLs
The old TreeBASE had a URL API, which databases such as NCBI made use of. For example, the NCBI page for Amphibolurus nobbi has a link to this taxon in TreeBASE. The link is http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID. Now, this is a fragile looking link to a Perl CGI script, and sure enough, it's broken. Click on it and you get a 404. In moving to the new TreeBASE II, all these inward links have been severed. At a stroke TreeBASE has cut itself off from an obvious source of traffic from probably the most important database in biology. Please, please, throw in some mod_rewrite and redirect these CGI calls to TreeBASE II.

New identifiers
All the TreeBASE studies and taxa have new identifiers. Why? Imagine if GenBank decided to trash all the accession numbers and start again from scratch. TreeBASE II does support "legacy" StudyIDs, so you can find a study using the old identifier (you know, the one people have cited in their papers). But there's no support for legacy TaxonIDs (such as T31183 for Amphibolurus nobbi). I have to search by taxon name. Why no support for legacy taxon IDs?

Dumb search
Which brings me to search. The search interface for taxa in TreeBASE is gloriously awful:

tbsearch.png

So, I have to tell the computer what I'm looking for. I have to tell it whether I'm looking for an identifier or doing a text search, then within those categories I need to be more specific: do I want a TreeBASE taxon ID (new ones of course, because the old ones have gone), NCBI id, or uBio? And this is just the "simple" search, because there's an option for "Advanced search" below.

Maybe it's just me, I get really annoyed when I'm asked to do something that a computer can figure out. I shouldn't have to tell a computer that I'm searching for a number or some text, nor should I tell it what that number of text means. Computers are pretty good at figuring that stuff out. I want one search box, into which I can type "Amphibolurus nobbi", or "Tx1294" or "T31183" or "206552" or "6457215" or "urn:lsid:ubio.org:namebank:6457215" (or a DOI, or a text string, or pretty much anything) and the computer does the rest. I don't ever want to see this:

tbsearch2.png

Computers are dumb, but they're not so dumb that they can't figure out if something is a number or not. What I want is something close to this:

google.png

Is this really too much to ask? Can we have a search interface that figures out what the user is searching for?

Note to self: Given that TreeBASE has an API, I wonder how hard it would be to knock up a tool that took a search query, ran some regular expressions to figure out what the user might be interested in, then hit the API with that search, and returned the results?

My concern here is that TreeBASE II is important, very important. Which means it's important to make it usable, which means don't break existing URLs, don't make old identifiers disappear, and don't have a search interface that makes me want to pull my hair out.

Linked data part 2

Continuing the Friday folly theme, below is a screencast of a linked data browser that uses the same ideas as last week's screencast, but uses a custom browser I've written to display the results in a more user-friendly way.

Linking the data together from Roderic Page on Vimeo.



The demo is live, you can view it at http://iphylo.org/~rpage/browser/www/uri/http://bioguid.info/doi:10.1371/journal.pone.0001787. Under the hood the browser uses bioGUID as the primary linked data provider (although it should consume any valid linked data source, for example Dbpedia). The data is stored in a local triple store (ARC), and the web interface is created by transforming SPARQL queries into HTML using XSLT. You can add data to it by editing the URL in the browser location bar and reloading the page, or entering a URL on the page. Linked data URLs be entered next to the Browse button as is, e.g. http://dbpedia.org/resource/Euphausia, or appended to http://iphylo.org/~rpage/browser/www, e.g. http://http://iphylo.org/~rpage/browser/www/http://dbpedia.org/resource/Euphausi. Other identifiers, such as DOIs, PubMed ids, and specimens need to be resolved via bioGUID, e.g. http://iphylo.org/~rpage/browser/www/uri/http://bioguid.info/gi:86161637.

All still very crude, but I hope you get the idea.