Connotea tags

For fun I quickly programmed a little tool for bioGUID that makes use of Connotea's web api. When an article is displayed, the page loads a Javascript script that makes a call to a simple web service that looks up a reference in Connotea and displays a tag cloud if the reference is found. For example, the paper announcing Zoobank (doi:10.1038/437477a) looks like this:


The reference has been bookmarked by 6 people, using 15 tags, some more popular than others. The tags and users are linked to Connotea.

This service can be accessed at http://bioguid.info/services/connotea.php?uri=<doi here>, for example http://bioguid.info/services/connotea.php?uri=doi:10.1038/437477a. By default it returns JSON (you can also set the name of the callback function by adding a &callback= parameter), but you can get HTML by adding &format=html. The HTML is also included in the JSON result, if you want to quickly display something rather than roll your own.
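As a rough illustration of calling the service, here is a minimal Python sketch that builds the request URL from the parameters described above (uri, callback, format) and unwraps a JSONP response. The helper names are my own, and nothing is assumed about the fields inside the JSON beyond it being valid JSON.

```python
import json
from urllib.parse import urlencode

SERVICE = "http://bioguid.info/services/connotea.php"

def connotea_service_url(doi, callback=None, html=False):
    """Build the request URL for a bare DOI such as '10.1038/437477a'."""
    params = {"uri": "doi:" + doi}
    if callback:
        params["callback"] = callback   # name of the JSONP callback function
    if html:
        params["format"] = "html"       # ask for HTML instead of JSON
    # safe=":/" leaves the DOI unescaped, matching the example URL in the text
    return SERVICE + "?" + urlencode(params, safe=":/")

def unwrap_jsonp(text, callback):
    """Strip a 'callback(...)' wrapper, if present, and parse the JSON inside."""
    text = text.strip().rstrip(";")
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        text = text[len(prefix):-1]
    return json.loads(text)
```

Calling connotea_service_url("10.1038/437477a") reproduces the example URL given above.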

Basically the service takes the DOI you supply, converts it to an MD5 hash, then looks it up in Connotea. There were a few little gotchas, such as the fact that the Connotea user may have bookmarked "doi:10.1038/43747" or the proxied version "http://dx.doi.org/10.1038/43747", and these have different MD5 hashes. My service tries both variations and merges the results.
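The hashing step can be sketched in a few lines of Python. This is not the service's actual code: the function names are hypothetical, the Connotea lookup itself is left out, and the merge assumes each lookup yields a simple tag-to-count mapping.

```python
import hashlib
from collections import Counter

def uri_variants(doi):
    """The two forms of a DOI that a Connotea user might have bookmarked."""
    return ["doi:" + doi, "http://dx.doi.org/" + doi]

def md5_hex(uri):
    """MD5 hash of a URI as a 32-character hex string."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

def merge_tag_counts(results):
    """Sum per-tag counts from several lookups (assumed shape: {tag: count})."""
    merged = Counter()
    for tags in results:
        merged.update(tags)
    return dict(merged)

# One hash per variant; each would be looked up separately in Connotea
hashes = [md5_hex(u) for u in uri_variants("10.1038/437477a")]
```

The two hashes differ even though both URIs identify the same article, which is why both have to be tried.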

Accessing specimens using TAPIR or, why do we make this so hard?

OK, second rant of the day. One of my favourite online specimen databases is AntWeb. For a while the ability to harvest data from this database using the venerable DiGIR protocol hasn't been possible, due to various issues at the California Academy of Sciences. Well, now it's back, and "accessible" using TAPIR (TAPIR - TDWG Access Protocol for Information Retrieval). Accessible, that is, if you like horrifically over-engineered, poorly documented standards. OK, a lot of work has gone into TAPIR, there's lots of great code on SourceForge, and there's lots of documentation, but I've really struggled to get the most basic tasks done.

For example, let's imagine I want to retrieve the information on the ant specimen CASENT0100367 (note how trivial this is via a web browser, just append the specimen name to http://www.antweb.org/specimen.do?name=). After much clenching of teeth struggling with the TAPIR documentation and the TAPIR client software, I finally found an email by Markus Döring that gave me the clue. If I'm going to construct a URL to retrieve this specimen record, I need to include the URL of an XML document that serves as a template for the query. Since one doesn't exist, I have to create it and make it accessible to the TAPIR server (i.e., the AntWeb TAPIR server needs to access it, so I have to place this XML document on my web server). The template (shown below) lives at http://bioguid.info/tapir/dwc_catalog_number.xml:

<?xml version="1.0" encoding="UTF-8"?>
<searchTemplate xmlns="http://rs.tdwg.org/tapir/1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xsi:schemaLocation="http://rs.tdwg.org/tapir/1.0
http://rs.tdwg.org/tapir/1.0/schema/tapir.xsd
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema.xsd">
<label>Scientific name in query</label>
<documentation>Query for a Scientific Name. Based on http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml, found in email by Markus Döring http://lists.tdwg.org/pipermail/tdwg-tapir/2008-April/000493.html</documentation>
<externalOutputModel location="http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml"/>
<filter>
<equals>
<concept id="http://rs.tdwg.org/dwc/dwcore/CatalogNumber" />
<parameter name="name"/>
</equals>
</filter>
</searchTemplate>

Now I can write my query:

http://www.antweb.org/tapirlink/www/tapir.php/antweb
?op=search
&start=0
&limit=1
&template=http://bioguid.info/tapir/dwc_catalog_number.xml
&name=casent0100367

So, the AntWeb server is going to read this query, and call my web server to get the query template to figure out what I actually want. Am I the only person who thinks that this is insane? Can anybody imagine going through these hoops to access a GenBank record, or a PubMed record?
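For what it's worth, the query URL shown earlier can be assembled programmatically. The Python sketch below uses only the endpoint, template URL and parameters from this post; the function name is my own, and it builds the URL without sending the request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://www.antweb.org/tapirlink/www/tapir.php/antweb"
TEMPLATE = "http://bioguid.info/tapir/dwc_catalog_number.xml"

def tapir_search_url(catalog_number, start=0, limit=1):
    """Build a TAPIR search URL that fills the template's 'name' parameter."""
    params = [
        ("op", "search"),
        ("start", start),
        ("limit", limit),
        ("template", TEMPLATE),    # the server fetches this to learn the query
        ("name", catalog_number),  # bound to <parameter name="name"/>
    ]
    # safe=":/" leaves the template URL readable, as in the hand-written query
    return ENDPOINT + "?" + urlencode(params, safe=":/")

url = tapir_search_url("casent0100367")
```

Note that the specimen identifier is the only part that changes between requests; everything else is ceremony.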

Perhaps it's me, and my obsession with linking individual data records (rather than harvesting lots of records, or federated search). But it strikes me that harvesting is a simple task and not many people will be doing it (at least, not on the scale of GBIF), and federated search is a non-starter as our community can't keep data providers online to save themselves.

In many ways I think TAPIR (and DiGIR before it) missed what for me is the most basic use case, namely I have a specimen identifier and I want to get the record for that specimen. These services make it much harder than it needs to be. It's a symptom of our field's inability to deliver simple tools that do basic tasks well, rather than overly general and highly complex tools that are poorly documented. Of course, retrieving individual records would be easy if we had resolvable GUIDs for specimens, but we've singularly failed to deliver that, so we are stuck with very clunky tools. There's got to be a better way...

Semantic Publishing: towards real integration by linking

PLoS Computational Biology has recently published "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" (doi:10.1371/journal.pcbi.1000361) by David Shotton and colleagues. As a proof of concept, they took Reis et al. (doi:10.1371/journal.pntd.0000228) and "semantically enhanced" it:
These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps.
The enhanced article is here: doi:10.1371/journal.pntd.0000228.x001. For background on these enhancements, see also David's companion article "Semantic publishing: the coming revolution in scientific journal publishing" (doi:10.1087/2009202, PDF preprint available here). The process is summarised in the figure below (Fig. 10 from Shotton et al., doi:10.1371/journal.pcbi.1000361.g010).



While there is lots of cool stuff here (see also Elsevier's Article 2.0 Contest, and the Grand Challenge, for which David is one of the judges), I have a couple of reservations.

The unique role of the journal article?

Shotton et al. argue for a clear distinction between journal article and database, in contrast to the view articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging. I tend to favour the latter viewpoint. Indeed, as I argued in my Elsevier Challenge entry (doi:10.1038/npre.2008.2579.1), I think we should publish articles (and indeed data) as wikis, so that we can fix the inevitable errors. We can always roll back to the original version if we want to see the author's original paper.

Real linking

But my real concern is that the example presented is essentially "integration by linking", that is, the semantically enhanced version gives us lots of links to other information, but these are regular hyperlinks to web pages. So, essentially we've gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. example). I'm a tad underwhelmed.
What bothers me about this is:
  1. The links are to web pages, so it will be hard to do computation on these (unless the web page has easily retrievable metadata)
  2. There is no reciprocal linking -- the resource being linked to doesn't know it is the target of the link


Web pages are for humans

The first concern is that the marked-up article is largely intended for human readers. Yes, there are associated metadata files in RDF N3, but the core "added value" is really only of use to humans. For it to be of use to a computer, the links would have to go to a resource that the computer can understand. A human clicking on many of the links will get a web page and they can interpret that, but computers are thick and they need a little help. For example, one hyperlinked term is Leptospira spirochete, linked to the uBio namebank record (click on the link to see it). The link resolves to a web page, so it's not much use to a computer (unless it has a scraper for uBio HTML). Ironically, uBio serves LSIDs, so we could retrieve RDF metadata for this name (urn:lsid:ubio.org:namebank:255659), but there's nothing in the uBio web page that tells the computer that.
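To make the point concrete, here is a small Python sketch of what a crawler could do if a page did mention an LSID somewhere in its markup: find it and split it into its parts (an LSID has the form urn:lsid:authority:namespace:object, with an optional revision). The function names and the regular expression are my own, and resolving the LSID to RDF is not shown.

```python
import re
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

# Three or four colon-separated components after the 'urn:lsid:' prefix
LSID_PATTERN = re.compile(r"urn:lsid:[\w.\-]+(?::[\w.\-]+){2,3}", re.IGNORECASE)

def find_lsids(html):
    """Return every LSID-looking string found in a chunk of page text."""
    return LSID_PATTERN.findall(html)

def parse_lsid(text):
    """Split an LSID into authority, namespace, object id and revision."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not an LSID: " + text)
    return LSID(*parts[2:])

name = parse_lsid("urn:lsid:ubio.org:namebank:255659")
# name.authority == "ubio.org", name.namespace == "namebank"
```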

Of course, Shotton et al. aren't responsible for the fact that most web pages aren't easily interpreted by computers, but simply embedding links to web pages isn't a big leap forward. What could they have done instead? One approach is to link to resources that are computer-readable. For example, instead of linking the term "Oswaldo Cruz Foundation" to that organisation's home page (http://www.fiocruz.br/cgi/cgilua.exe/sys/start.htm?tpl=home), why not use the DBpedia URI http://dbpedia.org/page/Instituto_Oswaldo_Cruz? Now we get both a human-readable page, and extensive RDF that a computer can use. In other words, if we crawl the semantically enhanced PLoS article with a program, I want to be able to have that crawler follow the links and still get useful information, not the dead end of an HTML web page. Quite a few of the institutions listed in the enhanced paper have DBpedia URIs:


Why does this matter? Well, if you use DBpedia URIs you get RDF, plus you get connections with the Linked Data crowd, who are rapidly linking diverse data sets together:


I think this is where we need to be headed, and with a little extra effort we can get there, once we move on from thinking solely about human readers.

An alternative approach (and one that I played with in my Challenge entry, as well as my ongoing wiki efforts) is to create what Vandervalk et al. term a "semantic warehouse" (doi:10.1093/bib/bbn051). Information about each object of interest is stored locally, so that clicking on a link doesn't take you off-site into the world wide wilderness, but to information about that object. For example, the page for the paper Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio) lists the papers cited; clicking on one takes you to the page about that paper. There are limitations to this approach as well, but the key thing is that one could imagine doing computations over this (e.g., computing citation counts for DNA sequences, or geospatial queries across papers) that simple HTML hyperlinking won't get you.

Reciprocal links

The other big issue I have with the Shotton et al. "integration by linking" is that it is one-way. The semantically enhanced paper "knows" that it links to, say, the uBio record for Leptospira, but uBio doesn't know this. It would enhance the uBio record if it knew that doi:10.1371/journal.pntd.0000228.x001 linked to it.

Links are inherently reciprocal, in the sense that if paper 1 cites paper 2, then paper 2 is cited by paper 1.

Publishers understand this, and the web page of an article will often show lists of papers that cite the paper being displayed. How do we do this for data and other objects of interest? If we database everything, then it's straightforward. CrossRef is storing citation metadata and offers a "forward linking" service; some publishers (e.g., Elsevier and Highwire) offer their own versions of this. In the same way, this record for GenBank sequence AY322281 "knows" that it is cited by (at least) two papers because I've stored those links in a database. Knowing that you're being linked to dramatically enhances discoverability. If I'm browsing uBio I gain more from the experience if I know that the PLoS paper cites Leptospira.
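When everything lives in one database, reciprocal linking amounts to recording each link in both directions. Here is a toy in-memory sketch in Python; the class and method names are hypothetical, and a real store would of course sit in a database rather than two dictionaries.

```python
from collections import defaultdict

class LinkIndex:
    """Record each link once, answer questions in both directions."""

    def __init__(self):
        self._cites = defaultdict(set)     # source -> targets it links to
        self._cited_by = defaultdict(set)  # target -> sources that link to it

    def add(self, source, target):
        self._cites[source].add(target)
        self._cited_by[target].add(source)

    def cites(self, source):
        return sorted(self._cites.get(source, ()))

    def cited_by(self, target):
        return sorted(self._cited_by.get(target, ()))

index = LinkIndex()
# The enhanced PLoS paper links to the uBio record for Leptospira
index.add("doi:10.1371/journal.pntd.0000228.x001",
          "urn:lsid:ubio.org:namebank:255659")

# Now the uBio record "knows" who links to it
sources = index.cited_by("urn:lsid:ubio.org:namebank:255659")
```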

Knowing when you're being linked to

If we database everything locally then reciprocal linking is easy. But, realistically, we can't database everything (OK, maybe that's not strictly true, one can think of Google as a database of everything). The enhanced PLoS paper "knows" that it cites the uBio record, how can the uBio record "know" that it has been cited by the PLoS paper? What if the act of linking was reciprocal? How can we achieve this in a distributed world? Some possibilities:
  • we have an explicit API embedded in the link so that uBio can extract the source of the link (could be spoofed, need authentication?)
  • we use OpenURL-style links that embed the PLoS DOI, so that uBio knows the source of the link (OpenURL is a mess, but potentially very powerful)
  • uBio uses the HTTP referrer header to get the source of the link, then parses the PLoS HTML to extract metadata and the DOI (ugly screen scraping, but no work for PLoS)
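The second and third options above can be sketched together in Python. The idea is that the citing page appends its own DOI to the outgoing link, and the receiving site reads it back, falling back on the HTTP Referer header (that is how the header name is spelled) when the parameter is absent. I'm assuming the OpenURL referrer key rfr_id and an info:doi/ identifier here, and the target URL is a placeholder rather than a real uBio address; nothing below authenticates the claimed source, so it could be spoofed.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def link_with_source(target_url, source_doi):
    """Append the citing article's DOI to an outgoing link (assumed key: rfr_id)."""
    separator = "&" if urlsplit(target_url).query else "?"
    return target_url + separator + urlencode({"rfr_id": "info:doi/" + source_doi})

def source_of_link(request_url, headers=None):
    """On the receiving side: who linked here? Prefer rfr_id, else the Referer."""
    query = parse_qs(urlsplit(request_url).query)
    if "rfr_id" in query:
        return query["rfr_id"][0]
    # Plain dict lookup, so the header name must match this exact case
    return (headers or {}).get("Referer")

# Hypothetical target URL standing in for a uBio record page
link = link_with_source("http://example.org/namebank/255659",
                        "10.1371/journal.pntd.0000228.x001")
```

With only the Referer to go on, the receiving site gets the URL of the citing page and would still have to scrape that page for the DOI.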

Obviously this needs a little more thought, but I think that real integration by linking requires that the resources being linked are both computer and human readable, and that both resources know about the link. This would create much more powerful "semantically enhanced" publications.