NCBI RDF

Following on from the last post, I've now set up a trivial NCBI RDF service at bioguid.info/taxonomy/ (based on the ISSN resolver I released yesterday and announced on the Bibliographic Ontology Specification Group).

If you visit it in a web browser it's nothing special. However, if you choose to display XML you'll see some simple RDF. I've mapped some NCBI fields to corresponding terms in http://rs.tdwg.org/ontology/voc/TaxonConcept# (including the deprecated rankString term, which really shouldn't be deprecated, IMHO). I've also extracted what LSIDs I can from any linkouts. For example, a name that appears in Index Fungorum will have the corresponding LSID, likewise for IPNI. URLs are simply listed as rdfs:seeAlso.
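As a rough illustration of the linkout-to-LSID step, here is a minimal Python sketch. It assumes an Index Fungorum linkout carries the record number in a RecordId query parameter and that the LSID has the form urn:lsid:indexfungorum.org:names:<id>, which is the pattern visible in the record shown in this post. The function name index_fungorum_lsid is hypothetical, and this is not the code the service actually runs.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def index_fungorum_lsid(url: str) -> Optional[str]:
    """Return an Index Fungorum name LSID for a linkout URL, or None.

    Assumption: the linkout looks like
    http://www.indexfungorum.org/Names/namesrecord.asp?RecordId=105488
    """
    parsed = urlparse(url)
    # Only treat URLs on the Index Fungorum host as candidates.
    if not parsed.netloc.lower().endswith("indexfungorum.org"):
        return None
    record_ids = parse_qs(parsed.query).get("RecordId")
    # Require a purely numeric record id before minting an LSID.
    if not record_ids or not record_ids[0].isdigit():
        return None
    return "urn:lsid:indexfungorum.org:names:" + record_ids[0]
```

Any URL that doesn't match (ITIS, MycoBank, and so on) returns None and would simply stay an rdfs:seeAlso link.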

Here's the RDF for NCBI taxon 101855 (you can grab this from http://bioguid.info/taxonomy/101855):


<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#"
xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#">
<tc:TaxonConcept rdf:about="taxonomy:101855">
<dcterms:title>Lulworthia uniseptata</dcterms:title>
<dcterms:created>1999-08-16</dcterms:created>
<dcterms:modified>2005-01-19</dcterms:modified>
<dcterms:issued>1999-09-14</dcterms:issued>
<tc:nameString>Lulworthia uniseptata</tc:nameString>
<tc:rankString>species</tc:rankString>
<tcommon:taxonomicPlacementFormal>cellular organisms, Eukaryota, Fungi/Metazoa group, Fungi, Dikarya, Ascomycota, Pezizomycotina, Sordariomycetes, Sordariomycetes incertae sedis, Lulworthiales, Lulworthiaceae, Lulworthia</tcommon:taxonomicPlacementFormal>
<tc:hasName rdf:resource="urn:lsid:indexfungorum.org:names:105488"/>
<rdfs:seeAlso rdf:resource="http://www.marinespecies.org/aphia.php?p=taxdetails&amp;id=100407"/>
<rdfs:seeAlso rdf:resource="http://www.mycobank.org/MycoTaxo.aspx?Link=T&amp;Rec=105488"/>
<rdfs:seeAlso rdf:resource="http://www.indexfungorum.org/Names/namesrecord.asp?RecordId=105488"/>
<rdfs:seeAlso rdf:resource="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=194551"/>
<rdfs:seeAlso rdf:resource="http://www.mycobank.org/MycoTaxo.aspx?Link=T&amp;Rec=341143"/>
</tc:TaxonConcept>
</rdf:RDF>


Note the tc:hasName link to urn:lsid:indexfungorum.org:names:105488.
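If you want to pull those links out programmatically, something like the following Python sketch would do. It uses only the standard library's xml.etree.ElementTree, and the embedded SAMPLE is a trimmed copy of the record above (one seeAlso link only). The function name names_and_links is made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept#",
}
RDF_ABOUT = "{" + NS["rdf"] + "}about"
RDF_RESOURCE = "{" + NS["rdf"] + "}resource"

# Trimmed copy of the record for NCBI taxon 101855.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#">
<tc:TaxonConcept rdf:about="taxonomy:101855">
<tc:nameString>Lulworthia uniseptata</tc:nameString>
<tc:hasName rdf:resource="urn:lsid:indexfungorum.org:names:105488"/>
<rdfs:seeAlso rdf:resource="http://www.indexfungorum.org/Names/namesrecord.asp?RecordId=105488"/>
</tc:TaxonConcept>
</rdf:RDF>"""


def names_and_links(rdf_xml):
    """Map each TaxonConcept to its tc:hasName LSIDs and rdfs:seeAlso URLs."""
    result = {}
    for concept in ET.fromstring(rdf_xml).findall("tc:TaxonConcept", NS):
        # Strip stray whitespace around the identifier, just in case.
        about = concept.get(RDF_ABOUT, "").strip()
        result[about] = {
            "names": [e.get(RDF_RESOURCE) for e in concept.findall("tc:hasName", NS)],
            "links": [e.get(RDF_RESOURCE) for e in concept.findall("rdfs:seeAlso", NS)],
        }
    return result
```

Calling names_and_links(SAMPLE) gives a dictionary keyed by taxonomy:101855, with the Index Fungorum LSID under "names".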

All a bit crude. The NCBI lookup is live (i.e., it's not served from a local copy of the database). I'll look at fixing this at some point, as well as caching the linkout lookups (one advantage of the live query is that you can get the three dates: created, modified, and published). But for now it's a starting point for playing with SPARQL queries across NCBI taxonomy, Index Fungorum, and IPNI using a common vocabulary.

NCBI taxonomy, TDWG vocabularies, and RDF


Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).

Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers, and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.

One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.

To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).

Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.

Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that

urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488

and I can add the statement

urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716

What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including UniProt (source of the diagram above). The problem is, each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in UniProt RDF (http://www.uniprot.org/taxonomy/101855):


<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
<rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
<rank rdf:resource="http://purl.uniprot.org/core/Species"/>
<scientificName>Lulworthia uniseptata</scientificName>
<otherName>Zalerion maritimum</otherName>
<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
<partOfLineage>false</partOfLineage>
</rdf:Description>
</rdf:RDF>


UniProt has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, Zoobank, and ION that are serving taxon names.
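To sketch what that linking buys us, here is a toy Python example that stands in for a SPARQL query over a triple store. The last two triples are the statements given earlier in this post. The first triple (NCBI taxon to Index Fungorum name via tc:hasName) is the kind of statement a TDWG-vocabulary version of NCBI would supply, and is assumed here. The helper names are hypothetical.

```python
# A tiny in-memory "triple store": (subject, predicate, object) tuples.
TRIPLES = {
    # Assumed: what an NCBI record in the TaxonConcept vocabulary would state.
    ("taxonomy:101855", "tc:hasName",
     "urn:lsid:indexfungorum.org:names:105488"),
    # From resolving the Index Fungorum LSID.
    ("urn:lsid:indexfungorum.org:names:369395", "tname:hasBasionym",
     "urn:lsid:indexfungorum.org:names:105488"),
    # The statement added by hand, using the DOI found via bioGUID.
    ("urn:lsid:indexfungorum.org:names:369395", "tcommon:publishedInCitation",
     "doi:10.1017/S0953756205002716"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def later_names_with_citations(taxon):
    """Follow taxon -> name -> names based on it -> where they were published."""
    found = []
    for name in objects(taxon, "tc:hasName"):
        for combination in subjects("tname:hasBasionym", name):
            for citation in objects(combination, "tcommon:publishedInCitation"):
                found.append((combination, citation))
    return sorted(found)
```

Starting from taxonomy:101855, the walk reaches name 369395 and its DOI, which is exactly the annotation we'd like to show alongside the NCBI taxon.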

Wikipedia taxonomy, the good, the bad, and the very ugly

In the previous post I suggested that a productive way to meet EOL's goal of a web page per taxon would be to build upon Wikipedia, rather than go it alone. In a nutshell the arguments were:

  1. Wikipedia has considerable traction and has some richly populated taxon pages

  2. The linked data community uses DBPedia.org as a core source of URIs for entities, and as DBPedia is derived from Wikipedia, the latter will be the core source of identifiers for taxa

To explore this a little further I grabbed two files from the 20090618 Wikipedia dump, namely page.sql and templatelinks.sql, and extracted page ids and titles for Wikipedia pages containing the Taxobox template. I then queried Wikipedia for the source for each of these pages, and tried to extract the taxonomic information from each page (a tedious and error-prone process at best).
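For a flavour of the extraction step, here is a deliberately naive Python sketch that reads "| key = value" lines from a Taxobox and strips wiki links and bold/italic quotes. The field names (regnum, phylum, classis) and the sample wikitext are illustrative assumptions, not copied from any particular page, and real pages have nested templates, references, and multi-line values that this ignores.

```python
import re

# "| key = value" on a single line.
FIELD = re.compile(r"^\s*\|\s*(\w+)\s*=\s*(.*?)\s*$")
# [[Target]] or [[Target|Label]]; keep the visible text.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

# Illustrative wikitext only (an assumption about typical Taxobox markup).
SAMPLE_WIKITEXT = """{{Taxobox
| name = Amphibians
| regnum = [[Animal]]ia
| phylum = [[Chordate|Chordata]]
| classis = '''Amphibia'''
}}
Body text of the article follows."""


def parse_taxobox(wikitext):
    """Return the simple key/value fields of the first Taxobox in wikitext."""
    fields = {}
    in_box = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("{{taxobox"):
            in_box = True
            continue
        if in_box and stripped.startswith("}}"):
            break  # end of the template
        if in_box:
            match = FIELD.match(line)
            if match:
                value = LINK.sub(r"\1", match.group(2))
                # Drop wiki bold and italic markers.
                value = value.replace("'''", "").replace("''", "")
                fields[match.group(1)] = value
    return fields
```

Even this toy version hints at why the process is error-prone: every deviation from the one-field-per-line layout needs its own special case.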

I've put together a shockingly crude web page where you can browse the results (warning, this page is a 10 minute hack with little error checking).

There is some good news. There are over 120,000 taxon pages (I've not got an exact figure because the Taxobox template occurs on some pages that aren't taxon pages, such as documentation and user pages). Some pages are extensive (the largest page is Dinosaur for which the source text is 128K in size), and there are lots of links to external references (I counted 7205 distinct DOIs to papers and/or books, and 3248 distinct ISBNs). This represents a degree of external linkage that puts EOL to shame.

However, there are also some major problems. Firstly, Wikipedia does not have a single, internally consistent classification (i.e., the classification is not a tree). This is not unexpected, given that Wikipedia pages comprise semi-structured text that is (largely) manually entered. It's not a database. If it were, the simplest way to ensure consistency would be to have each child node include a pointer to its parent, and when we want a list of the children of the parent node we simply query the database ("what nodes have this node as their parent?"). Because Wikipedia isn't a database, authors have entered these two relationships ("has parent" and "has child") on different pages, and these often conflict. For a spectacular example of this, take a look at the page for Amphibia. When I scraped Wikipedia I extracted the "has parent" link, as this is the simplest way to create a tree. This results in over 200 child taxa for Amphibia, yet the Wikipedia page for Amphibia lists only four child taxa. What appears to be happening is that many fossil taxa are being added to Wikipedia, and since we are often hazy about where they go in the tree, authors are listing their parent taxon as (in this case) "Amphibia". Given this direct link, they should also be listed as children of Amphibia (although, of course, that would make a mess of the Amphibia page). Perhaps the solution is to add an "incertae sedis" taxon page for each taxon, and make that the parent of all the taxa that we aren't sure where to put. This would ensure consistency, but not make the current taxon pages unreadable.
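The parent-pointer idea can be sketched in a few lines of Python. Each taxon stores only its parent, the child list is derived by query, and comparing that derived list with the children a page actually lists exposes the conflicts. The taxon names and both data sets below are placeholders for illustration, not the scraped Wikipedia data.

```python
from collections import defaultdict


def children_index(parent_of):
    """Derive "has child" from "has parent": parent -> sorted list of children."""
    index = defaultdict(list)
    for child, parent in parent_of.items():
        index[parent].append(child)
    return {parent: sorted(kids) for parent, kids in index.items()}


def unlisted_children(parent_of, listed_children):
    """Children implied by parent pointers but missing from the parent's page."""
    report = {}
    for parent, kids in children_index(parent_of).items():
        missing = sorted(set(kids) - set(listed_children.get(parent, [])))
        if missing:
            report[parent] = missing
    return report


# Placeholder data: "has parent" links taken from the child pages...
PARENT_OF = {
    "Anura": "Amphibia",
    "Caudata": "Amphibia",
    "Fossil genus A": "Amphibia",
    "Fossil genus B": "Amphibia",
}
# ...and the "has child" list as entered on the parent page.
LISTED_CHILDREN = {"Amphibia": ["Anura", "Caudata"]}
```

With this data, unlisted_children(PARENT_OF, LISTED_CHILDREN) flags the two fossil placeholders under Amphibia. A report like this is the sort of thing a clean-up bot could work from.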

Homonymy (the same name for different taxa) also rears its ugly head. For example, the page for the crab family Latreilliidae lists the genus Latreillia, which is a fly. In this case, the fly genus Latreillia Robineau-Desvoidy is a junior homonym of the crab genus Latreillia Roux (see http://biodiversitylibrary.org/page/12221111).

Finally, the page titles (which become the basis of DBPedia.org URIs) are a muddled mixture of common and scientific names.

So, what to do? Well, the idea of simply using Wikipedia as is isn't going to fly; it's too broken. We will have to contemplate a concerted effort to fix it (which will require using bots to clean up the inconsistencies). Another option (assuming that we like the Wiki-style environment) is to use a semantic wiki (see my earlier post), which constrains some of the possible markup, but retains a lot of the freedom that makes wikis so powerful.

This isn't an argument for not using Wikipedia as such; it's arguably still much more informative than, say, EOL. It's just that it's showing signs of the limitations of free-form text entry. The trick is to find a way to combine the obvious strengths of this approach (ease of creating and editing pages, massive community support) with the more structured approach needed to avoid the internal inconsistencies that currently bedevil Wikipedia.