One thing I find myself doing a lot is creating Excel spreadsheets and filling them with lists of taxonomic names and bibliographic references, for which I then try to extract identifiers (such as DOIs). This is a tedious business, but the hope is that by doing it once I can create a useful resource. However, often I get bored and the spreadsheets lie forgotten in some deep recess of my computer's hard drive.
It occurs to me that making these spreadsheets publicly available would be useful, but how to do this? In particular, how to do this in a way that makes it easy for me to extract recent edits, and to update the data from new sources? Google Spreadsheets seems an obvious answer, but I wasn't aware of just how obvious until I started playing with the spreadsheet APIs. These enable you to add data via the API (using HTTP PUT and ATOM), which means that I can easily push new data to the spreadsheet.
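As a rough illustration of what one of those Atom payloads might look like, here is a minimal sketch that builds a single row entry. It assumes the list-feed convention in which each spreadsheet column becomes an element in the `gsx` namespace; the column names (`name`, `doi`) and the row values are made-up placeholders, and the sketch only constructs the XML, it does not contact Google.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumption: columns are expressed in the spreadsheet "extended" namespace.
GSX_NS = "http://schemas.google.com/spreadsheets/2006/extended"

# Use readable prefixes instead of ns0/ns1 when serialising.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("gsx", GSX_NS)


def build_row_entry(row):
    """Return an Atom <entry> (as a string) with one gsx element per column."""
    entry = ET.Element("{%s}entry" % ATOM_NS)
    for column, value in row.items():
        cell = ET.SubElement(entry, "{%s}%s" % (GSX_NS, column))
        cell.text = value
    return ET.tostring(entry, encoding="unicode")


# Hypothetical row: placeholder values, not real IPNI data.
print(build_row_entry({"name": "Example name", "doi": "10.0000/example"}))
```

The resulting string would be the body of the HTTP request sent to the spreadsheet's feed URL, which is not shown here because it depends on the spreadsheet in question.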
As a test, I've harvested the IPNI RSS feeds I created earlier (see http://bioguid.info/rss), extracted basic details about the name and any bibliographic identifiers my RSS generator had found, and sent these direct to a Google Spreadsheet. Some IPNI references didn't parse, so I can manually edit these, and many references lack an identifier (my tools usually find those with DOIs). Often with a bit of searching one can turn up a URL or a Handle to a paper, or even simply expand on the bibliographic details (which are a bit skimpy in IPNI). I'm also toying with using Zotero as an online bibliographic store for references that don't have an online presence.
So, what I've got now is a spreadsheet that can be edited, updated, and harvested, and will persist beyond any short-term enthusiasm I have for trying to annotate IPNI.
NCBI RDF
Following on from the last post, I've now set up a trivial NCBI RDF service at bioguid.info/taxonomy/ (based on the ISSN resolver I released yesterday and announced on the Bibliographic Ontology Specification Group).
If you visit it in a web browser it's nothing special. However, if you choose to display XML you'll see some simple RDF. I've mapped some NCBI fields to corresponding terms in http://rs.tdwg.org/ontology/voc/TaxonConcept# (including the deprecated rankString term, which really shouldn't be deprecated, IMHO). I've also extracted what LSIDs I can from any linkouts. For example, a name that appears in Index Fungorum will have the corresponding LSID, likewise for IPNI. URLs are simply listed as rdfs:seeAlso.
Here's the RDF for NCBI taxon 101855 (you can grab this from http://bioguid.info/taxonomy/101855):
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#"
xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#">
<tc:TaxonConcept rdf:about="taxonomy:101855">
<dcterms:title>Lulworthia uniseptata</dcterms:title>
<dcterms:created>1999-08-16</dcterms:created>
<dcterms:modified>2005-01-19</dcterms:modified>
<dcterms:issued>1999-09-14</dcterms:issued>
<tc:nameString>Lulworthia uniseptata</tc:nameString>
<tc:rankString>species</tc:rankString>
<tcommon:taxonomicPlacementFormal>cellular organisms, Eukaryota, Fungi/Metazoa group, Fungi, Dikarya, Ascomycota, Pezizomycotina, Sordariomycetes, Sordariomycetes incertae sedis, Lulworthiales, Lulworthiaceae, Lulworthia</tcommon:taxonomicPlacementFormal>
<tc:hasName rdf:resource="urn:lsid:indexfungorum.org:names:105488"/>
<rdfs:seeAlso rdf:resource="http://www.marinespecies.org/aphia.php?p=taxdetails&amp;id=100407"/>
<rdfs:seeAlso rdf:resource="http://www.mycobank.org/MycoTaxo.aspx?Link=T&amp;Rec=105488"/>
<rdfs:seeAlso rdf:resource="http://www.indexfungorum.org/Names/namesrecord.asp?RecordId=105488"/>
<rdfs:seeAlso rdf:resource="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=194551"/>
<rdfs:seeAlso rdf:resource="http://www.mycobank.org/MycoTaxo.aspx?Link=T&amp;Rec=341143"/>
</tc:TaxonConcept>
</rdf:RDF>
Note the tc:hasName link to urn:lsid:indexfungorum.org:names:105488.
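To show how a client might pull that link out, here is a minimal sketch using only Python's standard library. The embedded sample is a trimmed copy of the record above (with the ampersand in the URL written as `&amp;` so the parser accepts it); the function name `extract_links` and the shape of the returned dictionary are my own choices for illustration, not part of the service.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept#",
}
# Fully qualified name of the rdf:resource attribute.
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Trimmed sample of the RDF for NCBI taxon 101855.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#">
<tc:TaxonConcept rdf:about="taxonomy:101855">
<tc:nameString>Lulworthia uniseptata</tc:nameString>
<tc:rankString>species</tc:rankString>
<tc:hasName rdf:resource="urn:lsid:indexfungorum.org:names:105488"/>
<rdfs:seeAlso rdf:resource="http://www.mycobank.org/MycoTaxo.aspx?Link=T&amp;Rec=105488"/>
</tc:TaxonConcept>
</rdf:RDF>"""


def extract_links(rdf_xml):
    """Return the name string, LSIDs (tc:hasName) and URLs (rdfs:seeAlso)."""
    root = ET.fromstring(rdf_xml)
    concept = root.find("tc:TaxonConcept", NS)
    return {
        "name": concept.findtext("tc:nameString", namespaces=NS),
        "lsids": [e.get(RDF_RESOURCE) for e in concept.findall("tc:hasName", NS)],
        "see_also": [e.get(RDF_RESOURCE) for e in concept.findall("rdfs:seeAlso", NS)],
    }


print(extract_links(SAMPLE))
```

A proper RDF library would be the better tool for anything beyond a quick look, since this sketch depends on the exact XML layout shown here rather than on the underlying triples.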
All a bit crude. The NCBI lookup is live (i.e., it's not served from a local copy of the database). I'll look at fixing this at some point, as well as caching the linkout lookups (one advantage of the live query is that you get the three dates: created, modified, and published). But for now it's a starting point for playing with SPARQL queries across NCBI taxonomy, Index Fungorum, and IPNI using a common vocabulary.
NCBI taxonomy, TDWG vocabularies, and RDF

Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).
Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers, and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.
One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.
To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).
Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.
Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that
urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488
and I can add the statement
urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716
What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including Uniprot (source of the diagram above). The problem is that each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in Uniprot RDF (http://www.uniprot.org/taxonomy/101855):
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
<rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
<rank rdf:resource="http://purl.uniprot.org/core/Species"/>
<scientificName>Lulworthia uniseptata</scientificName>
<otherName>Zalerion maritimum</otherName>
<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
<partOfLineage>false</partOfLineage>
</rdf:Description>
</rdf:RDF>
Uniprot has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, Zoobank, and ION that are serving taxon names.
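To make the payoff concrete, here is a small sketch of the kind of join a shared vocabulary would allow. It treats statements as plain (subject, predicate, object) tuples rather than using a real triple store or SPARQL. The first triple, linking the NCBI taxon to the Index Fungorum name via tc:hasName, is the link I'm assuming the new RDF would supply, and `taxonomy:101855` is only a placeholder identifier for the taxon; the helper names are likewise hypothetical.

```python
# Statements as (subject, predicate, object) tuples.
TRIPLES = [
    # Assumed link from the NCBI taxon to the Index Fungorum name.
    ("taxonomy:101855", "tc:hasName",
     "urn:lsid:indexfungorum.org:names:105488"),
    # From resolving the Index Fungorum LSID for the later name.
    ("urn:lsid:indexfungorum.org:names:369395", "tname:hasBasionym",
     "urn:lsid:indexfungorum.org:names:105488"),
    # The statement added by hand, linking the later name to its DOI.
    ("urn:lsid:indexfungorum.org:names:369395", "tcommon:publishedInCitation",
     "doi:10.1017/S0953756205002716"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in triples."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def later_names_with_citations(triples, taxon):
    """Follow taxon -> name -> names based on it -> where those were published."""
    results = []
    for name in objects(triples, taxon, "tc:hasName"):
        for later in subjects(triples, "tname:hasBasionym", name):
            for citation in objects(triples, later, "tcommon:publishedInCitation"):
                results.append((later, citation))
    return results


print(later_names_with_citations(TRIPLES, "taxonomy:101855"))
```

Starting from the NCBI taxon alone, the three hops recover the later name and the DOI of the paper that published it, which is the sort of annotation I'd like to be able to display alongside the NCBI record.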
Labels: RDF, SPARQL, TDWG, Uniprot, vocabulary