
Referring to a one-degree square in RDF using c-squares

I'm in the midst of rebuilding iSpecies (my mash-up of Wikipedia, NCBI, GBIF, Yahoo, and Google search results) with the aim of outputting the results in RDF. The goal is to convert iSpecies from a pretty crude "on-the-fly" mash-up to a triple store where results are cached and can be queried in interesting ways. Why? Partly because I think such a triple store is an obvious way to underpin a "biodiversity hub" of the kind envisaged by PLoS (see my earlier post).

As ever, once one embarks down the RDF route (and I've been here before), one hits all the classic stumbling blocks, such as "what URI do I use for a thing?", and "what vocabulary do I use to express relationships between things?". For example, I'd like to represent the geographic distribution of a taxon as depicted on a GBIF map. How do I describe this in an RDF document?

To make this concrete, take one of my favourite animals, the New Zealand mud crab Helice crassa. Here's the GBIF map for this taxon:

wms.png
This map has the URL (I kid you not):

http://ogc.gbif.org/wms?request=GetMap
&bgcolor=0x666698
&styles=,,,
&layers=gbif:country_fill,gbif:tabDensityLayer,gbif:country_borders,gbif:country_names
&srs=EPSG:4326
&filter=()(
%3CFilter%3E
%3CPropertyIsEqualTo%3E
%3CPropertyName%3Eurl
%3C/PropertyName%3E
%3CLiteral%3E
%3C![CDATA[http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F17462693%2FtenDeg%2F-45%2F160%2F]]%3E
%3C/Literal%3E
%3C/PropertyIsEqualTo%3E
%3C/Filter%3E)()()
&width=721
&height=362
&Format=image/png
&bbox=160,-45,180,-35

(or http://bit.ly/cuTFW9, if you prefer). Now, there's no way I'm using this URL! Plus, the URL identifies an image, not the distribution.

But, if we look at the map we see that it is made of 1° × 1° squares. If each of those had a URI then I could simply list those squares as the distribution of the crab. This seems straightforward as GBIF has a service that provides these squares. For example, the URL http://data.gbif.org/species/17462693 (where 17462693 corresponds to Helice crassa) returns:

MINX MINY MAXX MAXY DENSITY
167.0 -45.0 168.0 -44.0 5
174.0 -42.0 175.0 -41.0 20
174.0 -38.0 175.0 -37.0 17
174.0 -37.0 175.0 -36.0 4

These are the 1° × 1° squares for which there are records of Helice crassa. Now, what I'd like to do is have a URI for each square, and I'd like to do this without reinventing the wheel. I've come across a URI space for points of the globe (the "WGS 84 Geographic Point URI Space"), but not one for polygons. Then it dawned on me that perhaps c-squares, developed by Tony Rees at the CSIRO in Australia, would do the trick [1]. To quote Tony:
C-squares is a system for storage, querying, display, and exchange of "spatial data" locations and extents in a simple, text-based, human- and machine- readable format. It uses numbered (coded) squares on the earth's surface measured in degrees (or fractions of degrees) of latitude and longitude as fundamental units of spatial information, which can then be quoted as single squares (similar to a "global postcode") in which one or more data points are located, or be built up into strings of codes to represent a wide variety of shapes and sizes of spatial data "footprints".

C-squares appeal partly (and this says nothing good about me) because they have a slightly Byzantine syntax. However, they are short, and quite easy to calculate. I'll let the reader find out the gory details. To give an example, my home town, Auckland, has latitude -36.84, longitude 174.74, which corresponds to the 1° × 1° c-square with the code 3317:364.
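For readers who would rather not wade through the gory details, here is a minimal Python sketch of the 1° × 1° encoding as I read the c-squares scheme. The function name `csquare_1deg` is my own, and the way it treats the poles and the 180° meridian is an assumption of this sketch, not something taken from the specification.

```python
def csquare_1deg(lat, lon):
    """Return the 1-degree c-square code for a latitude/longitude pair."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")

    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5

    # Work with absolute values. Clamping keeps the poles and the 180th
    # meridian inside the last square (an assumption of this sketch).
    alat = min(abs(lat), 89.999999)
    alon = min(abs(lon), 179.999999)

    lat_tens, lat_units = divmod(int(alat), 10)
    lon_tens, lon_units = divmod(int(alon), 10)

    # Intermediate quadrant (1-4): which 5-degree quarter of the
    # 10-degree square the point falls in.
    intermediate = 2 * (lat_units // 5) + (lon_units // 5) + 1

    return f"{quadrant}{lat_tens}{lon_tens:02d}:{intermediate}{lat_units}{lon_units}"


print(csquare_1deg(-36.84, 174.74))  # 3317:364 (Auckland)
```

The code before the colon identifies the 10° square (quadrant, tens of latitude, tens of longitude); the three digits after it narrow that down to a single 1° square.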

Now, all I need to do is convert c-squares into URIs. If you append the c-square to http://bioguid.info/csquare:, like this, http://bioguid.info/csquare:3317:364, you get a linked data-friendly URI for the c-square. In a web browser you get a simple web page like this:

csquare.png
A linked data client will get RDF, like this:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
xmlns:geom="http://fabl.net/vocabularies/geometry/1.1/">
<dcterms:Location rdf:about="http://bioguid.info/csquare:3307:364">
<rdfs:label>3307:364</rdfs:label>
<geom:xmin>74</geom:xmin>
<geom:ymin>-37</geom:ymin>
<geom:xmax>75</geom:xmax>
<geom:ymax>-36</geom:ymax>
<dwc:footprintWKT>POLYGON((-37 75,-37 74,-36 74,-36 75,-37 75))</dwc:footprintWKT>
</dcterms:Location>
</rdf:RDF>

Now, I can refer to each square by its own URI. This will also enable me to query a triple store by c-square (e.g., what other taxa occur within this 1° × 1° square?).
  1. Tony Rees had emailed me about this in response to a tweet about URIs for co-ordinates, but it took me a while to realise how useful c-square notation could be.
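Going the other way, a 1° c-square code can be decoded back into a bounding box, which is all that is needed to fill in the `xmin`/`ymin`/`xmax`/`ymax` values and a WKT footprint. This is a rough sketch rather than the code behind bioguid.info; the function names are hypothetical, and the WKT here lists x (longitude) before y (latitude), which is the usual WKT convention.

```python
CSQUARE_PREFIX = "http://bioguid.info/csquare:"


def csquare_uri(code):
    """Build the URI for a c-square by appending the code to the prefix."""
    return CSQUARE_PREFIX + code


def csquare_bbox(code):
    """Return (xmin, ymin, xmax, ymax) in degrees for a 1-degree c-square."""
    ten, one = code.split(":")
    if len(ten) != 4 or len(one) != 3:
        raise ValueError(f"not a 1-degree c-square: {code!r}")
    quadrant = int(ten[0])
    if quadrant not in (1, 3, 5, 7):
        raise ValueError(f"bad global quadrant in {code!r}")

    # Absolute whole degrees of the corner nearest the equator/prime meridian.
    lat = int(ten[1]) * 10 + int(one[1])
    lon = int(ten[2:4]) * 10 + int(one[2])

    # Quadrants 1 and 7 are north; quadrants 1 and 3 are east.
    lat_sign = 1 if quadrant in (1, 7) else -1
    lon_sign = 1 if quadrant in (1, 3) else -1

    ys = sorted((lat_sign * lat, lat_sign * (lat + 1)))
    xs = sorted((lon_sign * lon, lon_sign * (lon + 1)))
    return xs[0], ys[0], xs[1], ys[1]


def bbox_wkt(xmin, ymin, xmax, ymax):
    """WKT polygon for a bounding box, in x y (longitude latitude) order."""
    return (f"POLYGON(({xmin} {ymin},{xmax} {ymin},{xmax} {ymax},"
            f"{xmin} {ymax},{xmin} {ymin}))")


print(csquare_uri("3317:364"))
print(csquare_bbox("3317:364"))  # (174, -37, 175, -36)
```

The bounding box for 3317:364 matches the last row of the GBIF density table above (174.0 -37.0 175.0 -36.0).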

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of these strings was associated with an identifier (such as a DOI), or if one of the strings in the cluster had been successfully parsed into its component metadata so we could find it using an OpenURL resolver, then we've identified the reference the strings correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).
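To make the idea concrete, here is a toy sketch of such a lookup in Python. It assumes the clusters already exist and are keyed by identifier; the cluster contents, the function names, and the 0.6 cutoff are all illustrative, and `difflib` similarity is standing in for whatever string measure would actually work best on citations.

```python
from difflib import SequenceMatcher

# Hypothetical clusters: identifier -> citation strings known to refer to it.
CLUSTERS = {
    "doi:10.1371/journal.pone.0010502": [
        "Knapp S (2010) Four New Vining Species of Solanum (Dulcamaroid Clade) "
        "from Montane Habitats in Tropical America. PLoS ONE",
    ],
    "doi:10.1145/347090.347123": [
        "Efficient clustering of high-dimensional data sets with application "
        "to reference matching",
    ],
}


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


def find_identifier(citation, clusters, cutoff=0.6):
    """Return the identifier of the closest cluster, or None if nothing is close."""
    best_id, best_score = None, 0.0
    for identifier, strings in clusters.items():
        # Score a cluster by its most similar member string.
        score = max(similarity(citation, s) for s in strings)
        if score > best_score:
            best_id, best_score = identifier, score
    return best_id if best_score >= cutoff else None


query = ("Knapp, S. 2010. Four new vining species of Solanum (Dulcamaroid clade) "
         "from montane habitats in tropical America")
print(find_identifier(query, CLUSTERS))
```

Comparing the query against every string in every cluster obviously won't scale to a large collection of citations, which is where a cheap first-pass grouping becomes attractive.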

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
canopy.png
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...
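As a note to self, the canopy step can be sketched in a few lines of Python. This is my reading of the approach, not the paper's code: the cheap distance (one minus the Jaccard overlap of word tokens), the two thresholds, and the choice to always take the first remaining string as the next centre are assumptions made for this example.

```python
import re


def tokens(s):
    """Lower-cased set of alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def cheap_distance(a, b):
    """1 - Jaccard overlap of the token sets: rough, but quick to compute."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def canopies(strings, loose=0.8, tight=0.5):
    """Group strings into (possibly overlapping) canopies."""
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    remaining = list(strings)
    result = []
    while remaining:
        center = remaining[0]
        # Everything within the loose threshold joins this canopy.
        result.append([s for s in strings if cheap_distance(center, s) <= loose])
        # Anything within the tight threshold can no longer start a canopy.
        remaining = [s for s in remaining if cheap_distance(center, s) > tight]
    return result


citations = [
    "Knapp S 2010 Four new vining species of Solanum",
    "Knapp 2010 Four new vining species of Solanum PLoS ONE",
    "Efficient clustering of high-dimensional data sets",
    "Efficient clustering of high dimensional data sets with application to reference matching",
]
for canopy in canopies(citations):
    print(canopy)
```

The more expensive comparison (edit distance, say) would then only be run between strings that share a canopy.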

Linnaeus meets the Internet: PLoS + Botany = #fail

To much fanfare (e.g., Nature News, "Linnaeus meets the Internet" doi:10.1038/news.2010.221), on May 5th PLoS ONE published Sandy Knapp's "Four New Vining Species of Solanum (Dulcamaroid Clade) from Montane Habitats in Tropical America" doi:10.1371/journal.pone.0010502. To quote the Nature News piece:
The paper represents the culmination of a campaign to institute the electronic publication of scientific names, a case Knapp and others have made in journals including Nature[doi:10.1038/446261a]. Allowing electronic publication should make accessing information easier for scientists worldwide — especially those in developing countries who may not have access to fully stocked libraries. This, in turn, will aid conservation efforts, Knapp says.

Given the profile of this paper, "...the first time new plant names have been published in a purely electronic journal and still complied with ICBN rules", you'd think the participants would ensure the electronic aspects of the publication worked. Sadly, this is not the case.

The four names in question have apparently been deposited in IPNI with the following LSIDs:

  • Solanum aspersum: urn:lsid:ipni.org:names:77103633-1

  • Solanum luculentum: urn:lsid:ipni.org:names:77103634-1

  • Solanum sanchez-vegae: urn:lsid:ipni.org:names:77103635-1

  • Solanum sousae: urn:lsid:ipni.org:names:77103636-1


Today is May 6th. None of these names are returned by a search of IPNI, for example http://www.ipni.org/ipni/simplePlantNameSearch.do?find_wholeName= returns this:

ipni1.png

Resolving the LSID returns this:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
xmlns:tm="http://rs.tdwg.org/ontology/voc/Team#"
xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#"
xmlns:p="http://rs.tdwg.org/ontology/voc/Person#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#">
<tn:TaxonName rdf:about="urn:lsid:ipni.org:names:77103633-1">
<tcom:versionedAs rdf:resource="urn:lsid:ipni.org:names:77103633-1:1.2"/>
<tcom:Deleted>Yes</tcom:Deleted>
</tn:TaxonName>
</rdf:RDF>

Hmmm, so apparently this record has been "deleted"?
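A script checking a batch of LSIDs could spot this automatically. The snippet below is a small sketch using Python's standard `xml.etree.ElementTree`; it assumes the metadata has already been fetched, uses a trimmed copy of the response above as sample input, and the function name `is_deleted` is my own.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as declared in the RDF returned for the LSID.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
    "tcom": "http://rs.tdwg.org/ontology/voc/Common#",
}

# Trimmed copy of the metadata shown above (unused namespaces dropped).
SAMPLE = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
 xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
<tn:TaxonName rdf:about="urn:lsid:ipni.org:names:77103633-1">
<tcom:versionedAs rdf:resource="urn:lsid:ipni.org:names:77103633-1:1.2"/>
<tcom:Deleted>Yes</tcom:Deleted>
</tn:TaxonName>
</rdf:RDF>"""


def is_deleted(rdf_xml):
    """True if the TaxonName record carries <tcom:Deleted>Yes</tcom:Deleted>."""
    root = ET.fromstring(rdf_xml)
    flag = root.find("tn:TaxonName/tcom:Deleted", NS)
    return flag is not None and (flag.text or "").strip().lower() == "yes"


print(is_deleted(SAMPLE))  # True
```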

The paper also states that:
The IPNI LSIDs (Life Science Identifiers) can be resolved and the associated information viewed through any standard web browser by appending the LSID contained in this publication to the prefix http://ipni.org/.

This sentence mirrors similar ones in other PLoS ONE papers saying we can resolve ZooBank LSIDs by appending the LSID to http://zoobank.org (e.g., see doi:10.1371/journal.pone.0001787).

Thing is, URLs such as http://ipni.org/urn:lsid:ipni.org:names:77103633-1 return a 404 from Kew (any IPNI LSID I've tried does this).
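For reference, the URL the paper describes is nothing more than string concatenation. Here is a short Python sketch that splits an LSID into its parts (urn:lsid:authority:namespace:object, with an optional revision) and builds that URL; the names `parse_lsid` and `http_proxy_url` are hypothetical, and the default prefix is the one quoted from the paper.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(lsid):
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {lsid!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def http_proxy_url(lsid, prefix="http://ipni.org/"):
    """Append the LSID to an HTTP prefix, as the paper describes."""
    parse_lsid(lsid)  # raises ValueError if the string is not an LSID
    return prefix + lsid


print(parse_lsid("urn:lsid:ipni.org:names:77103633-1:1.2"))
print(http_proxy_url("urn:lsid:ipni.org:names:77103633-1"))
```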


Update As per Alan Paton's comment below, the http://ipni.org prefix now works.


So, to recap:

  1. The names aren't in IPNI

  2. The LSIDs state the record has been deleted

  3. The LSIDs can't be resolved by the means stated in the paper

Now, I don't know what happened (perhaps IPNI wanted to hold off until the paper actually appeared before releasing the names), but the paper is out, the buzz in Nature is out, and IPNI doesn't have the resolver in place, let alone the names.

Given the milestone this paper represents, and the fuss over the publication of the name Darwinius, you'd expect the bioinformatics side of it to be, you know, actually working. In these circumstances, how on Earth do we make the case that the LSID and name databasing side of taxonomic publication is useful?