Discovering species descriptions in digitised newspapers: Trove and The Brisbane Courier


While exploring ways to visually compare classifications I came across the Australian snake name Demansia atra, and ended up reading a series of papers in the Bulletin of Zoological Nomenclature discussing the status of the name (more fun than it sounds, trust me). For example, Smith and Wallach, in "Case 2920. Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes): proposed conservation of the specific name", asked the ICZN to conserve the name, whereas Shea, in "On the proposed conservation of the specific name of Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes)", argued that Hoplocephalus vestigiatus was the correct name for the snake (OK, perhaps not that much fun).

The reason I bring this up is that the original description of Hoplocephalus vestigiatus was published in an Australian newspaper, The Brisbane Courier, on 13 September 1884 (!). This newspaper has been digitised and is available in Trove, a digital archive hosted by the National Library of Australia. The description of Hoplocephalus vestigiatus appears in an account of a meeting of the Royal Society of Queensland: http://trove.nla.gov.au/ndp/del/article/3434083.

The Trove newspaper archive has both scanned images and OCR text, rather like BHL, but also enables users to correct OCR errors. The original text looked like this:

Mr De Vis then read a communication en
titled " Desciiptioua of >icw Snakes' -
This papei fe ive the descriptions of four
udditions to our Austi alian snake fauna,
and was prefaced by a synopsis and ditfii ential
characters of the genoa Huplo fjihaliui, to
which two of the anakca were referred-a genus
which Mr De Via stated w as out of all propor
tion larger than any other of anakea in Aus
traha, and, though consisting for the most
deadly reptile of Queensland-the brown
banded or "tiger snake " The new sn ikes were
Hoplocephalus snlcans-the furrow snake, so
called from the faculty the reptile has of con
verting ita ventral surface into a continuous
furrow, forwaidcd by Mr. C W de Burgh
Birch from the Mitchell district Hnplore
phalus lestigialiis, the foot punt snake, a name
of the white mai kings upon its back to tracks
of feet, Cacoplm, Wari o from Warroo station,
in the Port Curtía district, where it was
collected by Mi Llackman, one of the mern
bcis of the soeietj , and IJrarln/ioma Snther
lundi, a snakefrom Carl Creek, Norman River,
dcdicatcel to Mi J Sutherland, of normanton,


A few quick edits in Trove and it looks like this:

Mr. De Vis then read a communication en-
titled "Descriptions of New Snakes."—
This paper gave the descriptions of four
additions to our Australian snake fauna,
and was prefaced by a synopsis and differential
characters of the genus Hoplocephalus to
which two of the snakes were referred—a genus
which Mr. De Vis stated was out of all propor-
tion larger than any other of snakes in Aus-
tralia, and, though consisting for the most
deadly reptile of Queensland—the brown
banded or "tiger snake." The new snakes were
Hoplocephalus sulcans—the furrow snake, so
called from the faculty the reptile has of con-
verting its ventral surface into a continuous
furrow, forwarded by Mr. C. W. de Burgh
Birch from the Mitchell district. Hoploce-
phalus vestigiatus, the foot-print snake, a name
said to be suggested by the fancied resemblance
of the white markings upon its back to tracks
of feet; Cacophis Warro, from Warroo station,
in the Port Curtis district, where it was
collected by Mr. Blackman, one of the mem-
bers of the society; and Brachysoma Suther-
landi, a snake from Carl Creek, Norman River,
dedicated to Mr. J. Sutherland, of Normanton.

Searching Trove for Hoplocephalus I discovered a number of articles on snakes, some of which have also had their OCR text corrected, a measure of the success the project has had in engaging users. Trove has come up several times in discussions about OCR correction and BHL, but this is the first time I've taken a closer look. I didn't expect to find species descriptions in an Australian newspaper.

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see:

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.
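To make the point about NCBI's merge and deletion lists concrete, here is a minimal sketch of following a tax_id through those lists. It assumes the pipe-delimited layout of the `merged.dmp` and `delnodes.dmp` files in the NCBI taxonomy dump; the function names and the sample ids are invented for illustration.

```python
def parse_merged(lines):
    """Map old tax_id -> new tax_id from lines in merged.dmp layout."""
    merged = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            merged[int(fields[0])] = int(fields[1])
    return merged


def parse_deleted(lines):
    """Collect deleted tax_ids from lines in delnodes.dmp layout."""
    deleted = set()
    for line in lines:
        field = line.split("|")[0].strip()
        if field:
            deleted.add(int(field))
    return deleted


def resolve_tax_id(tax_id, merged, deleted):
    """Return the current tax_id, or None if the id has been deleted."""
    seen = set()
    # Follow chains of merges, guarding against accidental cycles.
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)
        tax_id = merged[tax_id]
    return None if tax_id in deleted else tax_id


# Invented sample data, not real NCBI records.
merged = parse_merged(["111\t|\t222\t|\n", "222\t|\t333\t|\n"])
deleted = parse_deleted(["444\t|\n"])

print(resolve_tax_id(111, merged, deleted))     # 333
print(resolve_tax_id(444, merged, deleted))     # None
print(resolve_tax_id(109175, merged, deleted))  # 109175 (unchanged)
```

Nothing comparable is assumed to exist for GBIF ids, which is why a stored NCBI-to-GBIF link may need re-checking over time.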

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.
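As an illustration of what a crude homonym check could look like (this is not the code used to build the mapping), the sketch below accepts a name match only when exactly one GBIF candidate shares at least one higher taxon name with the NCBI lineage. The record layout, the function name, and the ids are all hypothetical.

```python
def match_taxon(ncbi_name, ncbi_lineage, gbif_candidates):
    """Return a GBIF id for an NCBI taxon, or None if the match looks unsafe.

    gbif_candidates is a list of dicts with hypothetical keys
    'id', 'name' and 'lineage' (a list of higher taxon names).
    """
    same_name = [c for c in gbif_candidates if c["name"] == ncbi_name]
    ncbi_higher = set(ncbi_lineage)
    # Keep only candidates whose classification overlaps the NCBI lineage.
    consistent = [c for c in same_name if ncbi_higher & set(c["lineage"])]
    if len(consistent) == 1:
        return consistent[0]["id"]
    return None  # no match, or still ambiguous: leave for manual review


# "Morus" is both a plant genus (mulberries) and a bird genus (gannets).
# The ids below are invented.
candidates = [
    {"id": 1, "name": "Morus", "lineage": ["Plantae", "Moraceae"]},
    {"id": 2, "name": "Morus", "lineage": ["Animalia", "Aves", "Sulidae"]},
]

print(match_taxon("Morus", ["Metazoa", "Aves", "Sulidae"], candidates))  # 2
print(match_taxon("Morus", [], candidates))                              # None
```

A check like this errs on the side of skipping a link rather than making a wrong one, which suits a workflow where spurious matches are fixed by hand on the wiki.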

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

[Screenshot: iPhylo Linkout page for Hyla japonica, including the GBIF map]

So, we have a wiki page. How do we answer Rutger's original question: how to get GBIF occurrence records via a web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or by going to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or, if you are a serious RDF geek, break out the SPARQL, and you can extract the GBIF taxon id for an NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
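The extraction step can be sketched in a few lines of Python using the standard library. The sample document below is a cut-down stand-in for the real export, which is much larger and may be structured differently; only the `rdfs:seeAlso` statement is taken from the actual output. The function names are made up, and the density URL simply follows the pattern given above.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
GBIF_SPECIES = re.compile(r"^http://data\.gbif\.org/species/(\d+)/?$")


def gbif_ids_from_rdf(rdf_xml):
    """Return GBIF taxon ids found in rdfs:seeAlso links of an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    ids = []
    for element in root.iter("{%s}seeAlso" % RDFS_NS):
        resource = element.get("{%s}resource" % RDF_NS, "")
        match = GBIF_SPECIES.match(resource)
        if match:
            ids.append(match.group(1))
    return ids


def density_url(gbif_id):
    """Build the occurrence density request for a GBIF taxon id."""
    return "http://data.gbif.org/ws/rest/density/list?taxonconceptkey=" + gbif_id


# Simplified stand-in for the exported RDF (assumed structure).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://iphylo.org/linkout/Ncbi:109175">
    <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Japanese_Tree_Frog"/>
    <rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>
  </rdf:Description>
</rdf:RDF>"""

for gbif_id in gbif_ids_from_rdf(SAMPLE):
    print(density_url(gbif_id))
```

Matching on the namespaced element name rather than the `rdfs:` prefix keeps the code working if the export happens to use a different prefix for the same namespace.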

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Can you trust EOL?

There's a recent thread on the Encyclopedia of Life concerning erroneous images for the crab Leptograpsus. This is a crab I used to chase around rocks on stormy west-coast beaches near Auckland, so I was a little surprised to see that the EOL page for Leptograpsus looks like this:

[Screenshot: EOL page for Leptograpsus, showing an image of a fish]

The name and classification are those of the crab, but the image is of a fish (Lethrinus variegatus). Perhaps at some point in aggregating the images the two taxa, which share the abbreviated name "L. variegatus", got mixed up.
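How an abbreviated name could cause such a mix-up is easy to show. The sketch below is only an illustration of the suspected mechanism, not a description of how EOL aggregates images; the function names are invented.

```python
from collections import defaultdict


def abbreviate(binomial):
    """Turn 'Genus species' into 'G. species'."""
    genus, species = binomial.split()[:2]
    return genus[0] + ". " + species


def find_collisions(names):
    """Group full names by abbreviation and keep only the ambiguous ones."""
    groups = defaultdict(set)
    for name in names:
        groups[abbreviate(name)].add(name)
    return {abbr: sorted(full) for abbr, full in groups.items() if len(full) > 1}


names = ["Leptograpsus variegatus", "Lethrinus variegatus", "Hyla japonica"]
print(find_collisions(names))
# {'L. variegatus': ['Leptograpsus variegatus', 'Lethrinus variegatus']}
```

Any pipeline that keys on the abbreviated form, even briefly, would treat the crab and the fish as the same thing.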

Now, errors like this are bound to happen in a project the size of EOL, and EOL has some pretty active efforts to clean up errors (e.g., the Homonym Hunters). But what bothers me about this example is the prominent label Trusted that appears below the image. If I look at all the images for Leptograpsus on EOL, I see "trusted" images for fish. All images of the crab (i.e., the real Leptograpsus) are labelled "unreviewed" and implicitly "untrusted":

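[Screenshot: EOL images for Leptograpsus, with fish images labelled "trusted" and crab images labelled "unreviewed"]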

If you are going to claim something is "trusted" you need to be very careful. The images of the fish may well come from a trusted source (FishBase), and FishBase's assertion that the image is of Lethrinus variegatus may well be "trusted", but I certainly can't trust the assertion made by EOL that this image depicts a crab.

In this example the error is easy to spot (if you know that crabs and fish are different), but what if the error was more subtle? Or what if you are using EOL's API and explicitly asking for only content you can trust? Then you get the fish images (see https://gist.github.com/2850321).

If I can't trust "trusted" then EOL has a problem. One way forward is to unpack the notion of "trust" and make sure the user knows what "trusted" means. In this case there are at least two assertions being made:
  1. This image is of a fish (made by FishBase)
  2. This image is of a crab (made by EOL)

EOL needs to make clear what assertions are being made, and which ones it is stating can be "trusted". Ideally it also needs to move away from blanket assertions of "trusted" versus not trusted, because that's far too coarse (just because FishBase knows about fish doesn't mean I'm going to trust equally that every image it contains has been correctly identified). Trust is something that is conferred by users and acquired over time, not something to be simply asserted.
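One way to picture the difference is to attach trust to each assertion rather than to the image or its source. The following is a hypothetical sketch, not EOL's data model: the class, its fields, the image identifier, and the `reviewed` flags are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    image: str        # hypothetical image identifier
    depicts: str      # taxon the image is claimed to show
    asserted_by: str  # who makes this particular claim
    reviewed: bool    # whether this claim itself has been checked


def is_trusted(assertions, image, taxon):
    """An image is trusted for a taxon only if that specific claim was reviewed."""
    return any(
        a.reviewed for a in assertions if a.image == image and a.depicts == taxon
    )


assertions = [
    Assertion("fish.jpg", "Lethrinus variegatus", "FishBase", True),
    Assertion("fish.jpg", "Leptograpsus variegatus", "EOL", False),
]

print(is_trusted(assertions, "fish.jpg", "Lethrinus variegatus"))     # True
print(is_trusted(assertions, "fish.jpg", "Leptograpsus variegatus"))  # False
```

With trust recorded this way, a request for "trusted" images of the crab would not return the fish, because the claim that links the image to the crab has never been reviewed.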