
Citations, Social Media & Science

Quick note that Morgan Jackson (@BioInFocus) has written a nice blog post, Citations, Social Media & Science, inspired by the fact that the following paper:

Kwong, S., Srivathsan, A., & Meier, R. (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, no–no. doi:10.1111/j.1096-0031.2012.00408.x

cites my "Dark taxa" in the body of the text but not in the list of literature cited. This prompted some discussion of DOIs and blog posts on Twitter:



Read Morgan's post for more on this topic. While I personally would prefer to see my blog posts properly cited in papers like doi:10.1111/j.1096-0031.2012.00408.x, I suspect the authors did what they could given current conventions (blogs lack DOIs, are treated differently from papers, and many publishers cite URLs in the text, not the list of references cited). If we can provide DOIs (ideally from CrossRef so we become part of the regular citation network), suitable archiving, and — most importantly — content that people consider worthy of citation then perhaps this practice will change.

Post GBIC2012 thoughts

I'm back from Copenhagen and GBIC2012. The meeting spanned three fairly intense days (with the days immediately before and after also working days for some of us), and was run by a group of facilitators led by Natasha Walker, who described us as "an interesting (and delightfully brainy, if sometimes scatty) group of academics, researchers, museum managers and people close to policy...". I've attempted to capture tweets about the meeting using Storify.

There will be a document (perhaps several) based on the meeting, but until then here are a few quick thoughts. Note that the comments below are my own and you shouldn't read into this anything about what directions the GBIC document(s) will actually take.

Microbiology rocks


Highlight of the first day was Robert J. Robbins's talk, which urged the audience to consider that life was mostly microbial, that the things most people in the room cared about were actually merely a few twigs on the tree of life, that the tree of life didn't actually exist anyway, and that many of the concepts that made sense for multicellular organisms simply didn't apply in the microbial world. Basically it was a homage to Carl Woese (see also Pace et al. 2012 doi:10.1073/pnas.1109716109) and a wake-up call to biodiversity informaticians to stop viewing the world through multicellular eyes. (You can find all the keynotes from the first day here.)

F1 large
From Pace, N. R. (1997). A Molecular View of Microbial Diversity and the Biosphere. Science, 276(5313), 734–740. doi:10.1126/science.276.5313.734

Sequences rule


The future of a lot of biodiversity science belongs to sequences, from simple DNA barcoding as a tool for species discovery and identification, and metabarcoding as a tool for community analysis, to comparisons of metabolic pathways and beyond. The challenge for classical biodiversity informatics is how to engage with this, and to what extent we should try and map between, say, sequences and classical taxa, or whether it might make more sense (gasp) to abandon the taxonomic legacy and move on. Perhaps a more nuanced response is that the point of connection between sequences and classical biodiversity data is unlikely to be at the level of taxonomic names (which are mostly tags for collections of things that look similar) but at the level of specimens and observations.

Ontologies considered harmful


This is my own particular hobby horse. Often the call would come "we need an ontology", to which I respond: read "Ontology is Overrated: Categories, Links, and Tags". I have several problems with ontologies. The first is that they are too easy to make and distract from the real problem. From my perspective a big challenge is linking data together, that is going from

a

to

b

Let's leave aside what "A" and "B" are (I suspect it matters less than people think); once we have the link then we can start to do stuff. From my perspective, what ontologies give us is basically this:

c

So now we know the "type" of the link (e.g., "is a part of", "cites approvingly", etc.). I'm not arguing that this isn't useful to have, but if you don't have the network of links then typing the links becomes an idle exercise.

To give an example, the web itself can be modelled as simply nodes connected by links, ignoring the nature of the links between the web pages. The importance of those links can be inferred later from properties of the network. To a first approximation this is how Google works: it doesn't ask what the links "mean", it simply investigates the connections to determine how important each web page is. In the same way, we build citation networks without bothering to ask the nature of the citation (yes, I know there are ontologies for citations, but is anyone willing to bet how widely they'll be adopted?).
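To make the "untyped links are already useful" point concrete, here is a minimal sketch of a PageRank-style calculation in Python. It is only an illustration of the general idea, not a description of how Google actually ranks pages; the function name `pagerank`, the damping value, and the toy graph `web` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate node importance from untyped links.

    links maps each node to the list of nodes it links to.
    Nothing here knows (or cares) what a link "means".
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its importance equally among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links shares with everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy network: three pages link to "c", and "c" links to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

In this toy network "c" comes out as the most important node, followed by "a", purely because of where the links point.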

My second complaint is that building ontologies is easy, "easy" in the sense that you get a bunch of people together, they squabble for a long time about terminology, and out comes an ontology. Maybe, if you're lucky, someone will adopt it. The cost of making ontologies, and indeed of adopting them, is relatively low (although it might not seem like it at the time). The cost of linking data is, I'd argue, higher, because it requires that you trust someone else's identifiers to the extent that you use them for things you care about deeply. Consider the citation network that is emerging from the widespread adoption of DOIs by the publishing industry. Once people trust that the endpoints of the links will survive, then the network starts to grow. But without that trust, that leap of faith, there's no network (unless you have enough resources to build the whole thing internally yourself, which is what happened with the closed citation network owned by Thomson Reuters). It's much easier to silo the data using unique identifiers than it is to link to other data (it's a variant of the "not invented here" syndrome).

Lastly, ontologies can have short lives. They reflect a certain world view that can become out of date, or supplanted if the relationships between things that the ontology cares about can be computed using other data. For example, biological taxonomy is a huge ontology that is rapidly being supplanted by phylogenetic trees computed from sequence (and other) data (compare the classification used by flagship biodiversity projects like GBIF and EOL with the Pace tree of life shown above). Who needs an ontology when you can infer the actual relationships? Likewise, once you have GPS the value of a geographic ontology (say of place names) starts to decline. I can compute if I'm on a mountain simply by knowing where I am.

I'm not saying ontologies are always bad (they're not), nor that they can't be cool (they can be), I'm just suggesting that they aren't the first thing you need. And they certainly aren't a prerequisite for linking stuff together.

Google flu trends


Perhaps the most interesting idea that emerged was the notion of intelligently detecting changes in biodiversity (which is the kind of thing a lot of people want to know) in a way analogous to how Google.org's Flu Trends uses flu-related search terms to predict flu outbreaks:

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634

Could we do something like this for biodiversity data? For various reasons this suggestion became known at GBIC2012 as the "Heidorn paradigm".

Thinking globally


One challenge for a meeting like GBIC 2012 is scope. There's so much cool stuff to think about. From my perspective, a useful filter is to ask "what will happen anyway?" In other words, there is a lot of stuff (for example the growth of metabarcoding) that will happen regardless of anything the biodiversity informatics community does. People will make taxon-specific ontologies for organismal traits, digitise collections, assess biodiversity, etc. without necessarily requiring an entity like GBIF. The key question is "what won't happen at a global scale unless GBIF (or some other entity) gets involved?"

A Vast Machine

Lastly, in one session Tom Moritz mentioned a book that he felt we could learn from (A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming). The book recounts the history of climatology and its slow transition to a truly global science. I've started to read it, and it's fascinating to see the interplay between early visions of the future, and the technology (typically driven by military or large-scale commercial interests) that made possible the realisation of those visions. This is one reason why predicting the future is such a futile activity: the things that have the biggest effect come from unexpected sources, and affect things in ways it's hard to anticipate. On a final note, it took about a minute from the time Tom mentioned the book to the time I had a copy from Amazon in the Kindle app on my iPad. Oh that accessing biodiversity data were that simple.

Using orthographic projections to map organism distributions

For a project I'm currently working on I show organism distributions using data from GBIF, and I display that data on a map that uses the equirectangular projection. I've recently started to create a series of base maps using the GBIF colour scheme, which is simple but effective:

  • #666698 for the sea
  • #003333 for the land
  • #006600 for borders
  • yellow for localities


The distribution map is created by overlaying points on a bitmap background using SVG (see SVG specimen maps from SPARQL results for details). SVG is ideally suited to this because you can take the points, plot them in the x,y plane (where x is longitude and y is latitude) then use SVG transformations to move them to the proper place on the map.
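As a rough illustration of the approach, the sketch below builds an SVG overlay in Python. It assumes a 360×180 equirectangular base map (one unit per degree) drawn separately as the background; the function name `distribution_svg` and the sample coordinates are hypothetical. The points are written using raw longitude and latitude, and a single SVG transform moves the origin to the centre of the map and flips the y axis (SVG y increases downwards, latitude increases northwards).

```python
def distribution_svg(points, width=360, height=180, radius=1.0):
    """Return an SVG string plotting (longitude, latitude) pairs.

    Coordinates are decimal degrees; the base map bitmap is assumed
    to be supplied separately as the background.
    """
    circles = "\n".join(
        f'    <circle cx="{lon}" cy="{lat}" r="{radius}" fill="yellow" />'
        for lon, lat in points
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 360 180">\n'
        # Shift (0, 0) to the map centre, then flip y so north is up.
        f'  <g transform="translate(180,90) scale(1,-1)">\n'
        f'{circles}\n'
        f'  </g>\n'
        f'</svg>'
    )


# Two made-up localities (longitude, latitude) in the southern oceans.
svg = distribution_svg([(69.5, -49.3), (-70.0, -55.0)])
```

Because the transform does all the work, the same list of points can be reused with a different base map by changing only the `transform` attribute.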

For the base maps themselves I've also started to use SVG, partly because it's possible to edit them with a text editor (for example if you want to change the colours). I then use Inkscape to export the SVG to a PNG to use on the web site.

Gbif360x180

One thing that has bothered me about the equirectangular projection is that, although it is familiar and easy to work with, it gives a distorted view of the world:



This is particularly evident for organisms that have a circumpolar distribution. For example, Kerguelen's petrel Aphrodroma has a distribution that looks like this using the equirectangular projection:

A1

This long, thin distribution looks rather different if we display it on a polar projection:
A2

Likewise, classic Gondwanic distributions such as that of Gripopterygidae become clearer on a polar projection.

g

Computing the polar coordinates for a set of localities is straightforward (see for example this page) and using SVG to lay out the points also helps, because it's trivial to rotate them so that they match the orientation of the map. Ultimately it would be nice to have an embedded, rotatable 3D globe (like the Google Earth plugin, or a Javascript+SVG approach like this). But for now I think it's nice to have the option of using different projections available to help display distributions more faithfully.
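For readers who want to try this, here is a minimal sketch of one way to compute coordinates for a pole-centred map, using the standard polar aspect of the orthographic projection. It is not necessarily the calculation used for the maps above; the function name `polar_orthographic` and its parameters are assumptions for this example.

```python
import math


def polar_orthographic(lon, lat, south=True, radius=1.0):
    """Project a locality onto a pole-centred orthographic map.

    lon and lat are in decimal degrees. Returns (x, y) with the pole
    at (0, 0), or None if the point lies in the hidden hemisphere.
    """
    if (south and lat > 0) or (not south and lat < 0):
        return None  # the other side of the globe is not visible
    lam = math.radians(lon)
    phi = math.radians(lat)
    x = radius * math.cos(phi) * math.sin(lam)
    y = radius * math.cos(phi) * math.cos(lam)
    if not south:
        y = -y  # the north polar aspect mirrors the y axis
    return (x, y)


# A made-up sub-Antarctic locality.
x, y = polar_orthographic(69.5, -49.3)
```

The resulting x,y values can be dropped into an SVG group and rotated with a `rotate(...)` transform to match the orientation of the base map.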

The bitmap maps and their SVG sources are available on github.

Planet management, GBIF, and the future of biodiversity informatics


Next week I'm in Copenhagen for GBIC, the Global Biodiversity Informatics Conference. The goal of the conference is to:
...convene expertise in the fields of biodiversity informatics, genomics, earth observation, natural history collections, biodiversity research and policy needed to set such collaboration in motion.

The collaboration referred to is the agreement to mobilise data and informatics capability to meet the Aichi Biodiversity Targets.

I confess I have mixed feelings about the upcoming meeting. There will be something like 100 people attending the conference, with backgrounds ranging from pure science to intergovernmental policy. It promises to be interesting, but whether a clear vision of the future of biodiversity informatics will emerge is another matter.

GBIC is part of the process of "planet management", a phrase that's been around for a while but that I only came across in Bowker's essay "Biodiversity Datadiversity" [1]:

Bowker, G. C. (2000). Biodiversity Datadiversity. Social Studies of Science, 30(5), 643–683. doi:10.1177/030631200030005001

Bowker's essay is well worth a read, not least for the choice quotes such as:

Each particular discipline associated with biodiversity has its own incompletely articulated series of objects. These objects each enfold an organizational history and subtend a particular temporality or spatiality. They frequently are incompletely articulated with other objects, temporalities and spatialities — often legacy versions, when drawing on non-proximate disciplines. If one wants to produce a consistent, long-term database of biodiversity-relevant information the world over, all this sounds like an unholy mess. At the very least it suggests that global panopticons are not the way to go in biodiversity data. (p. 675, emphasis added)

and
I have not, in general, questioned the mania to name which is rife in the circles whose work I have described. There is no absolutely compelling connection between the observation that many of the world’s species are dying and the attempt to catalogue the world before they do. If your house is on fire, you do not necessarily stop to inventory the contents before diving out the window. However, as Jack Goody (1977) and others have observed, list-keeping is at the heart of our body politic. It is also, by extension, at the heart of our scientific strategies. Right or wrong, it is what we do. (p. 676, emphasis added)

Given that I'm a fan of the notion of a "global panopticon", and spend a lot of time fussing with lists of names, I find Bowker's views refreshing. Meantime, roll on GBIC2012.



1. Bowker cites Elichirigoity as a source of the term "planet management":

Fernando Elichirigoity (1999), Planet Management: Limits to Growth, Computer Simulations, and the Emergence of Global Spaces (Evanston, IL: Northwestern University Press). ISBN 0810115875 (Google Books oP3wVnKpGDkC).

From the limited Google preview, and the review by Edwards, this looks like an interesting book:

Edwards, P. (2000). Book Review: Planet Management: Limits to Growth, Computer Simulation, and the Emergence of Global Spaces. Fernando Elichirigoity. Isis, 91(4), 828. doi:10.1086/385020 (PDF here)

Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr) we could potentially use the Flickr photos not only as a source of images of living things, but to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods, see The Shape of Alpha, so what about organismal distribution?

This question is part of a Masters project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., typically many photos won't be taken in those regions where you'd actually like to get more data). Secondly, the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:
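The aggregation step is simple enough to sketch. The version below is in Python rather than the PHP used for the actual script, and it skips the Flickr API calls entirely; the function name `bin_localities` and the sample coordinates are made up for illustration. Each 1°×1° square is identified by the latitude and longitude of its south-west corner.

```python
import math
from collections import Counter


def bin_localities(localities):
    """Count (latitude, longitude) pairs per 1 x 1 degree square.

    Each square is keyed by the integer (latitude, longitude) of its
    south-west corner, so floor() is used rather than int(), which
    would round negative coordinates towards zero.
    """
    counts = Counter()
    for lat, lon in localities:
        cell = (math.floor(lat), math.floor(lon))
        counts[cell] += 1
    return counts


# Hypothetical photo locations as (latitude, longitude) in decimal degrees.
photos = [(55.87, -4.29), (55.12, -4.90), (-36.85, 174.76)]
cells = bin_localities(photos)
```

The counts per square can then be mapped to a colour ramp and drawn as 1×1 rectangles on the base map.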

Screenshot

Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question to ask is the extent to which the "crowd" (Flickr users in this case) is essentially replicating the sampling biases already in projects like GBIF that are aggregating data from museum collections (most of which are in the developed world).

The PHP code to fetch the photo data and create the map is available in github. You'll need a Flickr API key to run the script. The github repository has an SVG version of the map (with a bitmap background). A bitmap copy of the map is available on FigShare http://dx.doi.org/10.6084/m9.figshare.92668.

UUIDs

Just for future reference:

More fictional taxa and the myth of the expert taxonomic database

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of Life, which states that it is a "quality-assured checklist", have records that are patently wrong. Here's yet another example.

If you search for the genus Raymondia in the Catalogue of Life you get multiple occurrences of the same species names, e.g.:



Both of these are listed as "provisionally accepted names", supplied by WTaxa: Electronic Catalogue of Weevil names (Curculionoidea). Clearly we can't have two species with the same name, so what's happening?

Firstly, Hustache, A., 1930 is:

Hustache A (1930) Curculionidae Gallo-Rhénans. Annales de la Société entomologique de France 99: 81-272. http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3

On p. 246 Hustache refers to Raymondionymus fossor Aubé, 1864 (see below).

F168 highres

So, Raymondionymus fossor Hustache, A., 1930 is not a new species but simply the citation of a previously published one (it's a chresonym). Hustache cites the author of the name as Aubé, 1864, and you can see the original description by Aubé in BioStor (Description de six espèces nouvelles de Coléoptères d'Europe dont deux appartenant a deux genres nouveaux et aveugles, http://biostor.org/reference/104589). So, if the taxonomic authority should be Aubé, 1864, what about Raymondionymus fossor Ganglebauer, L., 1906? Again, if we track down the original publication (Revision der Blindrüsslergattungen Alaocyba und Raymondionymus, http://biostor.org/reference/104591) it's simply Ganglebauer citing (on p. 142) Aubé's paper, not describing a new species.

Note that the nomenclature of this weevil species is further complicated because Aubé originally described the species as Raymondia fossor, but Raymondia was already in use for a fly (see Über eine neue Fliegengattung: Raymondia, aus der Familie der Coriaceen, nebst Beschreibung zweier Arten derselben, http://biostor.org/reference/104588). To resolve this homonymy Wollaston proposed the name Raymondionymus:

Wollaston, T. V. (1873). XVIII. On the Genera of the Cossonidae. Transactions of the Royal Entomological Society of London, 21(4), 427–652. doi:10.1111/j.1365-2311.1873.tb00645.x http://biostor.org/reference/51301

So, we have a bit of a mess. Unfortunately this mess percolates up through other databases, for example EOL has three different pages for Raymondionymus fossor.

For me the lesson here is that relying on acquiring data from "trusted" sources, curated by "experts" is simply not a tenable strategy for building lists of taxa. If names are essential bits of biodiversity infrastructure upon which we hang other data, then these lists need to be cleaned, which means exposing them to scrutiny, and providing an easy means for errors to be flagged and corrected. Trust is something that is earned, not asserted, and it's time taxonomic databases stop claiming to be authoritative simply because they rely on expert sources. Expertise is no guarantee that you won't make errors.

For me this is one of the key reasons projects like BHL are so important. As more and more of the original literature becomes available, we lessen our reliance on "expertise". We can start to see for ourselves. In other words, "Nullius in verba" ("take nobody's word for it").

70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, resulting in text and figures often being widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original description in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...

BHL and text-mining: some ideas

Some quick notes on possibilities for text-mining BHL (in rough order of priority). Any text-mining would have to be robust to OCR errors. I've created a group of OCR-related papers on Mendeley:

OCR - Optical Character Recognition is a group in Computer and Information Science on Mendeley.

Improve finding taxonomic names in text in face of OCR errors

There is some published research on OCR errors that could be used to develop a tool to improve our ability to index OCR text. The outcome would be improved search in BHL (and other archives). I've touched on some of these issues earlier. One approach that looks interesting is using anagram hashing (see Reynaert, 2008), which may be a cheap way to support approximate string matching in OCR text.

Reynaert, M. (2008). Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. Lecture Notes in Computer Science, 4919:617-630. doi:10.1007/978-3-540-78135-6_53 (PDF here).
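To give a flavour of why anagram hashing is cheap, here is a toy sketch in Python. The key function (sum of each character's code point raised to the fifth power) is my assumption about the general scheme, and the helper names, lexicon, and OCR token are invented for the example; see the paper for the real method. The trick is that a single-character substitution changes the key by a known amount, so candidate corrections can be found by hash lookups rather than by comparing the token with every word in the lexicon.

```python
import string


def anagram_key(word, power=5):
    """Order-independent numeric key: anagrams share the same value."""
    return sum(ord(ch) ** power for ch in word)


def build_index(lexicon):
    """Map each anagram key to the lexicon words that have it."""
    index = {}
    for word in lexicon:
        index.setdefault(anagram_key(word), set()).add(word)
    return index


def candidates(token, index, alphabet=string.ascii_lowercase):
    """Lexicon words whose key matches the token after one substitution.

    Matches are only candidates: because the key ignores character
    order, they should be confirmed with a proper edit distance check.
    """
    key = anagram_key(token)
    found = set(index.get(key, ()))
    for wrong in set(token):
        for right in alphabet:
            # Swapping `wrong` for `right` shifts the key by this amount.
            delta = anagram_key(right) - anagram_key(wrong)
            found |= index.get(key + delta, set())
    return found


# Hypothetical lexicon of names and an OCR token with "h" misread as "b".
index = build_index(["Pinnotheres", "Raymondia", "Thyas"])
matches = candidates("Pinnotberes", index)
```

Here `matches` contains "Pinnotheres", found with a few hundred dictionary lookups regardless of how large the lexicon is.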


Recognition and extraction of literature cited

Given an article extract all the references it cites. There's a fair amount of literature on automated citation extraction, but again we need to do this in the face of OCR errors, and enormous variability in citation styles. The outputs could help build citation indexes, and also serve as data for the "bibliography of life". The citations could also be used to help locate further articles in BHL (e.g., using BioStor's OpenURL resolver).


Improved extraction of named entities (e.g., museum specimen codes) and localities (e.g., latitude and longitudes, place names)

This would enable better geographic searches, and help start to link literature to museum specimen databases.

Automated recognition of articles within scanned volumes

My own approach to finding articles has focussed on finding articles based on citation metadata, e.g. based on article title, journal, volume, and pagination, find corresponding article in BHL:

Page, R. D. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(1), 187. doi:10.1186/1471-2105-12-187

An alternative is to infer articles from just the scanned pages. There has been some limited work on this in the context of BHL:

Lu, X., Kahle, B., Wang, J. Z., & Giles, C. L. (2008). A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’08 (p. 167). Association for Computing Machinery (ACM). doi:10.1145/1378889.1378918 (PDF here)

The NLM has some cool stuff on automatically labelling the parts of a document, see Automated Labeling in Document Images and Ground truth data for document image analysis. See also Distance Measures for Layout-Based Document Image Retrieval.

Other links
Should also note that there's a relevant question on StackOverflow about OCR correction, which has links to tools like OCRspell:

Taghva, K., & Stofsky, E. (2001). OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, 3(3), 125–137. doi:10.1007/PL00013558

Code is on github.

Fictional taxa

Anyone who works with taxonomic databases is aware of the fact that they have errors. Some taxonomic databases are restricted in scope to a particular taxon in which one or more people have expertise; these then get aggregated into larger databases, which may in turn be aggregated by databases whose scope is global. One consequence of this is that errors in one database can be propagated through many other databases.

As an example (for reasons I can't remember), I came across the name "Panisopus" (in the water mite family Thyasidae) but was struggling to find any mention of the taxonomic literature associated with this name. If you Google Panisopus the first two pages are full of search results from ITIS, EOL, GBIF, ZipCodeZoo, all listing several species in the genus, and sometimes taxonomic authorities, but no links to the primary literature. If you search BHL for Panisopus you get nothing, nothing at all. It's as if the name didn't exist.

Turns out, that's exactly the point. The name doesn't exist, other than in the various databases that have consumed other databases and recycled this fictional taxon. After some Googling of authors' names it became clear that "Panisopus" is probably a misspelling of "Panisopsis", which according to ION was published in:

Viets, K. (1926) Eine nomenklatorische Aenderung im Hydracarinen-Genus Thyas C. L. Koch. Zool Anz Leipzig, 66: 145--148

I can't verify this because this article is not available online. But to give one example, ITIS lists the name "Panisopus pedunculata Keonike, 1895" (TSN 83185). This name should be, as far as I can tell, Panisopsis pedunculata (Koenike, 1895), based on Mitchell, 1954 (http://biostor.org/reference/104266, http://dx.doi.org/10.5962/bhl.title.3110) who on page 36 states:

Mitchell

Note that Panisopsis pedunculata was originally described in a different genus (Koenike 1895 precedes the publication of the genus name by Viets in 1926). We can locate Koenike's original publication "Nordamerikanische Hydrachniden" in BHL, which I've added to BioStor http://biostor.org/reference/104265, and the original description appears on p. 192 as Thyas pedunculata (note that ITIS misspells the author's name Koenike [o and e transposed], as well as omitting the parentheses around the name).

What I find a little alarming (if not surprising) is that the entirely fictional genus "Panisopus" and its accompanying species have ended up in numerous taxonomic databases, and these databases consistently appear in the top Google searches for this name. The good news is that it's becoming increasingly easy to discover these errors, in part because more and more taxonomic literature is coming online, making it possible for users to investigate matters for themselves, rather than rely on unsupported statements in taxonomic databases. I'm continually amazed by how little evidence most taxonomic databases provide for any of the assertions that they make. If a database includes a name, I want some evidence that the name is "real". Show me the publication, or at least give me a citation that I can follow up. I can't take these databases on blind faith, because demonstrably they are replete with errors. Ironically, one measure of success in the Internet age is being in the top 10 hits for a Google search. Now, if the top ten hits are all taxonomic databases I get very, very nervous. It's a good sign the name only exists in those databases.

BHL to PDF workflow

Just some random thoughts on creating searchable PDFs for articles extracted from BHL.

Workflow

Taxonomy and the nine billion names of God

In Arthur C. Clarke's short story The Nine Billion Names of God, Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names; once that goal is achieved the Universe will end. As the understandably skeptical programmers leave having completed their task, they look up into the sky and notice that "overhead, without any fuss, the stars were going out."

Leaving aside the delicious irony that arises if we recast this story with the monks replaced by taxonomists, much of our work with taxonomic names seems to be enumerating endless permutations of the same names. Part of the problem is the way some databases store and provide access to names.


The simplest way to represent a taxonomic name is to just have the name (the "canonical name"), without additional bits such as the taxonomic authority. In my view, any taxonomic database that serves names should provide the canonical name. I'm not arguing that they shouldn't provide taxonomic authority information (ideally separately, but could also be as part of a canonical name + authority string), I just want them to also provide just the canonical name. For some reason this seems to upset people (e.g., this thread on the TDWG mailing lists), so let me explain why I think this matters.

Most people use taxonomic names without the authority (just Google a taxonomic name with and without its authority and compare the number of hits). So, if your goal is to be of service to your users, make sure you provide the canonical name.

Then there is the issue of integrating data from different sources. The more parts to the name the more scope there is for ambiguity. For example, my first ever publication was a description of a new species of pea crab, Pinnotheres atrinicola, published in:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

If we look for this name in ION we discover three records:

Pinnotheres atrinacola              urn:lsid:organismnames.com:name:1192320
Pinnotheres atrinicola              urn:lsid:organismnames.com:name:371872
Pinnotheres atrinicola Page 1983    urn:lsid:organismnames.com:name:371873


Two are duplicates of "Pinnotheres atrinicola", with and without the authority; one is a misspelling ("Pinnotheres atrinacola"). Given just the name we already see that it's easy for people to get the spelling wrong and generate lexical variants.

If we now add the authority we get more potential for variation. ION writes the authority as "Page 1983" (no comma), but other databases such as WoRMS write it as Page, 1983 (with comma). So we now have two variations of the name, and two for the authority, so 4 possible strings if we include both name and authority. This combinatorial explosion means that we can rapidly generate lots of strings that are fundamentally the same.
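The arithmetic is easy to check. This small Python sketch enumerates every name and authority combination for the example above; the variable names are just for illustration.

```python
from itertools import product

# Spelling variants of the name and formatting variants of the authority.
name_variants = ["Pinnotheres atrinicola", "Pinnotheres atrinacola"]
authority_variants = ["Page 1983", "Page, 1983"]

# Every name can be paired with every authority string.
combined = [
    f"{name} {authority}"
    for name, authority in product(name_variants, authority_variants)
]

for string in combined:
    print(string)
```

Add a third spelling variant, or a third way of writing the authority, and the number of distinct strings grows multiplicatively, even though there is only one name and one publication behind them all.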

I'm not arguing that taxonomic authorities aren't useful, and I want them wherever they are known, but insisting that databases serve name + authority to the exclusion of just the canonical name is a recipe for disaster. One could argue that users can parse the string into name and authority components, but that's a headache (just take a look at taxon-name-processing for details). Why make users jump through hoops to get basic information?

Another reason I'm wary of taxonomic authority strings is that people don't always understand the conventions. For example, in my previous post I used the following example for names that differed in authority string:

  • Demansia torquata Günther 1862
  • Demansia torquata (Günther, 1862)

The use of parentheses seems a small difference, but (a) it means the strings are different, and (b) the presence or absence of parentheses changes the meaning of the authority. In this example, Demansia torquata Günther 1862 means that Günther is the original author of the name Demansia torquata, and so if I search Günther's publications from 1862 for "Demansia torquata" I will find that name. Demansia torquata (Günther, 1862), on the other hand, means that Günther originally described this species in 1862, but he placed it in a different genus, so my search for "Demansia torquata" in 1862 is likely to be fruitless. So, if the authority is actually (Günther, 1862) but a database tells me it's Günther, 1862 I'd be wasting my time looking for the name in 1862.
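To show how much meaning hangs on the parentheses, here is a rough Python sketch that splits a simple authority string into author, year, and a flag for whether it was parenthesised. The function name `parse_authority` and the `recombined` key are made up for this example, and the pattern only copes with the simple "Author Year" and "(Author, Year)" shapes used above, not with real-world authority strings in general.

```python
import re

# Author, optional comma, whitespace, four-digit year (simple cases only).
_AUTHORITY = re.compile(r"^(?P<author>.+?),?\s+(?P<year>\d{4})$")

def parse_authority(text):
    """Split an authority string into author, year and a 'recombined' flag.

    'recombined' is True when the authority is in parentheses, i.e. the
    species was originally described in a different genus.
    """
    text = text.strip()
    in_parens = text.startswith("(") and text.endswith(")")
    if in_parens:
        text = text[1:-1].strip()
    match = _AUTHORITY.match(text)
    if match is None:
        return None  # not one of the simple shapes this sketch understands
    return {
        "author": match.group("author"),
        "year": int(match.group("year")),
        "recombined": in_parens,
    }

plain = parse_authority("Günther 1862")
parenthesised = parse_authority("(Günther, 1862)")
print(plain)
print(parenthesised)
```

Both strings yield the same author and year, but only the second tells you not to expect the name under that genus in the 1862 publication.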

As it turns out, this snake was originally described as Diemansia torquata (see "On new species of snakes in the collection of the British Museum" http://biostor.org/reference/50221). The genus name Diemansia differs from Demansia, hence (Günther, 1862) should be correct, but it looks like Diemansia and Demansia are just some of the variations of the same snake genus (see for example http://biodiversitylibrary.org/page/22393791). *Sigh*

Variation in taxonomic authority extends beyond parentheses. In a post on clustering strings I used examples of taxonomic authorities for the genus Helicella:

Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821

There are six different strings here which correspond to three different authorities. In this example the name Helicella is a homonym (same name used for different taxa) so having the taxonomic authority can help decide which name is actually meant, but people can't seem to agree on how to spell the authority names, and in other cases they might not agree on dates of publication, hence we get variations such as those above. Even when authorities are useful, they come at a cost. And that's not even considering chresonyms where the authority isn't the original author, but instead is a form of citation of the use of a name.
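As a rough illustration of how such strings might be grouped, here is a small Python sketch using the standard library's `difflib`. This is a naive greedy clustering with an arbitrary 0.8 similarity threshold, not the method used in the clustering post; the function names and the threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(strings, threshold=0.8):
    """Greedy clustering: add each string to the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for s in strings:
        for group in clusters:
            if similarity(s, group[0]) >= threshold:
                group.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# The six authority strings for Helicella listed above.
authorities = [
    "Ferrusac 1821",
    "Bonavita 1965",
    "Ferussa 1821",
    "Fer.",
    "Lamarck 1812",
    "Ferussac 1821",
]

clusters = cluster(authorities)
for group in clusters:
    print(group)
```

The three spelled-out variants dated 1821 end up together, but the abbreviation "Fer." is left on its own, so this sketch finds four groups rather than the three real authorities. Abbreviations need extra handling that plain string similarity doesn't provide.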

All of this variation is a cause of ambiguity, and when we combine permutations of taxonomic names and taxonomic authorities, things start to get messy. Indeed, I'd argue that projects such as the Global Names Index (GNI) are essentially doing what Arthur C. Clarke's monks were doing, trying to capture near endless permutations of the same names. Given this, it seems crazy not to try and keep things as simple as possible. In the vast majority of cases I want the name, I don't want the rest of the cruft attached to it. Taxonomic authorities are really just proxies for citation, so let's focus on getting that information linked to names, and stop making life difficult for users.

Visualising differences between classifications using cluster maps

As part of a project to build a tool to navigate through taxonomic names and classifications I've become interested in quick ways to compare classifications. For example, EOL has multiple classifications for the same taxon, and I'd like to quickly discover what the similarities and differences are.

One promising approach is to use "cluster maps", a technique described by Fluit et al. (see Aduna Cluster Map for an implementation):

Fluit, C., Sabou, M., & Harmelen, F. (2006). Visualizing the Semantic Web. (V. Geroimenko & C. Chen, Eds.) (pp. 45–58). Springer Science + Business Media. doi:10.1007/1-84628-290-X_3 (see also http://www.cs.vu.nl/~frankh/abstracts/VSW05.html)

Cluster map details

Cluster maps can be thought of as fancy Venn Diagrams, in that they can be used to depict the overlap between sets of objects. The diagram is a graph with two kinds of nodes. One represents categories (in the example above, file formats and search terms), the other represents sets of objects that occur in one or more categories (in the example above, these are files that match the search terms "rdf" and "aperture").

I've cobbled together a crude version of cluster maps. For a given taxon (e.g., a genus) I list all the immediate sub-taxa (e.g., species) in each classification in EOL, and then find the sets of sub-taxa that are shared across the classification sources (e.g., ITIS, NCBI, etc.) and those that are unique to one source. I then create the cluster map using Graphviz. Inspired by the hexagonal packing used by Aduna, I've done something similar to display the taxa in each set. Adding these to the output of Graphviz required a little fussing with. First I get Graphviz to output the graph in SVG, then I load the SVG into a program that locates each node in the graph and inserts SVG for the packed circles (given that SVG is XML this is fairly straightforward).
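The set-finding step can be sketched in a few lines of Python. This is an illustration of the idea rather than the code behind the diagrams: the function names are invented, the species names in the sample data are placeholders, and the Graphviz output is just a plain DOT string with one box per source and one circle per set (without the packed circles described above).

```python
from collections import defaultdict

def cluster_sets(classifications):
    """Group names by the exact combination of sources they occur in.

    classifications maps a source (e.g. "ITIS") to a set of sub-taxon names.
    The result maps a frozenset of sources to the names found in exactly
    those sources.
    """
    sources_for_name = defaultdict(set)
    for source, names in classifications.items():
        for name in names:
            sources_for_name[name].add(source)
    clusters = defaultdict(set)
    for name, sources in sources_for_name.items():
        clusters[frozenset(sources)].add(name)
    return dict(clusters)

def to_dot(clusters):
    """Write the cluster map as an undirected Graphviz graph."""
    lines = ["graph clustermap {"]
    for source in sorted({s for key in clusters for s in key}):
        lines.append(f'  "{source}" [shape=box];')
    ordered = sorted(clusters.items(), key=lambda item: sorted(item[0]))
    for i, (sources, names) in enumerate(ordered):
        lines.append(f'  set{i} [shape=circle, label="{len(names)}"];')
        for source in sorted(sources):
            lines.append(f'  "{source}" -- set{i};')
    lines.append("}")
    return "\n".join(lines)

# Placeholder data, not real EOL content.
sample = {
    "CoL": {"Aus bus", "Aus cus"},
    "ITIS": {"Aus bus", "Aus dus"},
    "NCBI": {"Aus eus"},
}
clusters = cluster_sets(sample)
print(to_dot(clusters))
```

The DOT text can be passed to Graphviz (for example with `dot -Tsvg`) to get SVG output that can then be post-processed.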

As an example, consider the genus Demansia (http://eol.org/pages/34967/overview). EOL reports four classifications for this genus. Below is a cluster map for this genus:

34967

This diagram shows that, for example, the Catalogue of Life (CoL) and the Reptile Database share four names, and that these two databases share three other names with ITIS. All databases have names unique to themselves, and one database (NCBI) is completely disconnected from the other three.

One important caveat here is that I'm mapping the scientific names as returned by EOL, and in many cases these contain the taxonomic authority. This is a major headache, prompting this outburst:


If we clean the names by removing the taxonomic authority the clusters overlap rather more:

Demansia

Now we see that only ITIS and the Reptile Database have unique names. This is one reason why I get stroppy when taxonomists start saying databases shouldn't have to supply cleaned "canonical" names. If the names have authorities then I have to clean them, because in many cases the authorities (while useful to know) are inconsistent across databases. For example:

  • Demansia olivacea GRAY 1842 versus Demansia olivacea (Gray, 1842)
  • Demansia torquata GÜNTHER 1862 versus Demansia torquata (Günther, 1862)

Taxonomic authorities are frequently misspelt, and people seem confused about when to use parentheses or not. Databases should spare the user some pain and provide clean names (and authority strings separately where they have them).
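For the simple cases above, cleaning can be as crude as keeping the capitalised genus and any all-lowercase words that follow it. The Python sketch below does just that; the function name `canonical_name` is made up, and the heuristic is an assumption that will fail on subgenera in parentheses, hybrid formulae, and authors whose names begin in lower case ("de", "von"), which is part of why parsing names properly is a headache.

```python
import re

# Genus (capitalised word) followed by zero or more all-lowercase epithets.
_CANONICAL = re.compile(r"^\s*([A-Z][a-z]+)((?:\s+[a-z][a-z-]*(?=\s|$))*)")

def canonical_name(name_string):
    """Return the leading 'Genus epithet ...' part of a name string,
    dropping whatever follows (typically the authority)."""
    match = _CANONICAL.match(name_string)
    if match is None:
        return None
    return " ".join((match.group(1) + match.group(2)).split())

examples = [
    "Demansia olivacea GRAY 1842",
    "Demansia olivacea (Gray, 1842)",
    "Demansia torquata GÜNTHER 1862",
    "Demansia torquata (Günther, 1862)",
]
cleaned = {canonical_name(e) for e in examples}
print(sorted(cleaned))
```

Four different strings collapse to two names, which is the effect that makes the clusters overlap more once authorities are removed.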

The visualisation is still incomplete (I need to make it interactive), but it shows promise. The names that are unique to one database are usually worth investigating. In some cases they are names other databases regard as synonyms, in other cases they represent spelling variations. The goal of this visualisation is to highlight the names that the user might want to investigate further.

Using a zoomable treemap to visualise a taxonomic classification

One visualisation method I keep coming back to is the treemap. Each time I experiment with them I learn a little bit more, but I usually end up abandoning them (with the exception of using quantum treemaps to display bibliographic data). But they keep calling me back.

My latest experiment builds on some earlier thoughts on quantum treemaps, but tackles two issues that have kept bugging me. The first is that quantum treemaps are limited to hierarchies that are only two levels deep (e.g., family → genus → species). This is because, unlike regular treemaps where you are slicing and dicing a rectangle of predetermined size, when you construct a quantum treemap you don't know how big it will be until you've made it (this is because you want to ensure that every item in the hierarchy can be displayed at the same size, and fitting them in may require you to tweak the size of the treemap). Given that taxonomic classifications have > 2 levels this is a problem. One approach is to construct quantum treemaps for the lower parts of the classification, then pack those into a larger rectangle. This is an instance of the packing problem. After Googling for a bit I came across this code for packing rectangles, which was easy to follow and gave reasonable results.
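To give a feel for the packing step, here is a minimal Python sketch of "shelf" packing, one of the simplest rectangle-packing heuristics. It is a stand-in chosen for brevity, not the code linked above, and the function name and sample sizes are invented. Each rectangle would be one pre-built quantum treemap whose width and height are already known.

```python
def shelf_pack(sizes, max_width):
    """Pack rectangles into rows ("shelves") no wider than max_width.

    sizes is a list of (width, height) pairs. Returns a list of
    (x, y, width, height) tuples in the same order as the input.
    """
    # Placing the tallest rectangles first keeps each shelf reasonably tight.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][1], reverse=True)
    placed = [None] * len(sizes)
    x = y = shelf_height = 0
    for i in order:
        w, h = sizes[i]
        if w > max_width:
            raise ValueError("rectangle is wider than the container")
        if x + w > max_width:
            # Current shelf is full: start a new one below it.
            y += shelf_height
            x = 0
            shelf_height = 0
        placed[i] = (x, y, w, h)
        x += w
        shelf_height = max(shelf_height, h)
    return placed

# Hypothetical treemap sizes, in arbitrary units.
sizes = [(4, 3), (3, 3), (2, 2), (5, 1)]
layout = shelf_pack(sizes, max_width=8)
for rect in layout:
    print(rect)
```

Shelf packing wastes more space than cleverer approaches, but it is easy to follow and never produces overlapping rectangles.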

The second problem is that I want the treemap to be interactive. I want to be able to zoom in and out and navigate around the treemap. After more Googling, I came across the Zoomooz.js library which makes web page elements zoom (for a pretty mind-blowing example of what can be done see impress.js), but I decided I want to work with SVG. After playing with examples from Keith Wood's jQuery SVG plugin I started to get the hang of creating zoomable visualisations in SVG.

Here's a video of what I've come up with so far (you can see this live at http://iphylo.org/~rpage/zoomrect/primates.html). This is an interactive display of the Catalogue of Life 2010 classification of primates, with images from EOL. It's crude, there are some obvious issues with redrawing images, labels, etc., but it gives a sense of what can be done. With care this could probably be scaled up to handle the entire Catalogue of Life classification. With a bit more care, it could probably be optimised for the iPad, which would be a fun way to navigate through the diversity of life.

Discovering species descriptions in digitised newspapers: Trove and The Brisbane Courier


While exploring ways to visually compare classifications I came across the Australian snake name Demansia atra, and ended up reading a series of papers in the Bulletin of Zoological Nomenclature discussing the status of the name (more fun than it sounds, trust me). For example, Smith and Wallach ("Case 2920. Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes): proposed conservation of the specific name") asked the ICZN to conserve the name, whereas Shea ("On the proposed conservation of the specific name of Diemenia atra Macleay, 1884 (currently Demansia atra; Reptilia, Serpentes)") argued that Hoplocephalus vestigiatus was the correct name for the snake (OK, perhaps not that much fun).

The reason I bring this up is that the original description of Hoplocephalus vestigiatus was published in an Australian newspaper, The Brisbane Courier, 13 September 1884 (!). This newspaper has been digitised and is available in Trove, a digital archive hosted by the National Library of Australia. The description of Hoplocephalus vestigiatus appears in an account of a meeting of the Royal Society of Queensland: http://trove.nla.gov.au/ndp/del/article/3434083.

The Trove newspaper archive has both scanned images and OCR text, rather like BHL, but also enables users to correct OCR errors. The original text looked like this:

Mr De Vis then read a communication en
titled " Desciiptioua of >icw Snakes' -
This papei fe ive the descriptions of four
udditions to our Austi alian snake fauna,
and was prefaced by a synopsis and ditfii ential
characters of the genoa Huplo fjihaliui, to
which two of the anakca were referred-a genus
which Mr De Via stated w as out of all propor
tion larger than any other of anakea in Aus
traha, and, though consisting for the most
deadly reptile of Queensland-the brown
banded or "tiger snake " The new sn ikes were
Hoplocephalus snlcans-the furrow snake, so
called from the faculty the reptile has of con
verting ita ventral surface into a continuous
furrow, forwaidcd by Mr. C W de Burgh
Birch from the Mitchell district Hnplore
phalus lestigialiis, the foot punt snake, a name
of the white mai kings upon its back to tracks
of feet, Cacoplm, Wari o from Warroo station,
in the Port Curtía district, where it was
collected by Mi Llackman, one of the mern
bcis of the soeietj , and IJrarln/ioma Snther
lundi, a snakefrom Carl Creek, Norman River,
dcdicatcel to Mi J Sutherland, of normanton,


A few quick edits in Trove and it looks like this:

Mr. De Vis then read a communication en-
titled "Descriptions of New Snakes."—
This paper gave the descriptions of four
additions to our Australian snake fauna,
and was prefaced by a synopsis and differential
characters of the genus Hoplocephalus to
which two of the snakes were referred—a genus
which Mr. De Vis stated was out of all propor-
tion larger than any other of snakes in Aus-
tralia, and, though consisting for the most
deadly reptile of Queensland—the brown
banded or "tiger snake." The new snakes were
Hoplocephalus sulcans—the furrow snake, so
called from the faculty the reptile has of con-
verting its ventral surface into a continuous
furrow, forwarded by Mr. C. W. de Burgh
Birch from the Mitchell district . Hoploce-
phalus vestigitatus, the foot-print snake, a name
said to be suggested by the fancied resemblance
of the white markings upon its back to tracks
of feet; Cacophis Warro, from Warroo station,
in the Port Curtis district, where it was
collected by Mr. Blackman, one of the mem-
bers of the society; and Brachysoma Suther-
landi, a snake from Carl Creek, Norman River,
dedicated to Mr. J. Sutherland, of Normanton.

Searching Trove for Hoplocephalus I discovered a number of articles on snakes, some of which have also had their OCR text corrected, a measure of the success the project has had in engaging users. Trove has come up several times in discussions about OCR correction and BHL, but this is the first time I've taken a closer look; I didn't expect to find species descriptions in an Australian newspaper.

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

Hyla

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or by going to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for an NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
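As a minimal sketch of the extraction step, the Python below pulls GBIF taxon ids out of RDF/XML and builds the density service URL quoted above. The sample RDF is a hand-made fragment wrapped around the seeAlso statement shown earlier (the wiki's real export is much larger, and the placeholder subject URI and second link are invented), the function names are hypothetical, and the sketch makes no network calls, so it says nothing about whether the 2012-era GBIF URLs still resolve.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

def gbif_ids_from_rdf(rdf_xml):
    """Return GBIF taxon ids found in rdfs:seeAlso links in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    ids = []
    for element in root.iter(f"{{{RDFS_NS}}}seeAlso"):
        resource = element.get(f"{{{RDF_NS}}}resource", "")
        match = re.search(r"data\.gbif\.org/species/(\d+)", resource)
        if match:
            ids.append(match.group(1))
    return ids

def density_url(taxon_id):
    """Build the occurrence density URL in the form given in the text."""
    return f"http://data.gbif.org/ws/rest/density/list?taxonconceptkey={taxon_id}"

# Hand-made sample; only the GBIF seeAlso line comes from the wiki's RDF.
sample = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:rdfs="{RDFS_NS}">
  <rdf:Description rdf:about="http://example.org/placeholder">
    <rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>
    <rdfs:seeAlso rdf:resource="http://en.wikipedia.org/wiki/Japanese_Tree_Frog"/>
  </rdf:Description>
</rdf:RDF>"""

ids = gbif_ids_from_rdf(sample)
urls = [density_url(taxon_id) for taxon_id in ids]
print(urls)
```

Parsing the XML rather than running a regular expression over the raw text means other seeAlso links (such as the Wikipedia one) are skipped cleanly.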

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Can you trust EOL?

There's a recent thread on the Encyclopedia of Life concerning erroneous images for the crab Leptograpsus. This is a crab I used to chase around rocks on stormy west-coast beaches near Auckland, so I was a little surprised to see the EOL page for Leptograpsus looks like this:

Leptograpsus

The name and classification are those of the crab, but the image is of a fish (Lethrinus variegatus). Perhaps at some point in aggregating the images the two taxa, which share the abbreviated name "L. variegatus", got mixed up.

Now, errors like this are bound to happen in a project the size of EOL, and EOL has some pretty active efforts to clean up errors (e.g., the Homonym Hunters). But what bothers me about this example is the prominent label Trusted that appears below the image. If I look at all the images for Leptograpsus on EOL, I see "trusted" images for fish. All images of the crab (i.e., the real Leptograpsus) are labelled "unreviewed" and implicitly "untrusted":

Leptograpsus2

If you are going to claim something is "trusted" you need to be very careful. The images of the fish may well come from a trusted source (FishBase), and FishBase's assertion that the image is of Lethrinus variegatus may well be "trusted", but I certainly can't trust the assertion made by EOL that this image depicts a crab.

In this example the error is easy to spot (if you know that crabs and fish are different), but what if the error was more subtle? Or what if you are using EOL's API and explicitly asking for only content you can trust? Then you get the fish images (see https://gist.github.com/2850321).

If I can't trust "trusted" then EOL has a problem. One way forward is to unpack the notion of "trust" and make sure the user knows what "trusted" means. In this case there are at least two assertions being made:
  1. This image is of a fish (made by FishBase)
  2. This image is of a crab (made by EOL)

EOL needs to make clear what assertions are being made, and which ones it is stating can be "trusted". Ideally it also needs to move away from blanket assertions of "trusted" versus not trusted, because that's far too coarse (just because FishBase knows about fish I'm not sure I'm going to put equal trust that every image it contains has been correctly identified). Trust is something that is conferred by users and acquired over time, not something to be simply asserted.

The GBIF classification is broken — how do we fix it?

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help link to GBIF occurrence records.

The GBIF backbone classification ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.

On the page for the article "Further additions to the known marine Molluscan fauna of St. Helena" (http://biostor.org/reference/88554, see also http://dx.doi.org/10.1080/00222939208677383) part of the classification looks like this:

└Animalia
  └Annelida
    └Polychaeta
      └Sabellida
        └Serpulidae
          └Hipponyx

Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification it appears in the annelid worms.

Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of Hipponyx and Hipponix under the genus Hipponix, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, Hipponix credited to DeFrance 1819 Journ. de Physique, 88, 217, and Hipponyx credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (Mémoire sur un nouveau genre de mollusque) and Bull. Sci. Soc. philom. Paris, 8. (Sur un nouveau genre de coquilles (Hipponix)). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling Hipponix (no 'y'). I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).

OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem to be in use (see Google NGrams chart below).



So, bit of a mess, but this still doesn't deal with Hipponyx being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a specimen record from Te Papa labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:

Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. doi:10.1080/00288306.1971.10426332

with the following abstract:
An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.

The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.

The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.

The author of the paper (Charles Fleming) argues that Hipponyx inexpectata, regarded as a mollusc by its describer (Marjorie K. Mestayer, see Notes on New Zealand Mollusca. No. 4.) is actually a worm, and he moves it to the genus Temporaria.

So it seems that the reason Hipponyx has ended up being a worm in the GBIF classification is due to this synonymy.

Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, but some may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so we fix it (and so that people don't think Hipponyx is a worm). As has been said several times before, most eloquently by David Shorthouse, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.

EOL challenge draft proposal

In the spirit of the Would you give me a grant experiment? [1] here's the draft of a proposal I'm working on for the Computable Data Challenge. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to link names to the primary literature. Comments welcome (the proposal deadline is next week).

The proposal is embedded below using Google's PDF viewer, if you can't see it try logging into your Google account, or click here.



1. The answer from NERC was a resounding "no".

Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this:

Obsolete

So the sequence is hidden. You can retrieve it by clicking on the obsolete version link, but by default it is hidden.

It's an extraordinary state of affairs that a huge slice of fundamental biodiversity data has been effectively "pulled" from view.

Update: Sujeevan Ratnasingham from iBOL has pointed out that the sequence I'd used above (GU679674) was not one of the ones hidden by NCBI, rather it was suppressed at the request of the investigator (which I'd have realised if I'd paid more attention to the screenshot). HQ918317 is an example of a BOLD record that was suppressed:

Hq