
iDigBio: You are putting identifiers on the wrong thing

The Integrated Digitized Biocollections (iDigBio) project aims to advance the digitisation of US biodiversity collections. They recently published a GUID Guide for Data Providers. In the PDF document I read this:
It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only. (emphasis added)

My heart sank. There's nothing wrong with having identifiers for metadata (apart from inviting the death spiral that is metadata about metadata), but surely the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.

Now, identifiers for metadata can be useful. For example, there is a specimen of Parathemisto japonica in the National Museum of Natural History, Smithsonian Institution with the label "USNM 100988". The NMNH web site has a picture of the index card for this specimen:

[Image: index card for USNM 100988]

This is an image of the metadata, not the specimen itself. We could link the metadata to this image, but of course we also want to link it to the actual specimen.

Specimens are the things we collect, preserve, dissect, measure, sequence, photograph, and so on. I want to link a specimen to the sequences that have been obtained from it. I want to list the publications that cite it. I want to be able to aggregate data on a specimen from multiple sources. And I want to be able to add annotations that flag misidentifications, simple typos, or missing georeferencing.

Key to this is having identifiers for specimens. Identifiers for metadata about those specimens are not good enough. By analogy with bibliographic citation, one of the important decisions CrossRef made was that DOIs for articles identify the article, not the metadata about the article, or any of the different formats (HTML, PDF, print) an article may occur in. This means we can build databases about things and relationships (this article cites that one, these articles were authored by this person, etc.).

As it stands, if we don't have identifiers for specimens then we can't link data together. For example, the frog specimen "USNM 195785" is depicted in the image below (from EOL):

[Image: specimen USNM 195785, from EOL]

It is also listed in various papers in BioStor. In the absence of a globally unique identifier for this specimen how do I make these links? "USNM 195785" won't do because there are at least four specimens in the USNM with the catalogue number "195785". The GBIF occurrence id for this specimen (http://data.gbif.org/occurrences/244405570) would be an obvious candidate, were it not for the fact that GBIF has no concept of stable identifiers and its occurrence ids regularly change.
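To make the ambiguity concrete, here is a small sketch of what happens when records are keyed by catalogue number rather than by a globally unique identifier. The records are hypothetical: the `urn:example:` identifiers and the collection names are invented for illustration and do not describe real USNM data. Only the catalogue number and the "at least four" count come from the text above.

```javascript
// Hypothetical records for illustration only: the "urn:example:" identifiers
// and the collection names are made up and do not describe real USNM data.
const specimens = [
  { guid: "urn:example:specimen:a1", catalogNumber: "USNM 195785", collection: "collection A" },
  { guid: "urn:example:specimen:b2", catalogNumber: "USNM 195785", collection: "collection B" },
  { guid: "urn:example:specimen:c3", catalogNumber: "USNM 195785", collection: "collection C" },
  { guid: "urn:example:specimen:d4", catalogNumber: "USNM 195785", collection: "collection D" },
];

// Group records by the value of one field.
function groupBy(records, field) {
  const groups = new Map();
  for (const record of records) {
    const key = record[field];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(record);
  }
  return groups;
}

const byCatalogNumber = groupBy(specimens, "catalogNumber");
const byGuid = groupBy(specimens, "guid");

// The catalogue number matches every record, so a citation of it is ambiguous.
console.log(byCatalogNumber.get("USNM 195785").length);

// Each GUID matches exactly one record, so a link to it is unambiguous.
console.log([...byGuid.values()].every((group) => group.length === 1));
```

A paper that cites only "USNM 195785" cannot be resolved to one of these records without extra context, whereas a link to a specimen-level identifier can.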

I confess I'm flabbergasted that iDigBio has avoided tackling the issue of specimen identifiers. If any museum wants to discover how its collection is being used to support science it will want to find the citations of its specimens in scientific papers and databases. This requires identifiers for specimens.

Elsevier articles have interactive phylogenies

Say what you will about Elsevier, they are certainly exploring ways to re-imagine the scientific article. In a comment on an earlier post Fabian Schreiber pointed out that Elsevier have released an app to display phylogenies in articles they publish. The app is based on jsPhyloSVG and is described here. You can see live examples in these articles:

Matos-Maraví, P. F., Peña, C., Willmott, K. R., Freitas, A. V. L., & Wahlberg, N. (2013). Systematics and evolutionary history of butterflies in the “Taygetis clade” (Nymphalidae: Satyrinae: Euptychiina): Towards a better understanding of Neotropical biogeography. Molecular Phylogenetics and Evolution, 66(1), 54–68. doi:10.1016/j.ympev.2012.09.005
Poćwierz-Kotus, A., Burzyński, A., & Wenne, R. (2010). Identification of a Tc1-like transposon integration site in the genome of the flounder (Platichthys flesus): A novel use of an inverse PCR method. Marine Genomics, 3(1), 45–50. doi:10.1016/j.margen.2010.03.001

NEXUS parser and tree viewer in Javascript

Following on from the SVG experiments I've started to put some of the Javascript code for displaying phylogenies on Github. Not a repository yet, but as gists, little snippets of code. Mike Bostock has created http://bl.ocks.org/ which makes it possible to host gists as working examples, so you can play with the code "live".

The first gist takes a Newick tree, parses it and displays a tree. You can try it at https://bl.ocks.org/d/4224658/.
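For readers who want a feel for what parsing a Newick string involves, here is a minimal recursive-descent sketch. It is not the code in the gist: the function names `parseNewick` and `leafLabels`, the node shape `{ label, length, children }`, and the example tree are all assumptions made for illustration. It handles nested clades, labels and branch lengths, but not quoted labels or bracketed comments.

```javascript
// Minimal recursive-descent Newick parser (an illustrative sketch, not the
// gist's code). Does not handle quoted labels or [comments].
function parseNewick(input) {
  const text = input.trim();
  let pos = 0;

  function parseNode() {
    const node = { label: "", length: null, children: [] };
    if (text[pos] === "(") {
      pos++; // consume "("
      node.children.push(parseNode());
      while (text[pos] === ",") {
        pos++; // consume ","
        node.children.push(parseNode());
      }
      if (text[pos] !== ")") {
        throw new Error("Expected ')' at position " + pos);
      }
      pos++; // consume ")"
    }
    // Optional label: everything up to the next structural character.
    const labelStart = pos;
    while (pos < text.length && !"(),:;".includes(text[pos])) pos++;
    node.label = text.slice(labelStart, pos).trim();
    // Optional branch length after ":".
    if (text[pos] === ":") {
      pos++;
      const lengthStart = pos;
      while (pos < text.length && !"(),;".includes(text[pos])) pos++;
      node.length = parseFloat(text.slice(lengthStart, pos));
    }
    return node;
  }

  const root = parseNode();
  if (text[pos] !== ";") throw new Error("Expected ';' at position " + pos);
  return root;
}

// Collect the tip labels in left-to-right order.
function leafLabels(node) {
  if (node.children.length === 0) return [node.label];
  return node.children.flatMap(leafLabels);
}

// Made-up example tree with three tips and one labelled internal node.
const tree = parseNewick("((A:0.1,B:0.2)AB:0.05,C:0.3);");
console.log(leafLabels(tree));
```

The result is a plain nested object, which is convenient because the same structure can be handed straight to a layout or drawing routine.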

The second gist takes a basic NEXUS file containing a TREES block and displays a tree (try it at http://bl.ocks.org/d/4229068/). You can grab example NEXUS tree files from TreeBASE, such as tree Tr57874.
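Getting from a NEXUS file to something a Newick parser can read mostly means finding the TREES block and pulling out its commands. The sketch below is one way that might look. It is not the gist's code, and `parseNexusTrees` and the sample file are invented for illustration (the sample is not a TreeBASE file). It assumes the file has no comments outside the tree command and no quoted names containing ";" or ",".

```javascript
// Extract the TRANSLATE table and tree descriptions from the TREES block of a
// NEXUS file. Illustrative sketch: assumes no quoted names containing ";" or ",".
function parseNexusTrees(nexus) {
  const result = { translate: {}, trees: [] };
  const block = /begin\s+trees\s*;([\s\S]*?)\bend(?:block)?\s*;/i.exec(nexus);
  if (!block) return result;

  // NEXUS commands are terminated by ";", which is also the Newick terminator.
  for (const raw of block[1].split(";")) {
    const command = raw.trim();
    if (/^translate\s/i.test(command)) {
      const body = command.replace(/^translate\s+/i, "");
      for (const entry of body.split(",")) {
        const [token, ...name] = entry.trim().split(/\s+/);
        if (token) result.translate[token] = name.join(" ");
      }
      continue;
    }
    // e.g. "TREE name = [&R] ((1,2),3)"; the [&R] rooting comment is optional.
    const tree = /^u?tree\s+\*?\s*(\S+?)\s*=\s*(?:\[[^\]]*\]\s*)?([\s\S]+)$/i.exec(command);
    if (tree) {
      result.trees.push({ name: tree[1], newick: tree[2].trim() + ";" });
    }
  }
  return result;
}

// Made-up sample file for illustration.
const sample = `#NEXUS
BEGIN TREES;
  TRANSLATE
    1 Homo_sapiens,
    2 Pan_troglodytes,
    3 Gorilla_gorilla;
  TREE example = [&R] ((1,2),3);
END;`;

const parsed = parseNexusTrees(sample);
console.log(parsed.trees[0].newick);
console.log(parsed.translate);
```

The Newick string that comes out uses the numeric tokens from the TRANSLATE table, so a viewer would still need to swap those for the taxon names before labelling the tips.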

Why am I doing this?
Apart from "because it's fun", there are two reasons. The first is that I want a simple way to display phylogenetic trees in web pages, and doing this entirely in the web browser (Javascript parses the tree and renders it in SVG) saves me having to code this on my server. Being able to do this in the browser opens up the opportunity to embed tree descriptions in HTML, for example, and have the browser render the tree. This means the same web page can have machine-readable data (the tree description) but also generate a nice tree for the reader. As an aside, it also shows that TreeBASE could display perfectly good, interactive trees without resorting to a Java applet.
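As a rough idea of how little is needed to draw a tree in SVG, here is a sketch that turns a nested tree object into an SVG string. It is an illustration rather than the code from the gists: `treeToSvg`, `escapeXml`, the `{ label, children }` node shape, and the layout constants are all assumptions. Tips are spaced evenly down the page, each internal node sits midway between its first and last child, and horizontal position is simply depth from the root, so branch lengths are ignored.

```javascript
// Escape characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render a nested tree ({ label, children }) as an SVG cladogram string.
// Illustrative sketch: x is depth from the root, branch lengths are ignored.
function treeToSvg(root, width = 300, rowHeight = 20) {
  const margin = 10;
  const labelSpace = 100; // room reserved on the right for tip labels
  const position = new Map(); // node -> { depth, row }
  let nextRow = 0;
  let maxDepth = 0;

  // First pass: give each tip its own row, centre internal nodes on their children.
  function layout(node, depth) {
    const children = node.children || [];
    children.forEach((child) => layout(child, depth + 1));
    const row = children.length === 0
      ? nextRow++
      : (position.get(children[0]).row + position.get(children[children.length - 1]).row) / 2;
    position.set(node, { depth, row });
    maxDepth = Math.max(maxDepth, depth);
  }
  layout(root, 0);

  const xStep = (width - labelSpace - margin) / Math.max(maxDepth, 1);
  const x = (node) => margin + position.get(node).depth * xStep;
  const y = (node) => (position.get(node).row + 0.5) * rowHeight;

  // Second pass: one elbow-shaped path per edge, one text element per tip.
  const parts = [];
  function draw(node) {
    const children = node.children || [];
    for (const child of children) {
      parts.push(`<path d="M${x(node)} ${y(node)} V${y(child)} H${x(child)}" fill="none" stroke="black"/>`);
      draw(child);
    }
    if (children.length === 0) {
      parts.push(`<text x="${x(node) + 4}" y="${y(node) + 4}" font-size="12">${escapeXml(node.label || "")}</text>`);
    }
  }
  draw(root);

  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${nextRow * rowHeight}">${parts.join("")}</svg>`;
}

// Made-up example: ((A,B),C)
const exampleTree = {
  children: [
    { children: [{ label: "A" }, { label: "B" }] },
    { label: "C" },
  ],
};
const svg = treeToSvg(exampleTree);
console.log(svg);
```

In a browser the returned string could be assigned to the `innerHTML` of a container element, so the tree description can live in the page as data while the reader sees the drawing.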

The other reason is that the web seems to be moving to Javascript as the default language, and JSON as the standard data format. Instead of large chunks of "middleware" (written in a scripting language such as Perl, PHP, or, gack, Java) which is responsible for talking to databases on the server and sending static HTML to the web browser, we now have browsers that can support sophisticated, interactive interfaces built using HTML and Javascript. On the server side we have databases that speak HTTP (essentially removing the need for middleware), store JSON, and use Javascript as their programming language (e.g., CouchDB). In short, it's Javascript, Javascript, everywhere.