
Megascience platforms for biodiversity information: what's wrong with this picture?

The journal MycoKeys has published the following paper:

Triebel, D., Hagedorn, G., & Rambold, G. (2012). An appraisal of megascience platforms for biodiversity information. MycoKeys, 5(0), 45–63. doi:10.3897/mycokeys.5.4302

This paper contains a diagram that seems innocuous enough but which I find worrying:

[Figure: diagram from Triebel et al. (2012)]

The nodes in the graph are "biodiversity megascience platforms", the edges are "cross-linkages and data exchange". What bothers me is that if you view biodiversity informatics through this lens then the relationships among these projects become the focus. Not the data, not the users, nor the questions we are trying to tackle. It is all about relationships between projects.

I want a different view of the landscape. For example, below is a very crude graph of the kinds of things I think about, namely kinds of data and their interrelationships:

[Figure: Biodiversity]

What tends to happen is that this data landscape gets carved up by different projects, so we get separate databases of taxonomic names, images, publications, and specimens (these are the "megascience platforms" such as CoL, EOL, GBIF). This takes care of the nodes, but what about the edges, the links between the data? Typically what happens is lots of energy is expended on what to call these links, in other words, the development of the vocabularies and ontologies such as those curated by TDWG. This is all valuable work, but this doesn't tackle what for me is the real obstacle to progress, which is creating the links themselves. Where are the "megascience platforms" devoted to linking stuff together?

When we do have links between different kinds of data these tend to be within databases. For example, GenBank explicitly links sequences to publications in PubMed, and to taxa in the NCBI taxonomy database. All three (sequence, publication, taxon) have identifiers (accession number, PubMed id, taxon id, respectively) that are widely used outside GenBank (and, indeed, are the de facto identifiers for the bioinformatics community). Part of the reason these identifiers are so widely used is that GenBank is the only real "megascience platform" in the list studied by Triebel et al. It's the only one that we can readily do science with (think BLAST searches, think of the number of databases that have repurposed GenBank data, or build on NCBI services).

[Figure: GenBank]

Many of the questions we might ask can be formulated as paths through a diagram like the one above. For example, if I want to do phylogeography, then I want the path phylogeny -> sequence -> specimen -> locality. If I'm lucky the phylogeny is in a database and all the sequences have been georeferenced, but often the phylogeny isn't readily available digitally, so I need to map the OTUs in the tree to sequences, then track down the vouchers for those sequences, and then obtain the localities for those vouchers from, say, GBIF. Each step involves some degree of pain as we try to map identifiers from one database to those in another.

[Figure: Phylogeography]
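
To make the "question as a path" idea concrete, here is a minimal sketch in Python. The set of data kinds and the links between them (`LINKS`) are my assumptions for illustration, loosely following the diagram, and do not describe any existing database. The function `find_path` is a hypothetical helper that does a plain breadth-first search.

```python
from collections import deque

# Hypothetical links between kinds of data. This edge list is an assumption
# made for illustration, not a description of any real database.
LINKS = {
    "phylogeny": ["sequence"],
    "sequence": ["specimen", "publication", "taxon"],
    "specimen": ["locality", "publication", "taxon"],
    "publication": ["taxon"],
    "taxon": [],
    "locality": [],
}


def find_path(links, start, goal):
    """Breadth-first search for a shortest chain of links from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of links connects the two kinds of data


path = find_path(LINKS, "phylogeny", "locality")
print(" -> ".join(path))  # phylogeny -> sequence -> specimen -> locality
```

Finding the path in a toy graph like this is trivial. The hard part in practice is that each arrow stands for a mapping between identifiers in different databases, and those mappings mostly don't exist yet.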

If I want to do classical alpha taxonomy I need information on taxonomic names, concepts, publications, attributes, and specimens. The digital links between these are tenuous at best (where are the links between GBIF specimen records and the publications that cite those specimens, for example?).

[Figure: Taxonomy]

Focussing on so-called "platforms" is unfortunate, in my opinion, because it means that we focus on data and how we carve up responsibility for managing it (never mind what happens to data that lacks an obvious constituency). The platforms aren't what we should be focussing on; it is the relationships between data (and no, these are not the same as the relationships between the "platforms").

If I'd like to see one thing in biodiversity informatics in 2013 it is the emergence of a "platform" that makes the links the centre of its efforts. Because without the links we are not building "platforms", we are building silos.

iDigBio: You are putting identifiers on the wrong thing

The Integrated Digitized Biocollections (iDigBio) project aims to advance digitising US biodiversity collections. They recently published a GUID Guide for Data Providers. In the PDF document I read this:
It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only. (emphasis added)

My heart sank. There's nothing wrong with having identifiers for metadata (apart from inviting the death spiral that is metadata about metadata), but surely the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.

Now, identifiers for metadata can be useful. For example, there is a specimen of Parathemisto japonica in the National Museum of Natural History, Smithsonian Institution with the label "USNM 100988". The NMNH web site has a picture of the index card for this specimen:

[Image: index card for USNM 100988, from the NMNH web site]

This is an image of the metadata, not the specimen itself. We could link the metadata to this image, but of course we also want to link it to the actual specimen.

Specimens are the things we collect, preserve, dissect, measure, sequence, photograph, and so on. I want to link a specimen to the sequences that have been obtained from that specimen, I want to list the publications that cite that specimen, I want to be able to aggregate data on a specimen from multiple sources, I want to be able to add annotations including misidentifications, simple typos, or missing georeferencing.

Key to this is having identifiers for specimens. Identifiers for metadata about those specimens are not good enough. By analogy with bibliographic citation, one of the important decisions CrossRef made was that DOIs for articles identify the article, not the metadata about the article, or any of the different formats (HTML, PDF, print) an article may occur in. This means we can build databases about things and relationships (this article cites that one, these articles were authored by this person, etc.).

As it stands, if we don't have identifiers for specimens then we can't link data together. For example, the frog specimen "USNM 195785" is depicted in the image below (from EOL):

[Image: frog specimen USNM 195785, from EOL]

It is also listed in various papers in BioStor. In the absence of a globally unique identifier for this specimen how do I make these links? "USNM 195785" won't do because there are at least four specimens in the USNM with the catalogue number "195785". The GBIF occurrence id for this specimen (http://data.gbif.org/occurrences/244405570) would be an obvious candidate, were it not for the fact that GBIF has no concept of stable identifiers and its occurrence ids regularly change.
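
The catalogue-number problem is easy to show with a small Python sketch. Everything in `records` is hypothetical: the department names and the `specimen:N` identifiers are invented for illustration and are not real USNM data. The point is only that a label like "USNM 195785" can resolve to more than one specimen, whereas a per-specimen identifier cannot.

```python
from collections import defaultdict

# Hypothetical records: the departments and "specimen:N" identifiers are
# invented for illustration and do not correspond to real USNM data.
records = [
    {"guid": "specimen:1", "institution": "USNM", "department": "A",
     "catalogue_number": "195785"},
    {"guid": "specimen:2", "institution": "USNM", "department": "B",
     "catalogue_number": "195785"},
    {"guid": "specimen:3", "institution": "USNM", "department": "A",
     "catalogue_number": "100988"},
]


def index_by(records, key_fn):
    """Group specimen identifiers under whatever key key_fn derives."""
    index = defaultdict(list)
    for record in records:
        index[key_fn(record)].append(record["guid"])
    return dict(index)


# Key 1: the label as it is typically cited in a paper.
by_label = index_by(
    records, lambda r: f'{r["institution"]} {r["catalogue_number"]}'
)
# Key 2: an identifier assigned to the specimen itself.
by_guid = index_by(records, lambda r: r["guid"])

# Labels that point at more than one specimen cannot be used to make links.
ambiguous = {label: guids for label, guids in by_label.items() if len(guids) > 1}
print(ambiguous)  # {'USNM 195785': ['specimen:1', 'specimen:2']}
```

Adding the department to the key would separate these two toy records, but that only works if every source that cites the specimen records the department in the same way, which is exactly the kind of guesswork a globally unique specimen identifier avoids.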

I confess I'm flabbergasted that iDigBio has avoided tackling the issue of specimen identifiers. If any museum wants to discover how its collection is being used to support science it will want to find the citations of its specimens in scientific papers and databases. This requires identifiers for specimens.

Elsevier articles have interactive phylogenies

Say what you will about Elsevier, they are certainly exploring ways to re-imagine the scientific article. In a comment on an earlier post Fabian Schreiber pointed out that Elsevier have released an app to display phylogenies in articles they publish. The app is based on jsPhyloSVG and is described here. You can see live examples in these articles:

Matos-Maraví, P. F., Peña, C., Willmott, K. R., Freitas, A. V. L., & Wahlberg, N. (2013). Systematics and evolutionary history of butterflies in the “Taygetis clade” (Nymphalidae: Satyrinae: Euptychiina): Towards a better understanding of Neotropical biogeography. Molecular Phylogenetics and Evolution, 66(1), 54–68. doi:10.1016/j.ympev.2012.09.005
Poćwierz-Kotus, A., Burzyński, A., & Wenne, R. (2010). Identification of a Tc1-like transposon integration site in the genome of the flounder (Platichthys flesus): A novel use of an inverse PCR method. Marine Genomics, 3(1), 45–50. doi:10.1016/j.margen.2010.03.001