
iEvoBio 2012 Challenge: Synthesizing phylogenies

The iEvoBio 2012 Challenge has been announced, and the topic is synthesizing phylogenies. The task:

Somewhere, buried in large sets of trees, lies a stunning new revelation, a baffling discovery, the answer to a longstanding controversy, or simply something not obvious to the naked eye. The mission of the 2012 iEvoBio challenge is to find those revelations, discoveries and answers within your own data and/or within one of the datasets provided by the challenge. What new scientifically interesting results can you pull from these trees, using any combination of techniques at your disposal?


The rules of this challenge are:
  1. The set of trees you use must have at least 10,000 leaves in total. Acceptable entries could be a set comprising 2,500 distinct trees covering the same four taxa, a single tree with 10,000+ leaves, or anything in between.
  2. Your results must be scientifically new.
  3. The data, or at least a description of the data, must be publicly available. If working with your own dataset, you must at least provide a summary of the data you used (see below for the minimum description that must be provided).
  4. The source code of any tool and/or method developed as part of your challenge submission must be publicly downloadable under an OSI-approved open-source license (or dedicated to the public domain) at the latest by the time of the conference.


For more details see the challenge site. Deadline for submission is June 25, 2012.

Yet more reasons to have specimen identifiers: annotating GenBank sequences

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated. For example, every GenBank sequence has a unique accession number, so if you and I are discussing sequence DQ055738, you and I can be sure we are talking about the same thing.

Sequence DQ055738 is interesting because Hua et al. A Revised Phylogeny of Holarctic Treefrogs (Genus Hyla) Based on Nuclear and Mitochondrial DNA Sequences (http://dx.doi.org/10.1655/08-058R1.1 - note the nice identifier we have for this article) have suggested this sequence (published in http://dx.doi.org/10.1554/05-284.1, another nice identifier) is misidentified. Given these identifiers we could construct various statements, such as:


DQ055738 -> published in -> doi:10.1554/05-284.1
DQ055738 -> annotated by -> doi:10.1655/08-058R1.1

(I've omitted the http:// stuff to keep things legible). Hua et al. state the following:

However, the tissue number of this specimen (LSUMZ H-19067) is similar to that of a specimen of H. versicolor (LSUMZ H-19077), which appears to have been processed at the same time (C. Austin, personal communication). Therefore, we hypothesize that the sequence data for H. gratiosa used by Smith et al. (2005) were actually from H. versicolor.

It would be nice if we had unique, resolvable identifiers for LSUMZ H-19067 and LSUMZ H-19077 so that we could construct statements linking the sequence, the publications, and the specimens. But we don't. Nor is it obvious how to find out anything more about LSUMZ H-19067 and LSUMZ H-19077. By contrast, for the DOI or the sequence accession I know how to get more information, in either human- or machine-readable form.

The acronym LSUMZ in this case is the Louisiana State University Museum of Natural Science Herpetology collection (http://biocol.org/urn:lsid:biocol.org:col:34806). Just to confuse matters, LSUMZ specimens in GBIF use LSU as the acronym for Louisiana State University Museum of Natural Science. Given that GBIF's data comes from LSU itself, it's odd (but not surprising) that there's a muddle about which acronym to use (it would be nice to clear this up, but then anybody building identifiers based on those acronyms is in for some heartbreak).

If I look at GBIF LSUMZ records there aren't specimens with the catalogue numbers H-19067 or H-19077. However, after a bit of poking around, and a helpful file from GBIF's Tim Robertson, I discovered that the LSUMZ herpetology tissue numbers (which is what the H-* codes actually are) are stored in GBIF. The corresponding specimens are http://data.gbif.org/occurrences/45716232 (LSU Herp 84850, LSUMZ HerpNet Tissue 19067) and http://data.gbif.org/occurrences/45710033 (LSU Herp 84862, LSUMZ HerpNet Tissue 19077). (Note that Hua et al. tell the reader that LSU 84850 = LSUMZ H-19067, but don't give the specimen code for LSUMZ H-19077.)
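To make the lookup concrete, here is a rough Python sketch of resolving a tissue code to a GBIF occurrence URL. It is only an illustration: the `ACRONYM_ALIASES` table, the `normalise_tissue_code` and `resolve_tissue` helpers, and the decision to treat LSU and LSUMZ as the same institution are all assumptions made for this example, not something GBIF or LSU provides. The two occurrence URLs are the ones found above.

```python
import re

# Assumption: for lookup purposes, treat both acronyms as the same institution.
ACRONYM_ALIASES = {"LSU": "LSUMZ", "LSUMZ": "LSUMZ"}

# (institution, tissue number) -> GBIF occurrence URL, from the two records above.
TISSUE_TO_GBIF = {
    ("LSUMZ", "19067"): "http://data.gbif.org/occurrences/45716232",
    ("LSUMZ", "19077"): "http://data.gbif.org/occurrences/45710033",
}


def normalise_tissue_code(code):
    """Split a tissue code such as 'LSUMZ H-19067' into (institution, number)."""
    match = re.match(r"^\s*([A-Za-z]+)\s+H-(\d+)\s*$", code)
    if match is None:
        raise ValueError(f"Unrecognised tissue code: {code!r}")
    acronym, number = match.groups()
    acronym = acronym.upper()
    return ACRONYM_ALIASES.get(acronym, acronym), number


def resolve_tissue(code):
    """Return the GBIF occurrence URL for a tissue code, or None if unknown."""
    return TISSUE_TO_GBIF.get(normalise_tissue_code(code))


print(resolve_tissue("LSUMZ H-19067"))
print(resolve_tissue("LSU H-19077"))
```

The alias table is exactly the kind of patch the acronym muddle forces on anyone building identifiers from these codes, and it only works for as long as somebody keeps it up to date.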

Now I have some resolvable identifiers, so I could construct statements like:


DQ055738 -> voucher -> occurrences/45716232
DQ055738 -> voucher -> occurrences/45710033
|
+-> according to -> doi:10.1655/08-058R1.1

Let's skip over whether this is actually the best way to record the annotation. The point is that we can now start to construct statements that can be linked to the wider world. If someone else has made statements about these specimens, and they used the GBIF URL, then we could aggregate those and learn more about these specimens and their associated sequences. Without globally unique, stable, resolvable identifiers we are left to flounder around in the bowels of various databases searching for something that may or may not be the object being discussed. Isn't it time we did something about this?
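As a minimal sketch of what that aggregation might look like, the statements above can be held as (subject, predicate, object, source) tuples and indexed by every identifier they mention. The tuple layout and the `index_by_identifier` helper are assumptions for illustration, not an established annotation format. Attaching the "according to" source only to the second voucher statement is my reading of the diagram above.

```python
from collections import defaultdict

# (subject, predicate, object, source); source is None where none was recorded.
STATEMENTS = [
    ("DQ055738", "published in", "doi:10.1554/05-284.1", None),
    ("DQ055738", "annotated by", "doi:10.1655/08-058R1.1", None),
    ("DQ055738", "voucher", "http://data.gbif.org/occurrences/45716232", None),
    ("DQ055738", "voucher", "http://data.gbif.org/occurrences/45710033",
     "doi:10.1655/08-058R1.1"),
]


def index_by_identifier(statements):
    """Map each identifier to every statement that mentions it."""
    index = defaultdict(list)
    for statement in statements:
        subject, _predicate, obj, source = statement
        # A set avoids listing a statement twice if an identifier repeats.
        for identifier in {subject, obj, source} - {None}:
            index[identifier].append(statement)
    return index


index = index_by_identifier(STATEMENTS)
for statement in index["doi:10.1655/08-058R1.1"]:
    print(statement)
```

Anyone else's statements that use the same GBIF URLs or DOIs could be appended to the list and would land in the same index entries, which is the whole argument for shared identifiers.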

Making biodiversity data sticky: it's all about links

[Photo: Who invented velcro?]

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is that I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. The diagram below, which I published as part of my Elsevier Challenge entry (preprint, published version), summarises some of the links between diverse kinds of biological data:
[Figure: Model]
While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or, where such identifiers exist, the fact that they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:
[Figure: Links]
For example, the iPhylo Linkout project creates links between taxon concepts in NCBI and Wikipedia. The iTaxon project is a mapping between taxonomic names and publications. I've briefly explored mapping host-parasite relationships using GenBank, and I'm currently exploring the links between publications and specimens. This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.
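The "sticky" idea can be illustrated with a small Python sketch: records that share at least one identifier are merged into the same cluster, using a simple union-find. The record names and the `stick_together` function are hypothetical and exist only for this example. The identifiers are ones mentioned earlier on this page.

```python
def stick_together(records):
    """Group record names into clusters that share at least one identifier."""
    parent = {name: name for name in records}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # identifier -> first record that carried it
    for name, identifiers in records.items():
        for identifier in identifiers:
            if identifier in first_seen:
                parent[find(name)] = find(first_seen[identifier])
            else:
                first_seen[identifier] = name

    clusters = {}
    for name in records:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical records, each carrying the identifiers it mentions.
records = {
    "sequence record": {"DQ055738", "doi:10.1554/05-284.1"},
    "Hua et al. article": {"doi:10.1655/08-058R1.1", "DQ055738"},
    "GBIF occurrence": {"http://data.gbif.org/occurrences/45716232"},
    "tissue label": {"LSUMZ H-19067"},
}

for cluster in stick_together(records):
    print(sorted(cluster))
```

In this toy data the sequence record and the article stick together because both carry DQ055738. The GBIF occurrence and the tissue label stay on their own, even though they describe the same specimen, because nothing in the data links the local code to the GBIF URL.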