
Linking GBIF and GenBank

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique, they are written in all sorts of ways, and there can be multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or with field or collector numbers).
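To give a flavour of the "written in all sorts of ways" problem, here is a minimal Python sketch that reduces a few common spellings of a voucher code to an (institution, number) pair. The function name `normalise_voucher` and the patterns it accepts are my own assumptions for illustration; real voucher strings are far messier, and this does nothing about codes that are not unique.

```python
import re

# Free-text form: letters, optional separator, digits (e.g. "USNM 514547", "usnm-514547").
FREE_TEXT = re.compile(r"^([A-Z]+)[\s\-_.]*(\d+)$")

def normalise_voucher(raw):
    """Return (institution, number) for a voucher string, or None if no pattern fits.

    Hypothetical helper: it only understands two simple shapes.
    """
    text = raw.strip().upper()
    # Colon-separated form, e.g. "USNM:Herp:514547" or "USNM:514547":
    # keep the first and last parts, ignore any collection code in between.
    if ":" in text:
        parts = [p.strip() for p in text.split(":") if p.strip()]
        if len(parts) >= 2:
            return parts[0], parts[-1]
        return None
    match = FREE_TEXT.match(text)
    if match:
        return match.group(1), match.group(2)
    return None

for code in ["USNM 514547", "usnm-514547", "USNM:Herp:514547", "FMNH257697"]:
    print(code, "->", normalise_voucher(code))
```

Anything that doesn't fit either shape comes back as `None`, which is the honest answer for a lot of legacy codes.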

So why undertake what is fast looking like a hopeless task? There are several reasons:
  1. GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.
  2. Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. In the simplest terms this could be the number of sequences and publications that use data from the specimen; more sophisticated approaches could use PageRank-like measures (see hdl:10101/npre.2008.1760.1).
  3. Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't, we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind hdl:10101/npre.2009.3173.1).
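The third point, borrowing coordinates in either direction once two records are known to describe the same specimen, can be sketched in a few lines of Python. The records here are plain dicts and the field names `lat` and `lon` are hypothetical; neither GenBank nor GBIF is being queried, and the coordinates are made up.

```python
def share_coordinates(genbank, gbif):
    """Copy lat/lon from whichever record has them to the one that lacks them.

    Both arguments are dicts with optional 'lat' and 'lon' keys (assumed names).
    Returns a string naming the direction of the copy, or None if nothing changed.
    """
    def located(record):
        return record.get("lat") is not None and record.get("lon") is not None

    if located(gbif) and not located(genbank):
        genbank["lat"], genbank["lon"] = gbif["lat"], gbif["lon"]
        return "gbif->genbank"
    if located(genbank) and not located(gbif):
        gbif["lat"], gbif["lon"] = genbank["lat"], genbank["lon"]
        return "genbank->gbif"
    # Either both are georeferenced (possibly disagreeing) or neither is.
    return None

# Made-up records for illustration only.
sequence = {"accession": "SEQ0001", "lat": 9.5, "lon": -79.0}
occurrence = {"id": "occ-1"}
print(share_coordinates(sequence, occurrence), occurrence)
```

When both records already have coordinates the sketch deliberately does nothing, because deciding which one is right is a curation question rather than a merge.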

As an example, below is the GBIF 1° density map for the frog Pristimantis ridens, with the phylogeny from Wang et al., "Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central America" (http://dx.doi.org/10.1016/j.ympev.2008.02.021), layered over it. I created the KML tree from the corresponding tree in TreeBASE using the tool I described earlier. You can grab the KML for the tree here.

[Image: GBIF density map for Pristimantis ridens with the Wang et al. phylogeny overlaid]

As we'd expect, there is a lot of overlap in the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the GBIF KML file with individual placemarks we see that in the northern part of the range there are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.

[Image: GBIF occurrence placemarks coinciding with a terminal taxon in the tree]

One of these 15 GBIF records (http://data.gbif.org/occurrences/244335848) is for specimen USNM 514547, which is the voucher specimen for EU443175. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen, http://data.gbif.org/occurrences/244335848, instead of the unresolvable and potentially ambiguous USNM 514547.
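The USNM 514547 example suggests a simple join: index GBIF occurrences by institution plus catalogue number, then look up each GenBank voucher. The Python sketch below does that and refuses to pick a URI when a code matches more than one occurrence. The dict field names (`institution`, `catalogue_number`, `url`, `accession`, `voucher`) and the `XYZ 1` records are hypothetical; only the USNM 514547 / EU443175 pairing comes from the records discussed above.

```python
from collections import defaultdict

def key(code):
    """Crude matching key: uppercase, with everything but letters and digits removed."""
    return "".join(ch for ch in code.upper() if ch.isalnum())

def link_sequences(sequences, occurrences):
    """Map each accession to a GBIF occurrence URL, or None if no unique match exists."""
    index = defaultdict(list)
    for occ in occurrences:
        index[key(occ["institution"] + occ["catalogue_number"])].append(occ["url"])
    links = {}
    for seq in sequences:
        hits = index.get(key(seq["voucher"]), [])
        # Only accept a match when exactly one occurrence carries this code.
        links[seq["accession"]] = hits[0] if len(hits) == 1 else None
    return links

occurrences = [
    {"institution": "USNM", "catalogue_number": "514547",
     "url": "http://data.gbif.org/occurrences/244335848"},
    # Two made-up occurrences sharing one code, to show the ambiguous case.
    {"institution": "XYZ", "catalogue_number": "1", "url": "http://example.org/a"},
    {"institution": "XYZ", "catalogue_number": "1", "url": "http://example.org/b"},
]
sequences = [
    {"accession": "EU443175", "voucher": "USNM 514547"},
    {"accession": "HYPOTHETICAL1", "voucher": "XYZ 1"},
]
print(link_sequences(sequences, occurrences))
```

Returning `None` for the ambiguous code is a design choice: a wrong link is arguably worse than a missing one.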

If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF.

[Image: geophylogeny viewed from a different vantage point, showing localities without GBIF occurrences]

Close inspection reveals that some of the specimens listed in the Wang et al. paper are actually in GBIF, but lack geographic coordinates. For example, the OTU "Pristimantis ridens Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as http://data.gbif.org/occurrences/57919777/, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang et al. paper and the GenBank record for the sequence from this specimen (EU443164) give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.
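To use GenBank's coordinates we first have to turn them into numbers. A minimal Python sketch follows, assuming the lat_lon value has the usual "degrees N|S degrees E|W" decimal form; the function name `parse_lat_lon` is my own, and the coordinates in the example are illustrative rather than copied from EU443164.

```python
import re

# Matches strings such as "9.35 N 78.98 W" (decimal degrees plus hemisphere letters).
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)

def parse_lat_lon(value):
    """Return (latitude, longitude) in signed decimal degrees, or None if unparseable."""
    match = LAT_LON.match(value)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon

# Illustrative value only, not the coordinates stored in any particular record.
print(parse_lat_lon("9.35 N 78.98 W"))
```

South and west come out negative, which is the convention KML and most mapping tools expect.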

Part of GBIF's success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or using spatial coordinates (which immediately enables integration with environmental data). But if we want to integrate at deeper levels then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material examined). This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.

EOL Phylogenetic Tree Challenge

The Encyclopedia of Life have announced the EOL Phylogenetic Tree Challenge. The contest has two purposes:

  1. It provides a testbed for the Evolutionary Informatics community to develop robust methods for producing, serving, and evaluating large, biologically meaningful trees that will be useful both to the research community and to broader audiences.
  2. It enables the Encyclopedia of Life to organise the information it aggregates according to phylogenetic relationships; in other words, it provides a direct pipeline from research results to practical use.

First prize is a trip to iEvoBio 2012, this year in Ottawa, Canada. For more details visit the challenge website. There is also an EOL community devoted to this challenge.

Challenges are great things, especially ones with worthwhile tasks and decent prizes. EOL badly needs a phylogenetic perspective, so this is a welcome development.

But (there's always a but), I can't help feeling that we need something a little more radical. The tree of life isn't a tree. At deep levels it's a forest, and even at shallow levels things are a complicated tangle of gene trees. Sometimes the tree is clear, sometimes not, and some of this is real and some reflects our ignorance.

If you want a simple tree to navigate, then I'd argue that the NCBI tree is a pretty good start, and EOL already has this. What would be really cool is to have a way to navigate that makes it clear that phylogenetic knowledge has a degree of uncertainty, and that the "tree of life" might be better depicted as a set of overlapping trees. The mental image I have is of a collage of trees from different data sets, superimposed over each other, with perhaps an underlying consensus to help navigate. This visualisation could be zoomable, because in some ways the tree of life is fractal. Trees don't stop at species, as the wealth of barcoding and phylogeographic studies shows. Given computational constraints (not to mention visualisation issues), I wonder whether there is an effective limit to the size of any one tree in terms of number of taxa. What varies is the taxonomic scope. So we could imagine a backbone tree based on slowly evolving genes; as we zoom in more trees appear, but at lower levels, until finally we hit populations and individuals, with trees that may have hundreds of samples but a very narrow scope.

This is all rather poorly articulated, but I can't help wondering whether a phylogenetic classification will end up distorting the very thing we're trying to depict. It also loses connection with the underlying data (and trees), which for me is a huge drawback of existing classifications. There's no sense of why they are the way they are. There's a chance here to bring together ideas that have been kicking around in the phylogenetic community for a couple of decades and rethink how we navigate the "tree of life".

BLAST a sequence and get a tree and a map

I've updated the BLAST a sequence and get a tree tool described in a previous post to output additional details, such as a list of the sequences used to build the tree and some basic metadata (such as the taxon name, name of any associated host, publication, and geographic coordinates). If the sequences are geotagged, then you will also see a little map showing the localities. As ever, all this relies on SVG, so if your browser doesn't support that you won't see much.
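The "map only if geotagged" logic is simple enough to sketch. This Python fragment is not the tool's actual code; it just assumes each sequence is a dict with optional `lat` and `lon` keys (hypothetical names) and works out the bounding box a map would need to cover, or `None` when there is nothing to plot.

```python
def map_extent(sequences):
    """Return (min_lat, min_lon, max_lat, max_lon) for the geotagged sequences, or None."""
    points = [
        (s["lat"], s["lon"])
        for s in sequences
        if s.get("lat") is not None and s.get("lon") is not None
    ]
    if not points:
        return None  # no geotagged sequences, so no map
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats), min(lons), max(lats), max(lons))

# Made-up hits for illustration; the third has no coordinates and is ignored.
hits = [
    {"accession": "HIT1", "lat": 10.0, "lon": -84.0},
    {"accession": "HIT2", "lat": 8.5, "lon": -80.0},
    {"accession": "HIT3"},
]
print(map_extent(hits))
```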

The example below is for the sequence EU399074, which falls in a cluster of "dark taxa"; in this case, DNA barcode sequences that haven't been properly labelled.

[Image: tree and locality map for sequence EU399074]