Search this keyword

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

RefineGoogle Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.

The services I've implemented so far are:
  • EOL http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
  • NCBI taxonomy http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php
  • uBio FindIT http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php
  • WORMS http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php
  • GBIF http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php
  • Global Names Index http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php


To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".

How to use reconciliation services

Start a Google Refine session. Save the names below to a text file and open it as a new project.

Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae


You should see something like this:
Refine1

Click on the column header Names and choose ReconcileStart reconciling.

Refine2

A dialog will popup asking you to select a service.

Refine3

If you've already added a service it will be in the list on the left. If not, click the Add Standard Services... button at the bottom left and paste in the URL (in this case http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).

Once the service has loaded click on Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Refine4

Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the Names column header and choose Edit columnAdd column based on this column.... A dialog box will be displayed:

Refine6

In the box labelled Expression enter cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

Refine7

You could also get the names uBio extracted by creating a column based on the values of cell.recon.match.name. To compare this with the original values, click on the Names column header and choose ReconcileActionsClear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

Refine8

These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).

Development notes
The code for these services is in Github (undocumented as yet, that's on the to do list). I had a few hiccups getting these services to work. There is detailed documentation at http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. Once these issues were sorted out (turning on the Java console as per David Huynh's tip helped) things work pretty well.

Browsing TreeBASE using a genome browser-like interface

One of the things I find frustrating about TreeBASE is that there's no easy way to get an overview of what it contains. What is it's taxonomic coverage like? Is it dominated by plants and fungi, or are there lots of animal trees as well? Are the obvious gaps in our phylogenetic knowledge, or do the phylogenies it contains pretty much span the tree of life?

As part of my phyloinformatics course I've put together a simple browser to navigate through TreeBASE. The inspiration comes from genome browsers (e.g., the UCSC Genome Browser) where the genome is treated as a linear set of co-ordinates, and features of the genome are displayed as "tracks".

Hgt genome 596a ac7fe0

For my browser, I've used the order in which nodes appear in the NCBI tree as you go from left to right as the set of co-ordinates (actually, from top to bottom as my browser displays the co-ordinate axis vertically).

Browser

I then place each TreeBASE tree within this classification by taking the TreeBASE → NCBI mapping provided by TreeBASE and finding the "majority rule" taxon for each tree (in a sense, the taxa that summarises what the tree is about). Each tree is represented by a vertical line depicting the span of the corresponding NCBI taxon (corresponding to a "track" in a genome browser). Taking the majority-rule taxon rather than say, the span of the tree, makes it possible to pack the vertical lines tightly together so that they take up less space (the ordering from left to right is determined by the NCBI taxonomy).

If you mouse-over a vertical bar you can see the title of the study that published the tree. If you click on the vertical bar you'll see the tree displayed on the right (if your web browser understands SVG, that is). If you click on the background you will drill down a level in the NCBI classification. To go back up the classification, click on the arrow at the top left of the browser.

This is all very preliminary, but you can take it for a spin at http://iphylo.org/~rpage/phyloinformatics/treebase/.

Below is a short video walking you through some examples.

BLAST a sequence and get a tree

For this weeks sessions of my phyloinformatics course I'm developing some phylogeny tools. The first is a simple AJAX-based BLAST tool. I've always wanted a quick way to see a GenBank sequence in its phylogenetic context, so I've built a simple tool to that takes a GenBank accession number or GI number, submits a BLAST job, retrieves the sequences, aligns them using CLUSTALW, builds a quick and dirty neighbour-joining tree using PAUP*, then displays the tree using SVG (if your browser doesn't support this you won't see the tree). One use for this is to quikcly get a sense of whether an unnamed ("dark") taxon is related to sequences that have been identified.

Nothing fancy, but it was a chance to display the whole process in the browser without opening new windows or refreshing the page. Here's an example for the GenBank sequence FJ559186:



For the technically-minded, the calls to BLAST and the alignment and tree construction tools all use AJAX, and there's a simple Javascript timer to countdown the seconds that the NCBI BLAST web service estimates the BLAST job will take, before we poll NCBI to see if the job has in fact finished. The code is in GitHub.