
Tree of Life 0.1 - annotating the NCBI taxonomy

Last week I was at the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab, run by KnowInnovation.com. It was an interesting experience, essentially a structured week of brainstorming ideas.

One thing I came away with is the feeling that our notions of the "tree of life" are fuzzy, contradictory, and often probably unobtainable. It's tempting to imagine all sorts of wonderful visualisations, and lose sight of building something that is useful. Perhaps it's time instead to think of "Tree of Life version 0.1".

Imagine taking the NCBI taxonomy as a starting point. Yes, it's incomplete and has almost no fossils, but it's freely available and linked to a lot of data. Let's use a Google Maps-like viewer along the lines I explored earlier this year.

Then add annotation "tracks" to the tips. As a first pass these could be taken from the NCBI LinkOut service, such as the NCBI-Wikipedia mapping http://iphylo.org/linkout.

Ncbi 1

The NCBI tree is a classification rather than a phylogeny, so we could add greater phylogenetic content by linking to phylogenetic databases, such as TreeBASE and PhyLoTA. Imagine clicking on a node in the NCBI taxonomy and seeing a display of all the phylogenies centred on that node:

Ncbi 02

Now we have a way to navigate a large tree, view annotations, and display phylogenetic trees. All of this could be done fairly easily. The key is to have services keyed by the NCBI tax_id used to identify nodes on the tree.
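To make "services keyed by the NCBI tax_id" a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `TrackRegistry` class, the track names, and the in-memory dictionaries standing in for real services such as the NCBI-Wikipedia mapping or a phylogeny database. The only real-world facts used are the tax_ids themselves (9606 is Homo sapiens, 7227 is Drosophila melanogaster).

```python
from typing import Callable, Dict, List

# A "track" is any function that maps an NCBI tax_id to a list of annotations.
Track = Callable[[int], List[str]]


class TrackRegistry:
    """Hypothetical registry of annotation tracks keyed by NCBI tax_id."""

    def __init__(self) -> None:
        self._tracks: Dict[str, Track] = {}

    def register(self, name: str, track: Track) -> None:
        """Add a named track; a real track would wrap a web service call."""
        self._tracks[name] = track

    def annotate(self, tax_id: int) -> Dict[str, List[str]]:
        """Collect the non-empty annotations for one node from every track."""
        result: Dict[str, List[str]] = {}
        for name, track in self._tracks.items():
            hits = track(tax_id)
            if hits:
                result[name] = hits
        return result


# Illustrative in-memory data (made up, apart from the tax_ids themselves).
WIKIPEDIA = {9606: ["Human"], 7227: ["Drosophila melanogaster"]}
TREES = {9606: ["hypothetical-study-1", "hypothetical-study-2"]}

registry = TrackRegistry()
registry.register("wikipedia", lambda tax_id: WIKIPEDIA.get(tax_id, []))
registry.register("phylogenies", lambda tax_id: TREES.get(tax_id, []))

print(registry.annotate(9606))
print(registry.annotate(7227))
```

The point of the design is that the viewer never needs to know what a track contains; as long as a service can answer "what do you have for this tax_id?", it can be plugged in.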

Among the next steps would be to add additional "tracks", perhaps based on curated links analogous to the wiki-based NCBI-Wikipedia mapping. For example, very basic habitat data (marine or terrestrial) could be added, or geography, or host relationships (could be based in part on the data already in GenBank).

Given that the NCBI tree continues to grow, subsequent versions could be released as the tree changes. Or we could "fork" the NCBI tree and start to refine it based on phylogenetic information, and add taxa that aren't in the genome databases (these taxa will need consistent identifiers so we can map annotations on to them as well). Perhaps we could use something like Git to manage this tree, and to handle the necessary merging of updated versions of the NCBI tree. People could edit the tree, or indeed fork it and come up with their own.
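Whatever tool ends up managing the tree, merging a new NCBI release into a fork starts with knowing what changed. The sketch below is one possible way to do that comparison, assuming each version of the tree is held as a simple child-to-parent map of tax_ids (with the root pointing at itself). The `diff_trees` function and the toy ids are hypothetical, not part of any existing tool.

```python
from typing import Dict, NamedTuple, Set


class TreeDiff(NamedTuple):
    added: Set[int]    # tax_ids that appear only in the new version
    removed: Set[int]  # tax_ids that appear only in the old version
    moved: Set[int]    # tax_ids in both versions whose parent has changed


def diff_trees(old: Dict[int, int], new: Dict[int, int]) -> TreeDiff:
    """Compare two child -> parent maps of the same taxonomy."""
    old_ids, new_ids = set(old), set(new)
    added = new_ids - old_ids
    removed = old_ids - new_ids
    moved = {t for t in old_ids & new_ids if old[t] != new[t]}
    return TreeDiff(added, removed, moved)


# Toy trees with made-up ids; node 1 is the root and is its own parent.
old_tree = {1: 1, 2: 1, 3: 1, 4: 2, 6: 3}
new_tree = {1: 1, 2: 1, 3: 1, 4: 3, 5: 2}

diff = diff_trees(old_tree, new_tree)
print(diff)
```

A fork would then need a policy for each category, for example keeping local annotations on moved nodes and flagging annotations attached to removed ones for review.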

There are lots of ways to visualise trees (see TreeVis.net for some great examples), but what I'm after is a tool that is useful, that gives us a sense of what we know and what we don't. I suspect that one of the reasons we've struggled with visualising the tree of life is that there are lots of different notions about what it's for. In this case, I want a tool to navigate data about organisms, one that we can easily add annotations to.


I am not a number...I am an "ideator"

As part of the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab that I took part in earlier this week I had an assessment of my "problem solving style" carried out using a service called FourSight. I'm hugely sceptical of attempts to classify people (I'm unique, aren't I?), but I took the test and it turns out I am an "Ideator". FourSight's web site defines an Ideator as one who:

  • Likes to look at the big picture
  • Enjoys toying with ideas and possibilities
  • Likes to stretch his or her imagination
  • Enjoys thinking in more global and abstract terms
  • Takes an intuitive approach to innovation
  • May overlook details

Details schmetails, it's the big picture folks!

Ideators are:

  • Playful
  • Imaginative
  • Social
  • Adaptable
  • Flexible
  • Adventurous
  • Independent

Liking this. OK, how do you care for ideators? We need:

  • Room to be playful
  • Constant stimulation
  • Variety and change
  • The big picture

That's right, leave us alone to think our great thoughts. Result! Then there's this totally superfluous category "Ideators annoy others by...".

  • Drawing attention to themselves
  • Being impatient when others don’t get their ideas
  • Offering ideas that are too off-the-wall
  • Being too abstract
  • Not sticking to one idea

Utter, utter, nonsense. Look at my blog, it's full of ideas that have been developed fully... oh, wait. And, maybe the blog thing is a bit attention seeking, and I guess saying "it sucks" is a tad impatient, and saying to a crowd of taxonomists "haven't we basically found every species bigger than my coffee cup?" is a little off-the-wall.

Good job these psychometric thingies are clearly bogus.

Correcting OCR using hOCR in Firefox

Quick post on a little tool I came across, moz-hocr-edit. This Firefox add-on lets you proofread Optical Character Recognition (OCR) output. Given my interest in OCR and the Biodiversity Heritage Library I decided to take it for a spin.

moz-hocr-edit uses hOCR, a format for representing the output of OCR software that is used by tools such as OCRopus (you can see the public specification for hOCR here). Basically it's a microformat, that is, it's ordinary HTML with the OCR information carried in extra class and title attributes. Given some hOCR, moz-hocr-edit enables you to edit the OCR output line by line.
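To give a feel for what hOCR looks like to a program, here is a small sketch that pulls the text lines and their bounding boxes out of an hOCR fragment using only Python's standard library. It relies on two hOCR conventions: lines are elements with the class `ocr_line`, and the bounding box is given in the `title` attribute as `bbox x0 y0 x1 y1`. The sample markup and the `HocrLineParser` class are made up for illustration, and this is not how moz-hocr-edit itself is implemented.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]


def parse_bbox(title: str) -> Optional[BBox]:
    """Find the 'bbox x0 y0 x1 y1' property in an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[Tuple[Optional[BBox], str]] = []
        self._tag: Optional[str] = None  # tag name of the open ocr_line
        self._depth = 0                  # nesting depth of that tag name
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._tag is None:
            if "ocr_line" in (attr.get("class") or "").split():
                self._tag, self._depth = tag, 1
                self._bbox = parse_bbox(attr.get("title") or "")
                self._text = []
        elif tag == self._tag:
            self._depth += 1  # e.g. a word-level span inside a line span

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._text).split())
                self.lines.append((self._bbox, text))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


# Made-up hOCR fragment in the style of the demo document.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 1000 1500">
  <span class="ocr_line" title="bbox 100 200 900 240">Case 3368
    <span class="ocrx_word">Eatoniella</span> Dall, 1876</span>
  <span class="ocr_line" title="bbox 100 250 900 290; baseline 0 -5">proposed conservation</span>
</div>
"""

parser = HocrLineParser()
parser.feed(SAMPLE)
for bbox, text in parser.lines:
    print(bbox, text)
```

Because the format is just HTML, a corrected line can be written back into the same document without losing its position on the page, which is what makes line-by-line editing practical.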

Demo
I've created a simple demo based upon Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation. For the demo to work you will need to use the Firefox web browser with the moz-hocr-edit add-on installed.

  1. Go to http://dl.dropbox.com/u/639486/hocr/80780.html
  2. You will see a simple HTML representation of the OCR text from "Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation". I created this HTML from the original ABBYY FineReader XML from the Internet Archive.
  3. At the bottom right-hand corner of the Firefox browser window you should see hOCR. Click on it and select "Edit this hOCR document":
    Statusbar
  4. Firefox will open a new tab that will look something like this:
    Screenshot
  5. You can now edit individual lines of text, and see your edits applied to the HTML below.
moz-hocr-edit is a neat little tool. With appropriate web server settings (and, as the tool's author Jim Garrison suggests, autoversioning) it could be the basis of a great tool for correcting OCR errors in BHL.