Accounting Careers

Quick note about a tool I've cobbled together as part of the phyloinformatics course, which addresses a long-standing need I and others have: extracting specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it until now. It is very crude (basically a bunch of regular expressions), and there's a lot that could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say, Zootaxa and BioStor).

You can try the tool at http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php. Paste in some text and it will try to extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09) and some of the more common specimen numbering schemes.
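To give a flavour of the regular-expression approach, here is a minimal Python sketch. It is not the tool's actual code: the pattern, `expand_range` and `extract_specimens` are all hypothetical names made up for illustration, and the pattern only covers the simple "ACRONYM number" scheme, with an optional abbreviated range such as MHNSM 1808-09.

```python
import re

# Hypothetical, illustrative pattern (not the tool's own): an uppercase
# acronym of 2-6 letters, whitespace, a catalogue number, and an optional
# "-NN" suffix marking a range of specimens.
SPECIMEN_RE = re.compile(r"\b([A-Z]{2,6})\s+(\d+)(?:-(\d+))?\b")


def expand_range(start: str, end: str) -> list[str]:
    """Expand a possibly abbreviated range, e.g. 1808-09 -> 1808, 1809."""
    # An abbreviated end ("09") borrows its leading digits from the start ("1808").
    if len(end) < len(start):
        end = start[: len(start) - len(end)] + end
    first, last = int(start), int(end)
    if last < first:
        # Not a sensible range, so return the two numbers as they stand.
        return [start, end]
    return [str(n) for n in range(first, last + 1)]


def extract_specimens(text: str) -> list[str]:
    """Return the specimen codes found in text, with ranges expanded."""
    codes = []
    for acronym, start, end in SPECIMEN_RE.findall(text):
        # findall gives an empty string for the range group when it is absent.
        numbers = expand_range(start, end) if end else [start]
        codes.extend(f"{acronym} {number}" for number in numbers)
    return codes
```

Calling `extract_specimens("Holotype MHNSM 1808-09 and paratype KU 12345.")` gives `["MHNSM 1808", "MHNSM 1809", "KU 12345"]`. A pattern this loose will also match things that aren't specimens (any short run of capitals followed by a number), which is one reason a proper list of museum acronyms would help.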

Comments welcome. If you are looking for a source of text, papers in ZooKeys or Zootaxa are a good place to start (especially papers on vertebrates, where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor, click on the "Text" link to get the OCR text for the article and paste that into the tool's form. For example, the text for "Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica" is available at http://biostor.org/reference/97426.text.

The extraction tool can also be called as a web service using POST to get back the results in JSON.
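As a rough sketch of calling the service from Python, assuming the same URL as the web form accepts a form-encoded POST: the field name `text` is my guess rather than something documented here, and the shape of the JSON that comes back isn't specified either, so the code just decodes whatever is returned. `build_request` and `extract_codes` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# URL of the tool as given in the post; assumed to accept POST as well.
SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php"


def build_request(text: str, field: str = "text") -> urllib.request.Request:
    """Build a form-encoded POST request.

    The field name "text" is an assumption; check the form's source
    for the real parameter name.
    """
    body = urllib.parse.urlencode({field: text}).encode("utf-8")
    return urllib.request.Request(SERVICE_URL, data=body, method="POST")


def extract_codes(text: str):
    """Send text to the service and decode the JSON reply as-is."""
    with urllib.request.urlopen(build_request(text), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Something like `extract_codes("Holotype MHNSM 1808-09")` would then return the parsed JSON, provided the service is reachable and the parameter name is right.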

As part of a postgraduate course here at the University of Glasgow I'm teaching five sessions on "phyloinformatics", which I've decided to define broadly enough to encompass most of biodiversity informatics.

Given that this module is being developed on the fly, and will make use of lots of little "toys" I've developed and discussed on this blog, I've decided to put the course notes online, along with the interactive demos and the source code. So, if you want to follow along for the next couple of weeks, here are the links:

Course home page
Course notes and exercises (currently just the introductory session)
Source code on GitHub (including code for my EOL iPad webapp)

Each course page supports comments (see the bottom of the page), so feel free to add comments or suggestions. The notes are at a crude stage, and will be developed over the duration of the course (2 weeks). I'm also endeavouring to get all the source code for the demonstration apps into GitHub. None of these demos is polished, but they will hopefully provide some ideas for taking them further. There will be iSpecies-like mashups, iPad webapps, classification visualisations, TreeBASE search tools, geophylogenies and other phylogeny viewers.

Extracting museum specimen codes from text

Open course on phyloinformatics
