Today I finally got a project out the door. BioStor is my take on what an interface to the Biodiversity Heritage Library (BHL) could look like. It features the visualisations I've mentioned in earlier posts, such as Google maps based on extracted localities, and tag trees. It also has a modified version of my earlier BHL viewer.
There are a number of ideas I want to play with using BioStor, but the main goal of this site is to provide article-level metadata for BHL. As I've discussed earlier (see also Chris Freeland's post But where are the articles?), BHL has very little article-level metadata, making searching for articles a frustrating experience. BioStor aims to make this easier by providing an OpenURL resolver that tries to find articles in BHL.
BioStor supports the OpenURL standard, which means it can be used from within EndNote and Zotero. Web sites that support COinS (such as Drupal-based Scratchpads and EOL's LifeDesks) can also use BioStor (see http://biostor.org/referrer.php for details).
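To make the OpenURL idea concrete, here is a minimal sketch of how a client might turn a citation into a resolver query. The keys (genre, title, volume, spage, date) are standard OpenURL 0.1 article keys; the resolver address and the citation values are placeholders for illustration, not details taken from BioStor.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_openurl(resolver, title, volume, spage, date=None, genre="article"):
    """Build an OpenURL 0.1-style key/value query for a journal article.

    `resolver` is whatever base URL the resolver listens on (an assumption
    here; consult the site's own documentation for the real address).
    """
    params = {
        "genre": genre,
        "title": title,        # journal title
        "volume": str(volume),
        "spage": str(spage),   # starting page
    }
    if date is not None:
        params["date"] = str(date)
    return resolver + "?" + urlencode(params)


def read_openurl(url):
    """Decode an OpenURL query back into a dict (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical resolver and citation, for illustration only.
url = build_openurl("http://example.org/openurl", "Example Journal of Zoology", 12, 345, date=1901)
print(url)
```

A COinS-enabled page embeds the same key/value string in a span's title attribute, which is how a browser tool can hand the citation to a resolver of the reader's choice.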
My approach to finding articles in BHL is to take existing metadata from bibliographies and databases, and use this to search BHL using techniques ranging from reasonably elegant (Smith-Waterman alignment on words to match titles) to down-and-dirty regular expression matching. Since this metadata may contain errors, BioStor provides basic editing tools (using reCAPTCHA rather than user logins at this point).
There's much to be done: the article finding is somewhat error-prone, and the search requires a local copy of BHL, and mine is rather out of date. However, it is a start.
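For readers unfamiliar with Smith-Waterman alignment on words, the following is a rough sketch of the general technique, not BioStor's actual code. It treats each word as a symbol and finds the best local alignment between a known title and a page of OCR text. The scoring values, the tokenizer, and the function names are all assumptions chosen for illustration.

```python
import re

MATCH, MISMATCH, GAP = 2, -1, -1  # illustrative scores, not tuned values


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def smith_waterman_words(a, b):
    """Return the best local alignment score between two word lists."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + GAP, H[i][j - 1] + GAP)
            best = max(best, H[i][j])
    return best


def title_similarity(title, ocr_text):
    """Score in [0, 1]: 1.0 means every title word aligned in order."""
    words = tokenize(title)
    if not words:
        return 0.0
    return smith_waterman_words(words, tokenize(ocr_text)) / (MATCH * len(words))


print(title_similarity("On a new species of frog",
                       "BULLETIN 12 On a new specles of frog from Peru"))
```

Aligning on words rather than characters means a single OCR error costs one mismatch instead of derailing the whole match, and the local alignment ignores running heads or other text surrounding the title on the page.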
To get a flavour of BioStor, try browsing some references:
http://biostor.org/reference/1
http://biostor.org/reference/4
http://biostor.org/reference/12
or view information for a journal:
http://biostor.org/issn/0007-1498
or an author:
http://biostor.org/author/41
http://biostor.org/author/16
or a taxon name:
http://biostor.org/name/Atelophryniscus%20chrysophorus
BHL interface ideas
I've been buried in programming (and it's exam time at Glasgow) so I've not blogged for a month (gasp). I've been playing with ways to visualise Biodiversity Heritage Library content for a while (click here for a list of previous posts), and have occasionally surfaced to tweet a screenshot via twitpic. The more I play with the BHL content the more I think it's a gold mine, and that many of the ideas I played with for my ill-fated Elsevier Challenge entry (website here, background paper at hdl:10101/npre.2009.3173.1) are taking on a new life with this project.
I'm hoping to release my BHL article finding and visualising web site by the end of the month, but meantime I'm gathering the screenshots here.
The first shows a Google map generated from latitudes and longitudes extracted from OCR text using some simple regular expressions from page 7705952 in the BHL. There's quite a bit of latitude and longitude information in BHL, and that's before trying georeferencing tools.
The idea is to display this map next to the article so that the user gets an immediate sense of what region in the world the article covers, such as this article about Riekia wasps:
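As an illustration of the kind of simple regular expression involved, here is a minimal sketch that pulls degree/minute/second coordinates with a hemisphere letter out of text and converts them to decimal degrees. The pattern and the pairing rule (a latitude followed by the next longitude) are assumptions made for this example; real OCR text will need more forgiving patterns.

```python
import re

# Matches things like 18°30'S, 47°15'E or 5°N. Minutes and seconds are optional.
COORD = re.compile(
    r"""
    (?P<deg>\d{1,3}) \s* ° \s*             # degrees
    (?: (?P<min>\d{1,2}) \s* ['′] \s* )?   # optional minutes
    (?: (?P<sec>\d{1,2}) \s* ["″] \s* )?   # optional seconds
    (?P<hem>[NSEW]) \b                     # hemisphere letter
    """,
    re.VERBOSE,
)


def to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in "SW" else value


def extract_points(text):
    """Pair each latitude with the next longitude found in the text."""
    points = []
    pending_lat = None
    for m in COORD.finditer(text):
        value = to_decimal(int(m["deg"]), int(m["min"] or 0), int(m["sec"] or 0), m["hem"])
        if m["hem"] in "NS":
            pending_lat = value
        elif pending_lat is not None:
            points.append((pending_lat, value))
            pending_lat = None
    return points


print(extract_points("collected at 18°30'S, 47°15'E near the river"))
```

The resulting (latitude, longitude) pairs are what a mapping library would need as marker positions.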

I'm also interested in useful ways to display search results. Here's an experiment using TileBars to visualise how relevant a search result is. The width of the bar is a function of how many pages are in the article; the vertical stripes indicate pages that have the search term. The idea is to get a quick visual impression of whether the article mentions the term in passing, or treats it in some detail.
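The bar described above can be sketched in a few lines as a text rendering: one cell per page, filled where the page contains the search term. This is a simplified stand-in for the graphical version, and the function name and characters used are arbitrary choices for illustration.

```python
def tilebar(page_count, hit_pages, hit="#", miss="."):
    """Return one character per page, marking pages that contain the term.

    Pages are numbered from 1 within the article; `hit_pages` is any
    iterable of page numbers where the search term was found.
    """
    hits = set(hit_pages)
    return "".join(hit if page in hits else miss for page in range(1, page_count + 1))


# A 12-page article that mentions the term once, versus one that dwells on it.
print(tilebar(12, [7]))
print(tilebar(12, [2, 3, 4, 5, 8, 9]))
```

A bar with a single stripe suggests a passing mention, while a run of adjacent stripes suggests the article treats the term in detail.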

TileBars were developed by Marti Hearst, whose web site has some great resources. Partly inspired by her BioText project, as well as the thumbnail page display in JSTOR, I'm now experimenting with showing thumbnails in search results. For example, here's a search for the deep sea octopus Graneledone pacifica, showing two articles:

I display thumbnails for pages that (a) have the name on the page, and (b) have what look like figure captions on them. The idea is that an article that figures a taxon is likely to be a fairly important article to look at, so displaying thumbnails will highlight those articles. The second article in the search results is the paper that published the name Graneledone pacifica, and the figures illustrate the taxon.
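The two conditions can be expressed as a small filter over page text. The sketch below is only an illustration of the rule: the caption pattern (a line starting with "Fig.", "Figure", "Plate" or "Pl." followed by a number) and the dictionary of page text are assumptions, and plates numbered with Roman numerals would need a broader pattern.

```python
import re

# A line that looks like the start of a figure or plate caption.
CAPTION = re.compile(r"^\s*(?:Fig(?:ure|\.)?|Plate|Pl\.)\s*\d+", re.IGNORECASE | re.MULTILINE)


def pages_to_thumbnail(pages, name):
    """Return ids of pages that mention `name` and contain a caption-like line.

    `pages` maps a page id to that page's OCR text.
    """
    wanted = name.lower()
    return [
        page_id
        for page_id, text in pages.items()
        if wanted in text.lower() and CAPTION.search(text)
    ]


# Made-up page text, for illustration only.
pages = {
    1: "Graneledone pacifica is described from deep water.",
    2: "Fig. 3. Graneledone pacifica, dorsal view.",
    3: "Fig. 1. Some other animal.",
}
print(pages_to_thumbnail(pages, "Graneledone pacifica"))
```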
These are all pretty rough, but they give some idea of what I've been working on over the last month.
Labels: BHL, twitter, visualisation
Tag trees: displaying the taxonomy of names in BHL
I've added a feature to my Biodiversity Heritage Library viewer that should help make sense of the names found on a page. Until now I've displayed them as a list of "tags", which ignores the relations among the names. Based on some code I'd developed for my e-Biosphere 09 challenge entry, I've added a "tag tree" that displays the classification of the names found on a BHL page:

The idea is that a set of names can make much more sense if you know what kind of organism they are referring to. For example, I don't know what Onetes is, but if I look at BHL page 2298380 I can see that it's an insect:

The names in gray don't occur on the page, but do occur in the tree that links those names (the latter are highlighted in black). The tag tree can be useful for separating out host and parasite, e.g. BHL page 2298491 is about a flea and its mammalian hosts:

The tag tree can also flag names that might be mistaken, such as those found on page 2298330:

This page has names of some grasshoppers from Madagascar, as well as the name of a butterfly (Tsaratanana), which seems a little odd. Looking at the text, we discover that "Tsaratanana" is Mont. Tsaratanana, a mountain in Madagascar. It would be fun to develop tools to annotate such cases so that somebody looking for the butterfly won't be presented with this page.
How it works
The inspiration for this tag tree came from several sources. David Remsen has often used an example of finding a fly name in the middle of a book on birds as being of interest, and the NCBI have a subtree view of taxa in a PubMed article. My own tag tree is constructed by finding for each name the ancestor-descendant path in a local, modified copy of the Catalogue of Life database, then assembling those paths into a tree. Because not all the names on a BHL page are in the Catalogue of Life, there may be names that aren't classified. These are simply listed below the tag tree (see image above).
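The path-merging step described above can be sketched as follows. This is an illustration under stated assumptions rather than the viewer's code: a plain dictionary stands in for the Catalogue of Life lookup, the toy classification is abbreviated, and the function names are invented for the example.

```python
def build_tag_tree(paths):
    """Merge root-to-name paths into one nested dict of taxa."""
    tree = {}
    for path in paths:
        node = tree
        for taxon in path:
            node = node.setdefault(taxon, {})
    return tree


def tag_tree(names, classification):
    """Split names into a merged tree and a list of unclassified names.

    `classification` maps a name to its ancestor-descendant path and is a
    stand-in for a lookup against a local classification database.
    """
    paths, unclassified = [], []
    for name in names:
        if name in classification:
            paths.append(classification[name])
        else:
            unclassified.append(name)
    return build_tag_tree(paths), unclassified


def render(tree, on_page, depth=0):
    """Indented text view: '*' marks names on the page, '-' marks linking taxa."""
    lines = []
    for taxon in sorted(tree):
        marker = "*" if taxon in on_page else "-"
        lines.append("  " * depth + marker + " " + taxon)
        lines.extend(render(tree[taxon], on_page, depth + 1))
    return lines


# Toy, abbreviated classification for illustration only.
classification = {
    "Pulex": ["Animalia", "Arthropoda", "Insecta", "Pulex"],
    "Rattus": ["Animalia", "Chordata", "Mammalia", "Rattus"],
}
names = ["Pulex", "Rattus", "Unknownus"]
tree, unclassified = tag_tree(names, classification)
print("\n".join(render(tree, set(names))))
print("Unclassified:", unclassified)
```

Because shared ancestors collapse into a single node, a flea and its mammalian host end up on separate branches under a common root, which is what makes the host/parasite split visible at a glance.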

Labels: BHL, Catalogue of Life, tag tree, tags