
Showing posts with label BioStor. Show all posts

New look Biodiversity Heritage Library launched


The new look Biodiversity Heritage Library has just launched. It's a complete refresh of the old site, based on the Biodiversity Heritage Library–Australia site. If you want an overview of what's new, BHL have published a guide to the new look site. Congrats to everyone involved in the relaunch.

One of the new features draws on the work I've been doing on BioStor. The new BHL interface adds the notion of "parts" of an item, which you can see under the "Table of Contents" tab. For example, the scanned volume 109 of the Proceedings of the Entomological Society of Washington now displays a list of articles within that volume:

[Screenshot of the new BHL interface showing the Table of Contents tab]
This means you can now jump to individual articles. Before, you had to scroll through the scan, or click through page numbers until you found what you were after. The screenshot above shows the article "Three new species of chewing lice (Phthiraptera: Ischnocera: Philopteridae) from Australian parrots (Psittaciformes: Psittacidae)". The details of this article have been extracted from BioStor, where this article appears as http://biostor.org/reference/55323. You can go directly to this article in BHL using the link http://www.biodiversitylibrary.org/part/69723. As an aside, I've chosen this article because it helps demonstrate that BHL has modern content as well as pre-1923 literature, and this article names a louse, Neopsittaconirmus vincesmithi, after a former student of mine, Vince Smith. You're nobody in this field unless you've had a louse named after you ;)

BioStor has over 90,000 articles, but this is a tiny fraction of the articles contained in BHL content, so there's a long way to go until the entire archive is indexed to article level. There will also be errors in the article metadata derived from BioStor. If we invoke Linus's Law ("given enough eyeballs, all bugs are shallow") then having this content in BHL should help expose those errors more rapidly.

As always, I have a few niggles about the site, but I'll save those for another time. For now, I'm happy to celebrate an extraordinary, open access archive of over 40 million pages. BHL represents one of the few truly indispensable biodiversity resources online.

BioStor in the cloud

Quick note on an experimental version of BioStor that is (mostly) hosted in the cloud. BioStor currently runs on a Mac Mini and uses MySQL as the database. For a number of reasons (it's running on a Mac Mini, and my knowledge of optimising MySQL is limited) BioStor is struggling a bit. It's also gathered a lot of cruft as I've worked on ways to map article citations to the rather messy metadata in BHL.

So, I've started to play with a version that runs in the cloud using my favourite database, CouchDB. The data is hosted by Cloudant, which now provides full text search powered by Lucene. Essentially, I simply take article-level metadata from BioStor in BibJSON format and push that to Cloudant. I then wrote a simple wrapper around querying CouchDB, coupled that with the DocumentCloud Viewer to display articles and citeproc-js to format the citations (not exactly fun, but someone is bound to ask for them), and we have a simple, searchable database of literature.
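The push step is trivially small. A minimal sketch, assuming a hypothetical Cloudant database URL and an illustrative BibJSON record (BioStor's actual export fields may differ):

```python
import json
import urllib.request

def bibjson_to_couch_doc(record):
    """Wrap a BibJSON article record as a CouchDB document, using the
    BioStor reference id as _id so that re-pushing the same article
    updates the existing document rather than creating a duplicate."""
    doc = dict(record)
    doc["_id"] = "biostor/" + str(record["id"])
    return doc

def push(doc, db_url="https://example.cloudant.com/biostor"):
    """POST one document to a CouchDB/Cloudant database (URL is hypothetical)."""
    req = urllib.request.Request(
        db_url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

article = {
    "id": 55323,
    "title": "Three new species of chewing lice (Phthiraptera: Ischnocera: "
             "Philopteridae) from Australian parrots (Psittaciformes: Psittacidae)",
    "journal": {"name": "Proceedings of the Entomological Society of Washington"},
}
doc = bibjson_to_couch_doc(article)
# push(doc)  # uncomment with a real Cloudant URL and credentials
```

Once the documents are in, Cloudant's Lucene-backed search just needs an index definition over the fields you want searchable.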

If you want to try the cloud-based version go to http://biostor-cloud.pagodabox.com/ (code on Github).

[Screenshot of the cloud-based BioStor interface]

I've been wanting to do this for a while, partly because this is how I will implement my entry in EOL's computational data challenge, but also because CrossRef's Metadata search shows the power of finding references simply by using full text search (I've shamelessly borrowed some of the interface styling from Karl Ward's code). David Shorthouse demonstrates what you can do using CrossRef's tool in his post Conference Tweets in the Age of Information Overconsumption. Given how much time I spend trying to parse taxonomic citations and match them to articles in CrossRef's database, or BioStor, I'm looking forward to making this easier.

There are two major limitations of this cloud version of BioStor (apart from the fact it has only a subset of the articles in BioStor). The first is that the page images are still being served from my Mac Mini, so they can be a bit slow to load. I've put the metadata and the search engine in the cloud, but not the images (we're talking a terabyte or two of bitmaps).

The other limitation is that there's no API. I hope to address this shortly, perhaps mimicking the CrossRef API so if one has code that talks to CrossRef it could just as easily talk to BioStor.

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recently added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994) http://biostor.org/reference/105389

This paper describes a new frog (Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for the Queensland Museum), which, according to the local copy of GBIF data that I have, corresponds to http://data.gbif.org/occurrences/363089399/. Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discovered that the specimen is now in GBIF as http://data.gbif.org/occurrences/478001337/. At some point GBIF has updated its data and the old occurrenceID for QM J58650 (363089399) has been deleted. Noooo!

Looking at the old record I have there is an additional identifier:
urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid as it contains a space. This is why URNs are useless. There's no expectation they will be resolvable hence there's no incentive to make sure they are correct. It's as much use as writing software code but not bothering to run it (because surely it will work, no?).
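For what it's worth, even a crude syntax check catches this. A rough sketch of the RFC 2141 grammar (a simplification, not a full validator):

```python
import re

# Rough RFC 2141 shape: "urn:" NID ":" NSS. The NSS character set is
# restricted (colons are allowed, spaces are not); this is a simplification.
URN_RE = re.compile(
    r"^urn:[A-Za-z0-9][A-Za-z0-9-]{0,31}:"   # namespace identifier (NID)
    r"[A-Za-z0-9()+,\-.:=@;$_!*'%/?#]+$"     # namespace-specific string (NSS)
)

def is_valid_urn(s):
    return URN_RE.match(s) is not None

print(is_valid_urn("urn:catalog:QM: Herpetology:J58650"))  # False: embedded space
print(is_valid_urn("urn:catalog:QM:Herpetology:J58650"))   # True
```

If URNs had to pass through a resolver, a malformed one like this would fail immediately instead of sitting unnoticed in a database.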

The GBIF record http://data.gbif.org/occurrences/478001337/ contains a UUID as an alternative identifier:
bc58ce6b-3cc3-459a-9f5b-4a70a026afbe

If you Google this you discover a record in the Atlas of Living Australia http://biocache.ala.org.au/occurrences/bc58ce6b-3cc3-459a-9f5b-4a70a026afbe, which also lists the URN from the now deleted GBIF record http://data.gbif.org/occurrences/363089399/.

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. Meantime I need to think about mechanisms to handle links that disappear on a whim.

70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, resulting in text and figures often being widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original description in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...

The GBIF classification is broken — how do we fix it?

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help link to GBIF occurrence records.

The GBIF backbone classification ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.

On the page for the article "Further additions to the known marine Molluscan fauna of St. Helena" (http://biostor.org/reference/88554, see also http://dx.doi.org/10.1080/00222939208677383) part of the classification looks like this:

└Animalia
└Annelida
└Polychaeta
└Sabellida
└Serpulidae
└Hipponyx
Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification appears in the annelid worms.

Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of Hipponyx and Hipponix under the genus Hipponix, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, Hipponix credited to DeFrance 1819 Journ. de Physique, 88, 217, and Hipponyx credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (Mémoire sur un nouveau genre de mollusque) and Bull. Sci. Soc. philom. Paris, 8. (Sur un nouveau genre de coquilles (Hipponix)). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling Hipponix (no 'y'). I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).

OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem to be in use (see the Google Ngrams chart below).

[Google Ngrams chart comparing the frequency of "Hipponix" and "Hipponyx"]
So, bit of a mess, but this still doesn't deal with Hipponyx being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a specimen record from Te Papa labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:

Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. doi:10.1080/00288306.1971.10426332

with the following abstract:
An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.

The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.

The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.

The author of the paper (Charles Fleming) argues that Hipponyx inexpectata, regarded as a mollusc by its describer (Marjorie K. Mestayer, see Notes on New Zealand Mollusca. No. 4.) is actually a worm, and he moves it to the genus Temporaria.

So it seems that the reason Hipponyx has ended up being a worm in the GBIF classification is due to this synonymy.

Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, some cases may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so we fix it (and so that people don't think Hipponyx is a worm). As has been said several times before, most eloquently by David Shorthouse, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.

GBIF specimens in BioStor: who are the top ten museums with citable specimens?

Brief update on yesterday's post about finding specimens in BioStor. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text. Of these 143,000 occurrences, 81,000 have been matched to an occurrence in GBIF.

The top ten collections with specimens in BioStor are:

Dataset | Number of specimens
NMNH Vertebrate Zoology Herpetology Collections (National Museum of Natural History) | 11194
Herpetology Collection (University of Kansas Biodiversity Research Center) | 9619
Herpetology Collection (University of Kansas Biodiversity Research Center) | 9328
NMNH Invertebrate Zoology Collections (National Museum of Natural History) | 9061
CAS Herpetology Collection Catalog (California Academy of Sciences) | 6720
MCZ Herpetology Collection (Museum of Comparative Zoology, Harvard University) | 5818
NMNH Vertebrate Zoology Fishes Collections (National Museum of Natural History) | 4642
MCZ Herpetology Collection - Reptile Database (Museum of Comparative Zoology, Harvard University) | 4380
FMNH Herpetology Collections (Field Museum) | 2110
FMNH Fishes Collections (Field Museum) | 2061


This is pretty much what I expected. Virtually complete runs of publications from The Field Museum at Chicago, the University of Kansas, and the Biological Society of Washington are available in BHL, and many of these have been added to BioStor. These journals have extensive taxonomic treatments of vertebrate taxa, particularly frogs, hence herpetology collections dominate the rankings.

There will inevitably be errors in the mapping between specimen codes and GBIF occurrences. I've tried to minimise these by mapping codes within taxonomic groups, but it's clear that there are duplicate codes even within some collections. There is also all manner of variation in the way people cite museum specimens, and these are often different from the codes that appear in GBIF. There will also be issues with extracting specimen codes, and I'm also discovering a few *cough* duplicates of articles in BioStor, so the numbers I present above are liable to change as I clean things up.

But one could imagine a "league table" of museum collections, where we can measure both the extent to which those collections have been digitised, and the extent to which material from those collections have been cited. We could use this to compute measures of the impact of a collection.

But for now I'm browsing the results trying to get a sense of how successful the mapping has been. There are some interesting examples. The specimen codes extracted from the article Review Of The Chewing Louse Genus Abrocomophaga (Phthiraptera : Amblycera), With Description Of Two New Species are those for the mammalian hosts of the lice. Hence someone viewing the records for these specimens and following the link to this paper would discover that these mammals had parasitic lice. If we add other sorts of links to the mix, such as between specimens and DNA sequences, then we can start to build a rich network of connections between the basic data of biodiversity.


Linking GBIF and the Biodiversity Heritage Library

Following on from exploring links between GBIF and GenBank here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identity taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates. Given that many articles in BioStor list museum specimens I wrote some code to extract these (see Extracting museum specimen codes from text) and applied this to the OCR text for those articles.
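The extraction itself boils down to pattern matching. A deliberately simplified sketch (real specimen citations are much messier, with ranges, suffixes, and institution-specific quirks, and the institution codes here are just a small sample):

```python
import re

# A few well-known institution codes; the real list is much longer.
SPECIMEN_RE = re.compile(r"\b(USNM|FMNH|MCZ|AMNH|CAS|KU|QM)\s+([A-Z]?\d+)\b")

def extract_codes(text):
    """Return specimen codes such as 'FMNH 147942' found in OCR text."""
    return [f"{m.group(1)} {m.group(2)}" for m in SPECIMEN_RE.finditer(text)]

codes = extract_codes("Holotype QM J58650; paratypes FMNH 147942, USNM 730715.")
print(codes)
```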

Having a list of specimens is nice, but in this digital age I want to be able to find out more about these specimens. An obvious solution is to try and match these specimen codes to the specimen records held by GBIF. Linking to GBIF is complicated by the fact that museum codes are not unique. For example, "FMNH 147942" could refer to a bird, an amphibian, or a mammal. To tackle this non-uniqueness I use the taxonomic names extracted from each page by BHL to work out what taxon an article is mainly "about". To do this I use the Catalogue of Life classification to get "paths" for each name (i.e., the lineage of each taxon down to the root of the classification) and then find the majority-rule path. You can see these paths in the "Taxonomic classification" displayed on the page for a BioStor article. If there are multiple GBIF specimens with the same code I test whether the taxon of rank "class" in the GBIF record is in the majority-rule path for the article. If so, I accept that specimen as the match to the code.
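A sketch of this disambiguation step (the data structures and field names are my own illustration, not the actual BioStor code):

```python
from collections import Counter

def majority_rule_path(paths):
    """Given root-to-tip classification paths, one per taxonomic name
    found in the article, return the taxa present in a majority of them.
    In a tree classification these taxa form a single nested chain."""
    n = len(paths)
    counts = Counter(t for p in paths for t in set(p))
    majority = {t for t, c in counts.items() if c > n / 2}
    depth = {}
    for p in paths:
        for i, t in enumerate(p):
            depth.setdefault(t, i)
    return sorted(majority, key=lambda t: depth[t])

def match_specimen(candidates, article_path):
    """Pick the GBIF record whose class lies on the article's
    majority-rule path; give up if the match is not unique."""
    hits = [c for c in candidates if c.get("class") in article_path]
    return hits[0] if len(hits) == 1 else None

# Paths for names found in a (mostly herpetological) article
paths = [
    ["Animalia", "Chordata", "Amphibia", "Anura"],
    ["Animalia", "Chordata", "Amphibia"],
    ["Animalia", "Chordata", "Aves"],
]
consensus = majority_rule_path(paths)
# Two GBIF records sharing one specimen code (records are illustrative)
candidates = [{"class": "Aves", "gbifID": 1}, {"class": "Amphibia", "gbifID": 2}]
best = match_specimen(candidates, consensus)
```

Here the consensus path is Animalia → Chordata → Amphibia, so the amphibian record wins over the bird.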

There are also issues where the specimen codes in GBIF have been modified during input (e.g., USNM 730715 has become USNM 730715.457409). There are also the inevitable OCR errors that may cause museum codes to be missed or otherwise corrupted. Bearing all this in mind, BioStor now has specimen pages (these are still being generated as I write this). For example, the page for FMNH 147942 lists the three articles in BioStor that cite this specimen code:

[Screenshot of the BioStor specimen page for FMNH 147942]

All three specimens have been mapped on to GBIF occurrence http://data.gbif.org/occurrences/61846037/. When BioStor displays the articles it now lists the specimen codes that have been extracted from the article, together with the GBIF logo if the specimen has been matched to a GBIF record. For example, here is a screenshot from Deep-water octopods (Mollusca: Cephalopoda) of the northeastern Pacific:
[Screenshot of the BioStor page for the article, showing map and specimen codes]

The map has been extracted from the OCR text (an obvious next step would be to add localities associated with the specimen records). Below the map are the specimen codes. The lack of some USNM specimens is probably due to misinterpreted specimen codes, whereas the CAS specimens don't seem to be online (the California Academy of Sciences has some of its collections in GBIF, but not its molluscs).

Where next?
Once these links between BioStor (and hence BHL) and GBIF are created we can do some interesting things. If you visit BioStor and want to learn more about a specimen you can click on the link and view the record in GBIF. We could also envisage doing the reverse: GBIF could augment the information it displays about a specimen with a link to the content in BioStor (e.g., "this specimen is cited by these articles"). Those articles may contain further information about that specimen (for example, the habitat it was collected from, how secure its identification is, and so on).

We could also start to compute the "impact" of different museum collections based on the number of citations of specimens from their collections (this idea is explored further in this paper: http://dx.doi.org/10.1093/bib/bbn022, free preprint available here: hdl:10101/npre.2008.1760.1).

All of this works because we are linking objects (in this case articles and specimens) via their identifiers. Consequently, the links are as stable as their identifiers, which is why I've been pursuing the issue of specimen identifiers recently (see here, here, and here). If GBIF maintains the URLs for the specimens I've linked to, then links I've created could persist. If these URLs are likely to change (e.g., because the metadata from the host institution has changed) then the links (and any associated value we get from them) disappear. This is why I want globally unique, resolvable, persistent identifiers for specimens.




Adding article-level metadata to BHL

Recently I've been thinking about the best ways to make article-level metadata from BioStor more widely available. For example, for someone visiting the BHL site there is no easy way to find articles, which are the basic unit for much of the scientific literature. How hard would it be to add articles to BHL? In the past I've wanted an all-singing all dancing article-level interface to BHL content (sort of BioStor on steroids), but that's a way off, and ideally would have a broader scope than BHL. So instead I've been thinking of ways to add articles to BHL without requiring a lot of re-engineering of BHL itself.

Looking at other digital archive projects like Gallica and Google Books it strikes me that if the BHL interface to a scanned item had a "Contents" drop down menu then users would be able to go to individual articles very easily. Below is a screen shot of how Gallica does this (see http://gallica.bnf.fr/ark:/12148/bpt6k61331684/f57).

[Screenshot of Gallica's Contents drop-down menu]

There's also a screenshot of something similar in Google Books (see http://books.google.co.uk/books?id=PkvoRnAM6WUC):

[Screenshot of the Google Books Contents menu]

The idea would be that if BioStor had found articles within a scanned item, they would be listed in the contents menu (title, author, starting page), and if the user clicked on the article title then the BHL viewer would jump to that page. If there were no known articles, but the scanned item had a table of contents flagged (e.g., http://www.biodiversitylibrary.org/item/25703) then the menu could function as a button that takes you to that page. If there are no articles or contents, then the menu could be grayed out, or simply not displayed. This way the interface would work for books, monographs, and journal volumes.

Now, admittedly this is not the most elegant interface, and it treats articles as fragments of books rather than individual units, but it would be a start. It would also require minimal effort both on the part of BHL (who need to add the contents button), and myself (it would be easy to create a dump of the article titles indexed by scanned item).
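That dump could be as simple as one JSON object per scanned item, listing (title, author, starting page) triples. A sketch with made-up records (the field names are illustrative, not BioStor's actual export format):

```python
import json
from collections import defaultdict

# Hypothetical article records, two from the same scanned BHL item
articles = [
    {"ItemID": 25703, "title": "On some beetles", "authors": "Smith", "start_page": 300},
    {"ItemID": 25703, "title": "A new frog", "authors": "Jones", "start_page": 281},
]

toc = defaultdict(list)
for a in articles:
    toc[a["ItemID"]].append(
        {"title": a["title"], "authors": a["authors"], "page": a["start_page"]}
    )

# Sort each item's contents by starting page, ready for a Contents menu
dump = {item: sorted(rows, key=lambda r: r["page"]) for item, rows in toc.items()}
print(json.dumps(dump, indent=2))
```

BHL would only need to fetch the object for the item being viewed and render it as a menu.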

Linking taxonomic names to literature: beyond digitised 5×3 index cards

Tomorrow is the Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting. It should be an interesting gathering, albeit overshadowed by the sudden death of Frank Bisby.

I'm giving a talk entitled "Open Taxonomy", in which I argue that most taxonomic databases are little more than digitised collections of 5×3 index cards, where literature is treated as dumb citation strings rather than as resources with digital identifiers. To make the discussion concrete I've created a mapping between the Index to Organism Names (ION) database and a range of bibliographic sources, such as CrossRef (for DOIs), BioStor, JSTOR, etc.

This mapping is online at http://iphylo.org/~rpage/itaxon/.

So far I've managed to link some 200,000 animal names to a literature identifier, and a good fraction of these articles are freely available, either as images in BioStor and Gallica (for which I've created a simple viewer) or as PDFs (which are displayed using Google Docs).

Some examples are:


The site is obviously a work in progress, and there's a lot to be done to the interface, but I hope it conveys the key point: a significant fraction of the primary taxonomic literature is online, and we should be linking to this. The days of digitised 5×3 index cards are past.



Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone — the citation might not exactly follow the rules, and there might be errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for arguments sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.

Might not be elegant, but I suspect it would be fast.
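The three steps above can be sketched using nothing fancier than Python's built-in approximate string matching (the styles and threshold here are arbitrary choices for illustration; a real system would generate styles via CSL/citeproc):

```python
from difflib import SequenceMatcher

# Step 2: a few common citation styles as format templates
STYLES = [
    "{authors} ({year}) {title}. {journal} {volume}: {pages}",
    "{authors}, {year}. {title}. {journal}, {volume}, {pages}",
    "{journal} {volume}: {pages}",  # a bare "microcitation"
]

def generate_strings(ref):
    return [s.format(**ref) for s in STYLES]

# Step 3: approximate matching against the generated universe
def best_match(citation, refs, threshold=0.5):
    best, best_score = None, 0.0
    for ref in refs:
        for candidate in generate_strings(ref):
            score = SequenceMatcher(None, citation.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = ref, score
    return best if best_score >= threshold else None

ref = {
    "authors": "Zweifel, R. G.", "year": 1994,
    "title": "A remarkable new asterophryine microhylid frog from the mountains of New Guinea",
    "journal": "Memoirs of the Queensland Museum", "volume": 37, "pages": "281-286",
}
found = best_match(
    "Zweifel RG (1994) A remarkable new asterophryine microhylid frog "
    "from the mountains of New Guinea. Mem. Qd Mus. 37: 281-286", [ref])
```

Even with the abbreviated journal name in the query, the generated full-style string is close enough to win.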

More BHL app ideas

Following on from my previous post on BHL apps, and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that this is what BHL Australia are trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app itself would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

[Screenshot of the plain BHL item page]

which gives you no idea that it contains images like this:

[One of the item's illustrations, from BHL's Flickr stream]

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter, @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
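A sketch of option 2, building an OpenURL that carries the originating page. The endpoint path and the use of rfr_id for the referring page follow the proposal above; check BHL's actual OpenURL resolver documentation for the parameters it supports, and the referring page here is hypothetical:

```python
from urllib.parse import urlencode

def bhl_openurl(journal, volume, spage, date, referring_page,
                base="https://www.biodiversitylibrary.org/openurl"):
    """Build an OpenURL 1.0 (KEV) query for a journal article,
    with the originating page carried in rfr_id."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.jtitle": journal,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": date,
        "rfr_id": referring_page,  # where the link came from
    }
    return base + "?" + urlencode(params)

url = bhl_openurl("Memoirs of the Queensland Museum", "37", "281", "1994",
                  "https://species.wikimedia.org/wiki/Example_page")
```

A Mediawiki template would emit exactly this kind of URL, and BHL could then harvest the rfr_id values from its logs.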





Adding Solr to BioStor: searching for real


Prompted by the appearance on the BHL blog of an article about BioStor, I've been thinking about how to improve what is basically a fairly clunky tool.

One major weakness is searching the collection of nearly 40,000 articles extracted from BHL. Note the word "extracted." BioStor isn't a tool like PubMed or Google Scholar where the goal is to find articles on a topic. Instead it addresses a more specific question, namely whether a given article is contained in an item scanned by BHL. Confusion about this was one reason publication of my paper on BioStor (doi:10.1186/1471-2105-12-187) took so long to pass through the review stage.

However, users (myself included) expect to be able to search for articles. So, it's time to explore ways to make it easier to find articles within the BioStor database. I've junked the previous pretty crappy code I wrote and have started to play with the Solr search engine. I'd experimented with Solr a while ago, but other stuff got in the way. Today I've managed to add it to BioStor and do a preliminary indexing of the articles in BioStor. So far I'm only indexing basic bibliographic metadata, and displaying the first 30 hits, but already it's making it much easier to find interesting stuff in BioStor.
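Under the hood a search like this is just an HTTP GET against Solr's select handler. A sketch (the core name "biostor" and the example query fields are assumptions, not the actual schema):

```python
from urllib.parse import urlencode

def solr_search_url(q, rows=30, base="http://localhost:8983/solr/biostor/select"):
    """Build a Solr select URL for a bibliographic query."""
    params = [
        ("q", q),        # e.g. a fielded query such as title:lice
        ("rows", rows),  # show the first 30 hits
        ("wt", "json"),  # ask Solr for JSON rather than XML
    ]
    return base + "?" + urlencode(params)

url = solr_search_url("title:lice AND year:1994")
print(url)
```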

Solr also supports faceted searching (i.e., clustering results by categories such as year, author, or journal). I haven't done much with this yet, but there's clearly a lot of scope. I could also add taxonomic names, and even the OCR text, to Solr, greatly expanding the ability to find articles. But that's for the future. For now, here are some interesting searches:




ZooBank on CouchDB: UUIDs, replication, and embedding the literature in taxonomic databases

Last December I released a web site called Australian Faunal Directory on CouchDB, which was part of my ongoing exploration of how to build a simple yet useful database of taxonomic names. In particular, I want to link names directly to the primary taxonomic literature. No longer is it adequate to simply list names, or list names with mangled bibliographic details (I'm looking at you, Catalogue of Life). This is the 21st century, so I expect one click from name to literature, or at the most two (via, say, a DOI). Nothing else will cut it.

Couchbase

The Australian Faunal Directory (AFD) was an eye opener as it was the first serious use I'd made of CouchDB (now CouchBase). I'd played with replicating and forking data in 2010: Catalogue of Life and CouchDB, but the AFD project was bigger, and also inspired me to use web hooks to make the database editable. Suddenly this stuff started to look easy: no schema, simple web services, and tiny amounts of code.

ZooBank
So then my attention turned to ZooBank, which is "the official registry of Zoological Nomenclature, according to the International Commission on Zoological Nomenclature (ICZN)." ZooBank was proposed by Polaszek et al. (2005) in a short piece in Nature ("A universal register for animal names", doi:10.1038/437477a). By providing a registry of names for animals, ultimately it aims to help avoid embarrassing situations such as the example I recount in my paper on BioStor (doi:10.1186/1471-2105-12-187): a recent paper in Nature published the name Leviathan for an extinct sperm whale with a giant bite (doi:10.1038/nature09067), only for authors to have to publish an erratum with a new name (doi:10.1038/nature09381) when it was discovered that Leviathan had already been used for an extinct mammoth.

ZooBank is developed and run by Rich Pyle, and has some nice features, such as RDF export (via LSIDs), but like most taxonomic databases it doesn't link directly to the literature. Where are the DOIs? Where are links to BHL? Where is the ability to add these links? And why is it almost entirely about fish? (OK, I know the answer to that one).

CouchDB
But the thing that really got me thinking about using CouchDB to create a version of ZooBank was Rich Pyle's vision of having a distributed ZooBank, and his insistence on using ugly UUIDs in ZooBank identifiers (e.g., urn:lsid:zoobank.org:act:6BBEF50E-76B4-42EF-97B1-7029DBCD8257). Ugly as they are, Rich has always argued that UUIDs make distributed systems easy, because you don't need a centralised system to assign unique identifiers.

Anybody who has played with CouchDB will know that CouchDB uses UUIDs by default to create identifiers for database documents. It also excels at data synchronisation, and can run on platforms large and small (including mobile platforms such as Android and iOS). This means a database could be updated on an iPhone or iPad without an Internet connection, then synchronised with other databases. Indeed, I developed this CouchDB clone of ZooBank on my MacBook, then pointed it at CouchDB running on my server and within minutes had an exact copy of the database running there. This ease of replication, together with the joy of schema-less design, makes CouchDB seem an obvious fit for ZooBank.
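To make the UUID point concrete, here is a minimal sketch of creating a CouchDB-style document with a locally generated UUID. The field names are purely illustrative, not ZooBank's actual schema:

```python
import json
import uuid

def new_act_document(name, author, year):
    """Create a CouchDB-style document for a nomenclatural act.

    Generating the UUID locally (e.g. on a disconnected laptop or
    phone) means the document can later be replicated to any other
    node without id collisions -- no central registry needed.
    """
    return {
        "_id": uuid.uuid4().hex,  # globally unique identifier
        "type": "act",
        "name": name,
        "author": author,
        "year": year,
    }

doc = new_act_document("Leviathan", "Koch", 1841)
payload = json.dumps(doc)  # body for PUT /zoobank/{_id}
```

Because every node mints ids the same way, two databases edited independently can be merged by CouchDB replication without any coordination beforehand.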

Demo
You can see the ZooBank on CouchDB demo here. It's not a complete copy of ZooBank, but has most of it. I reuse the UUIDs issued by ZooBank, so that

http://zoobank.org:80/?uuid=6bbef50e-76b4-42ef-97b1-7029dbcd8257

becomes

http://iphylo.org/~rpage/zoobank/6bbef50e-76b4-42ef-97b1-7029dbcd8257
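The mapping between the two URLs is mechanical, since the demo uses the same (lower-cased) UUID as the CouchDB document id. A minimal sketch:

```python
from urllib.parse import urlparse, parse_qs

def demo_url(zoobank_url, base="http://iphylo.org/~rpage/zoobank/"):
    """Map a ZooBank URL of the form ?uuid=... onto the CouchDB demo,
    which reuses the same UUID (lower-cased) as the document id."""
    query = parse_qs(urlparse(zoobank_url).query)
    return base + query["uuid"][0].lower()

mapped = demo_url("http://zoobank.org:80/?uuid=6BBEF50E-76B4-42EF-97B1-7029DBCD8257")
```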

As usual it's all a bit crude, but has some nice features, such as links to BHL content with a built in article viewer I wrote for the AFD project:

Etheostoma

What's next?
At present only a fraction of the ZooBank references have external links; I hope to add more in the next few days, using both automatic scripts and the web hook interface. The search interface needs work, and given that ZooBank is about nomenclature rather than taxonomy, it might be useful to add a classification (say, from the Catalogue of Life) so that users can navigate around the names (and get a sense of how many are *cough* fish).

At present to display a reference I do one of four things:
  1. If the reference is in BHL I use my article viewer
  2. If there is a freely available PDF online I display that using Google Docs PDF viewer
  3. If 1 and 2 don't apply, but there is a DOI then I resolve the DOI and display the result in an IFRAME (yuck)
  4. If none of 1-3 apply I display a blank rectangle
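The fallback logic above could be sketched like this (the record keys are hypothetical stand-ins for whatever the real data holds):

```python
def viewer_for(reference):
    """Pick a display strategy for a reference, following the
    fallback order described above. The dictionary keys are
    illustrative, not the actual schema."""
    if reference.get("bhl_pages"):      # 1. in BHL: use the article viewer
        return ("biostor-viewer", reference["bhl_pages"])
    if reference.get("pdf_url"):        # 2. free PDF: Google Docs PDF viewer
        return ("google-docs-pdf", reference["pdf_url"])
    if reference.get("doi"):            # 3. DOI: resolve, show in an IFRAME
        return ("iframe", "http://dx.doi.org/" + reference["doi"])
    return ("blank", None)              # 4. nothing to show

choice = viewer_for({"doi": "10.1186/1471-2105-12-187"})
```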

There are a couple of ways we could improve this. The first is to enhance the display of BHL content by making use of the structure of the source DjVu files. Another is to make use of the XML now being made available by the journal ZooKeys (see my blog post, and Pensoft's announcement that ZooKeys is now being archived by PubMed Central, complete with taxonomic markup). There are a lot of ZooKeys articles in ZooBank, so there's a lot of potential for embedding an article viewer that takes ZooKeys XML and redisplays it with taxonomic names and references as clickable links to other ZooBank content. That way we approach the point where the taxonomic literature becomes a first-class citizen of a taxonomic database.

BioStor article published (finally)

Logo

My article describing BioStor — "Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library" — has finally seen the light of day in BMC Bioinformatics (doi:10.1186/1471-2105-12-187; the DOI is not working at the moment, give it a little while to go live, meantime you can access the article here).

Getting this article published was more work than I expected. There seems to be an inverse correlation between how important I think the work is and how easy it is to get published — the more straightforward I think the article is, the more work it is to convince the referees of its merits. Of course, it may be that my judgement of the article's merits influences how much effort I put into making the manuscript as rigorous and clear as possible. And perhaps having a blog has spoiled me: I really struggle with the notion that it takes months to publish a paper, especially as most of the intellectual debate involved (i.e., the refereeing process) happens behind closed doors, compared to the open and immediate nature of commentary on a blog post.

However, despite my frustrations with the refereeing process, there's no doubt that it did improve the manuscript (you can see the original version at Nature Precedings, hdl:10101/npre.2010.4928.1).

With the publication of this article, and last week's conversation with Anurag Acharya and Darcy Dapra about getting BioStor indexed by Google Scholar, it has been a good few days for BioStor.



BHL, DjVu, and reading the f*cking manual

One of the biggest challenges I've faced with the BioStor project, apart from dealing with messy metadata, has been handling page images. At present I get these from the Biodiversity Heritage Library. They are big (typically 1 MB in size), and have the caramel colour of old paper. Nothing fills up a server quicker than thousands of images.

A while ago I started playing with ImageMagick to resize the images, making them smaller, as well as looking for ways to remove the background colour, leaving just black text and lines on a white background.

Before and after converting BHL image


I think this makes the page image clearer, as well as removing the impression that this is some ancient document, rather than a scientific article. Yes, it's the Biodiversity Heritage Library, but the whole point of the taxonomic literature is that it lasts forever. Why not make it look as fresh as when it was first printed?

Working out how to best remove the background colour takes some effort, and running ImageMagick on every image that's downloaded starts putting a lot of stress on the poor little Mac Mini that powers BioStor.
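For the record, here is a sketch of the kind of ImageMagick invocation involved. The particular options and the scale factor are my assumptions; getting good results on real scans takes trial and error:

```python
def clean_page_command(src, dest, scale="50%"):
    """Build an ImageMagick `convert` command that greys out the
    caramel paper colour and shrinks the image. The options shown
    are illustrative -- tune them per scan."""
    return [
        "convert", src,
        "-colorspace", "Gray",  # discard the paper colour
        "-normalize",           # stretch contrast: text black, paper white
        "-resize", scale,       # smaller files are kinder to the server
        dest,
    ]

cmd = clean_page_command("page.jpg", "page.png")
# subprocess.call(cmd) would run it, given ImageMagick is installed
```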

Then there's the issue of having an iPad viewer for BHL, and making it interactive. So, I started looking at the DjVu files generated by the Internet Archive, and thinking whether it would make more sense to download those and extract images from them, rather than go via the BHL API. I'll need the DjVu files for the text layout anyway (see Towards an interactive DjVu file viewer for the BHL).

I couldn't remember the command to extract images from DjVu, but I did remember that Google is my friend, which led me to this question on Stack Overflow: Using the DjVu tools to for background / foreground seperation?.

OMG! DjVu tools can remove the background? A quick look at the documentation confirmed it. So I did a quick test. The page on the left is the default page image, the page on the right was extracted using ddjvu with the option -mode=foreground.

507.png


Much, much nicer. But why didn't I know this? Why did I waste time playing with ImageMagick when it's a trivial option in a DjVu tool? And why does BHL serve the discoloured page images when it could serve crisp, clean versions?
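Wrapped in Python, the ddjvu invocation might look like this. The page number 507 matches the example above, and the output format is TIFF because ddjvu writes TIFF and PNM formats rather than PNG:

```python
def foreground_page_command(djvu_file, page, out_file):
    """Build a ddjvu command that renders just the foreground layer
    of one page -- black text on white, no discoloured background.
    (This is the option I'd missed in the manual.)"""
    return [
        "ddjvu",
        "-format=tiff",      # ddjvu renders TIFF/PNM, not PNG directly
        "-mode=foreground",  # the crucial option
        "-page=%d" % page,
        djvu_file,
        out_file,
    ]

cmd = foreground_page_command("item.djvu", 507, "507.tiff")
# subprocess.call(cmd) would run it, given DjVuLibre is installed
```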

So, I felt like an idiot. But the other good thing that's come out of this is that I've taken a closer look at the Internet Archive's BHL-related content, and I'm beginning to think that perhaps the more efficient way to build something like BioStor is not through downloading BHL data and using their API, but by going directly to the Internet Archive and downloading the DjVu and associated files. Maybe it's time to rethink everything about how BioStor is built...

Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature

Following on from my previous post on microcitations I've blasted all the citations in Nomenclator Zoologicus through my microcitation service and created a simple web site where these results can be browsed.

The web site is here: http://iphylo.org/~rpage/nz/.

To create it I've taken a file dump of Nomenclator Zoologicus provided by Dave Remsen and run all the citations through the microcitation service, storing the results in a simple database. You can search by genus name, author and year, or publication. The search is pretty crude, and in the case of publications can be a bit hit and miss. Citations in Nomenclator Zoologicus are stored as strings, so I've used some crude rules to try and extract the publication name from the rest of the details (such as page numbering).
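To give a flavour of how crude these rules are, here is a hypothetical sketch that splits a citation string at the first volume marker or digit. The pattern is a stand-in for the rules actually used; real citations are far messier:

```python
import re

def split_citation(citation):
    """Crudely split a citation string into the publication name
    and the rest (volume, pages). Illustrative only -- the real
    parser has to cope with much messier input."""
    m = re.search(r"\b(?:Vol\.|vol\.|\d+)", citation)
    if m:
        return citation[:m.start()].strip(" ,"), citation[m.start():].strip()
    return citation.strip(), ""

parts = split_citation("Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479")
```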

To get started, you can look at names published by Distant in 1910, which you can see below:

Nz1

If the citation has been found you can click on the icon to view the page in a popup, like this:

Nz2

You can also click on the page number to be taken to that page in BHL.


I've also added some other links, such as to the name in the Index to Organism Names, as well as bibliographic identifiers such as DOIs, Handles, and links to JSTOR and CiNii.

So far only 10% of Nomenclator Zoologicus records have a match in BHL, which is slightly depressing. Browsing through there are some obvious gaps where my parser clearly failed, typically where multiple pages are included in the citation, or the citation has some additional comments. These could be fixed. There are also cases where the OCR text is so mangled that a match has been rejected because the genus name and text were too different.

This has been hastily assembled, but it's one vision of a simple service where we can go from genus name to seeing the original publication of that name. There are other things we could do with this mapping, such as enabling BHL to tell users that the reference they are looking at is the original source of a particular name, and enabling services that use BHL content (such as EOL and the Atlas of Living Australia) to flag which reference in BHL is the one that matters in terms of nomenclature.

BioStor updates on Twitter

BioStor has had a Twitter account @biostor_org for a while, but it's not been active. I finally got around to hooking it up to BioStor, so that now every time an article is added to BioStor, the title of that article and its URL appear in the @biostor_org Twitter feed.



Activity on this feed will be variable, depending on whether articles are being added manually, or in bulk. But it's a handy way to keep tabs on the growing number of articles being harvested from the Biodiversity Heritage Library.

Mendeley, OpenURL, BioStor, and BHL

Mendeley has added a feature which makes it easier to use Mendeley with repositories such as BioStor and BHL. As announced in Get Full Text: Mendeley now works with your local library via OpenURL, you can now add OpenURL resolvers to your Mendeley account:
We’ve added a button to the catalog pages that will allow you to get the article from your library right in Mendeley. This feature will link you directly to the full text copy according to your institutional access rights.
Ironically, in the UK access to electronic articles from a University is pretty seamless via the UK Access Management Federation, so I don't need to add an OpenURL resolver to get full text for an article. But this new feature does enable another way to access articles in my BioStor repository. By adding the BioStor OpenURL resolver to your Mendeley account, you can search for articles from your Mendeley library in BioStor.

The Mendeley blog post explains how to set up an OpenURL resolver. Go to your Mendeley account and click on the My Account button in the upper right corner of the page, then select Account Details, then the Sharing/Importing tab, or just click here.

openurl_settings.jpg

Click on Add library manually, then enter the name of the resolver (e.g., "BioStor") and the URL http://biostor.org/openurl:

Snapshot 2011-03-01 07-37-20.png

If you view a reference in Mendeley, you will now see something like this:

Snapshot 2011-03-01 07-40-04.png

In addition to the DOI and the URL, this reference now displays a Find this paper at menu. Clicking on it shows the default services, together with any OpenURL resolvers you've added (in this case, BioStor):
Snapshot 2011-03-01 07-42-50.png

You can add multiple resolvers, so we could also add the BHL OpenURL resolver http://www.biodiversitylibrary.org/openurl, although finding articles isn't that resolver's strong point.

Now, what would be very handy is if Mendeley were to complete the circle by providing their own OpenURL resolver, so that people could find articles in Mendeley from metadata such as article title, journal, volume, and starting page. The Mendeley API might be a way to implement this, although its search features lack the granularity needed.
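For example, a resolver request using standard OpenURL key/value pairs might be built like this. Whether BioStor honours exactly these keys is an assumption on my part; check the resolver before relying on it:

```python
from urllib.parse import urlencode

def biostor_openurl(journal, volume, spage, year,
                    resolver="http://biostor.org/openurl"):
    """Build an OpenURL request using common KEV keys.

    The exact keys the resolver expects are an assumption here;
    this just shows the shape of the request."""
    params = {
        "genre": "article",
        "title": journal,  # journal title
        "volume": volume,
        "spage": spage,    # starting page
        "date": year,
    }
    return resolver + "?" + urlencode(params)

query_url = biostor_openurl("Annals and Magazine of Natural History",
                            "11", "473", "1893")
```

Resolving `query_url` would (if the article has been extracted) redirect to the corresponding BioStor page.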

Why metadata matters

Quick note to express the frustration I experience sometimes when dealing with taxonomic literature. As part of a frankly Quixotic desire to link every article cited in the Australian Faunal Directory (AFD) to the equivalent online resource (for example, in the Biodiversity Heritage Library using BioStor, or to a publisher web site using a DOI) I sometimes come across references that I should be able to find yet can't. Often it turns out that the metadata for the article is incorrect. For example, take this reference:
Report upon the Stomatopod crustaceans obtained by P.W. Basset-Smith Esq., surgeon R.N. during the cruise, in the Australia and China Sea, of H.M.S. "Penguin", commander W.V. Moore. Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479 pl. 20B
which is in the Australian Faunal Directory (urn:lsid:biodiversity.org.au:afd.publication:087892ae-2134-4bb4-83ae-8b8cbd15b299). Using my OpenURL resolver in BioStor I failed to locate this article. Sometimes this is because the code I used to parse references from AFD mangles the reference, but not in this case. So, I Google the title and find a page in the Zoological catalogue of Australia: Aplacophora, Polyplacophora, Scaphopoda:


Here's the relevant part of this page:
Zoocat
Same as AFD, Ann. Mag. Nat. Hist. volume 6, pages 473-479, 1893.

In despair I looked at the BHL page for The Annals and Magazine of Natural History and discover that there is no volume 6 published in 1893. There is, however, series 6. Oops! Browsing the BHL content I discover the start of the article I'm looking for on BHL page 27734740, in volume 11 of series 6 of The Annals and Magazine of Natural History. Gotcha! So, I can now link AFD to BHL like this.

I should stress that in general AFD is a great resource for someone like me trying to link names to literature and, to be fair, with its reuse of volume numbers across series The Annals and Magazine of Natural History can be a challenge to cite. Usually the bibliographic details in AFD are accurate enough to locate articles in BHL or CrossRef, but every so often references get mangled, misinterpreted, or someone couldn't resist adding a few "helpful" notes to a field in the database, causing my parser to fail. What is slightly alarming is how often, when I Google for the reference, I find the same erroneous metadata repeated across several articles. This, coupled with the inevitable citation mutations, can make life a little tricky. The bulk of the links I'm making are constructed automatically, but there are a few cases where one is led on a wild goose chase to find the actual reference.

Although this is an example of why it matters to have accurate metadata, it can also be seen as an argument for using identifiers rather than metadata. If these references had stable, persistent identifiers (such as DOIs) that taxonomic databases cited, then we wouldn't need detailed metadata, and we could avoid the pain of rummaging around in digital archives trying to make sense of what the author meant to cite. Until taxonomic databases routinely use identifiers for literature, names and literature will be as ships that pass in the night.

Web Hooks and OpenURL: the screencast

Yesterday I posted notes on Web Hooks and OpenURL. That post was written when I was already late (you know, when you say to yourself "yeah, I've got time, it'll just take 5 minutes to finish this..."). The Web Hooks + OpenURL project is still very much a work in progress, but I thought a screencast would help explain why I think this is going to make my life a lot easier. It shows an example where I look at a bibliographic record in one database (AFD, the Australian Faunal Directory on CouchDB), click on a link that takes me to BioStor — where I can find the reference in BHL — then simply click a button on the BioStor page to "automagically" update the AFD database. The "magic" is the Web Hook. The link I click on in the AFD database contains the identifier for that entry in the AFD, as well as a URL BioStor can call when it's found the reference (that URL is the "web hook").
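The shape of such a link could be sketched as follows. The parameter names and the hook URL are purely illustrative, not what the live sites actually use:

```python
from urllib.parse import urlencode

def biostor_link_with_hook(openurl_params, afd_id,
                           hook_base="http://iphylo.org/~rpage/afd/hook.php"):
    """Build a BioStor lookup link that carries a web hook: the URL
    BioStor should call back (with the BHL details) once the
    reference has been found. All names here are hypothetical."""
    callback = hook_base + "?" + urlencode({"id": afd_id})
    params = dict(openurl_params)
    params["webhook"] = callback  # BioStor calls this when the match is confirmed
    return "http://biostor.org/openurl?" + urlencode(params)

link = biostor_link_with_hook({"genre": "article", "spage": "473"},
                              "087892ae-2134-4bb4-83ae-8b8cbd15b299")
```

When BioStor finds the reference it issues a request to the callback, and the AFD database updates itself without me ever editing it by hand.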

Using Web Hooks and OpenURL from Roderic Page on Vimeo.