
70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice-sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, resulting in text and figures often being widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original description in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...

BHL and text-mining: some ideas

Some quick notes on possibilities for text-mining BHL (in rough order of priority). Any text-mining would have to be robust to OCR errors. I've created a group of OCR-related papers on Mendeley, "OCR - Optical Character Recognition", in the Computer and Information Science category.

Improve finding taxonomic names in text in face of OCR errors

There is some published research on OCR errors that could be used to develop a tool to improve our ability to index OCR text. The outcome would be improved search in BHL (and other archives). (I've touched on some of these issues earlier.) One approach that looks interesting is anagram hashing (see Reynaert, 2008), which may be a cheap way to support approximate string matching in OCR text.

Reynaert, M. (2008). Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. Lecture Notes in Computer Science, 4919:617-630. doi:10.1007/978-3-540-78135-6_53
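To make the idea concrete, here is a minimal sketch of anagram hashing in Python. It is my own illustration rather than Reynaert's implementation: the key function (sum of character code points raised to the fifth power), the function names, and the tiny lexicon are all assumptions made for the example. Words with the same letters get the same key regardless of order, and a single-character substitution can be tested with arithmetic on the key instead of by comparing strings.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent key: sum of code points raised to a fixed power."""
    return sum(ord(ch) ** power for ch in word.lower())


def build_index(lexicon):
    """Map each anagram key to the lexicon entries that share it."""
    index = defaultdict(set)
    for name in lexicon:
        index[anagram_key(name)].add(name)
    return index


def candidates(token, index, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Lexicon entries that are a reordering of token, or one substitution away."""
    key = anagram_key(token)
    found = set(index.get(key, ()))  # same letters in any order
    for old in set(token.lower()):
        for new in alphabet:
            if new != old:
                # Swap one character by adjusting the key numerically.
                shifted = key - anagram_key(old) + anagram_key(new)
                found |= index.get(shifted, set())
    return found


# Hypothetical lexicon of known names, for illustration only.
index = build_index(["Koenike", "Thyas", "Panisopsis"])
print(candidates("Keonike", index))  # transposed letters share a key
print(candidates("Thvas", index))    # a plausible OCR confusion of y and v
```

Because the key ignores letter order, candidates still need to be checked with a proper string distance before being accepted. The attraction is that the lookup itself is a handful of hash probes per token, which matters at BHL scale.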


Recognition and extraction of literature cited

Given an article, extract all the references it cites. There's a fair amount of literature on automated citation extraction, but again we need to do this in the face of OCR errors and enormous variability in citation styles. The outputs could help build citation indexes, and also serve as data for the "bibliography of life". The citations could also be used to help locate further articles in BHL (e.g., using BioStor's OpenURL resolver).
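As a rough illustration of that last step, once a citation has been parsed into fields it can be turned into an OpenURL query. The sketch below uses the standard OpenURL keys (genre, title, volume, spage, date). The endpoint URL and the assumption that the resolver accepts exactly these keys are mine, so check them before relying on this.

```python
from urllib.parse import urlencode

# Assumed resolver endpoint, for illustration only.
BIOSTOR_OPENURL = "http://biostor.org/openurl"


def openurl_query(journal, volume, start_page, year=None, base=BIOSTOR_OPENURL):
    """Build an OpenURL-style query string from parsed citation fields."""
    params = {
        "genre": "article",
        "title": journal,      # journal title
        "volume": volume,
        "spage": start_page,   # starting page
    }
    if year is not None:
        params["date"] = year
    return base + "?" + urlencode(params)


# Example: a citation parsed into journal, volume, start page and year.
print(openurl_query("BMC Bioinformatics", 12, 187, 2011))
```

The hard part is of course the parsing that produces these fields from noisy OCR text. Building the query is the easy bit.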


Improved extraction of named entities (e.g., museum specimen codes) and localities (e.g., latitudes and longitudes, place names)

This would enable better geographic searches, and help start to link literature to museum specimen databases.
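A first pass at this kind of extraction could be done with regular expressions. The patterns below are hypothetical and deliberately simple. They assume coordinates written as degrees, optional minutes and seconds, and a hemisphere letter, and specimen codes written as an upper-case acronym followed by a catalogue number. Real OCR text would need much more tolerance, for example for degree signs misread as "o" or "0".

```python
import re

# Degrees, optional minutes and seconds, then a hemisphere letter, e.g. 12°30'S
COORD = re.compile(
    r"""
    (\d{1,3})\s*°\s*                     # degrees
    (?:(\d{1,2})\s*['′]\s*)?             # optional minutes
    (?:(\d{1,2}(?:\.\d+)?)\s*["″]\s*)?   # optional seconds
    ([NSEW])                             # hemisphere
    """,
    re.VERBOSE,
)

# Institution acronym followed by a catalogue number (hypothetical pattern).
SPECIMEN = re.compile(r"\b([A-Z]{2,6})\s?(\d{3,})\b")


def to_decimal(deg, minutes, seconds, hemi):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = int(deg) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    return -value if hemi in "SW" else value


def extract_coordinates(text):
    """Return every coordinate in text as decimal degrees, in order of appearance."""
    return [to_decimal(*m.groups()) for m in COORD.finditer(text)]


text = "Holotype USNM 123456, collected at 12°30'S, 45°15'E."
print(extract_coordinates(text))
print(SPECIMEN.findall(text))
```

Pairing each latitude with its longitude, and resolving an acronym to an actual museum collection, are separate problems that this sketch does not attempt.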

Automated recognition of articles within scanned volumes

My own approach has focussed on locating articles from their citation metadata: given an article's title, journal, volume, and pagination, find the corresponding pages in BHL:

Page, R. D. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(1), 187. doi:10.1186/1471-2105-12-187

An alternative is to infer articles from just the scanned pages. There has been some limited work on this in the context of BHL:

Lu, X., Kahle, B., Wang, J. Z., & Giles, C. L. (2008). A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’08 (p. 167). Association for Computing Machinery (ACM). doi:10.1145/1378889.1378918

The NLM has some cool stuff on automatically labelling the parts of a document, see Automated Labeling in Document Images and Ground truth data for document image analysis. See also Distance Measures for Layout-Based Document Image Retrieval.

Other links
I should also note that there's a relevant question on StackOverflow about OCR correction, which has links to tools like OCRSpell:

Taghva, K., & Stofsky, E. (2001). OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, 3(3), 125–137. doi:10.1007/PL00013558

The code is on GitHub.

Fictional taxa

Anyone who works with taxonomic databases is aware that they contain errors. Some taxonomic databases are restricted in scope to a particular taxon in which one or more people have expertise. These then get aggregated into larger databases, which may in turn be aggregated by databases whose scope is global. One consequence of this is that errors in one database can be propagated through many other databases.

As an example (for reasons I can't remember), I came across the name "Panisopus" (in the water mite family Thyasidae) but was struggling to find any mention of the taxonomic literature associated with this name. If you Google Panisopus the first two pages are full of search results from ITIS, EOL, GBIF, ZipCodeZoo, all listing several species in the genus, and sometimes taxonomic authorities, but no links to the primary literature. If you search BHL for Panisopus you get nothing, nothing at all. It's as if the name didn't exist.

Turns out, that's exactly the point. The name doesn't exist, other than in the various databases that have consumed other databases and recycled this fictional taxon. After some Googling of authors' names it became clear that "Panisopus" is probably a misspelling of "Panisopsis", which according to ION was published in:

Viets, K. (1926) Eine nomenklatorische Aenderung im Hydracarinen-Genus Thyas C. L. Koch. Zool Anz Leipzig, 66: 145--148

I can't verify this because this article is not available online. But to give one example, ITIS lists the name "Panisopus pedunculata Keonike, 1895" (TSN 83185). This name should be, as far as I can tell, Panisopsis pedunculata (Koenike, 1895), based on Mitchell, 1954 (http://biostor.org/reference/104266, http://dx.doi.org/10.5962/bhl.title.3110) who on page 36 states:

[Image: excerpt from Mitchell (1954), p. 36]

Note that Panisopsis pedunculata was originally described in a different genus (Koenike 1895 precedes the publication of the genus name by Viets in 1926). We can locate Koenike's original publication "Nordamerikanische Hydrachniden" in BHL, which I've added to BioStor http://biostor.org/reference/104265, and the original description appears on p. 192 as Thyas pedunculata (note that ITIS misspells the author's name Koenike [o and e transposed], as well as omitting the parentheses around the name).

What I find a little alarming (if not surprising) is that the entirely fictional genus "Panisopus" and its accompanying species have ended up in numerous taxonomic databases, and these databases consistently appear in the top Google searches for this name. The good news is that it's becoming increasingly easy to discover these errors, in part because more and more taxonomic literature is coming online, making it possible for users to investigate matters for themselves, rather than rely on unsupported statements in taxonomic databases. I'm continually amazed by how little evidence most taxonomic databases provide for any of the assertions that they make. If a database includes a name, I want some evidence that the name is "real". Show me the publication, or at least give me a citation that I can follow up. I can't take these databases on blind faith, because demonstrably they are replete with errors. Ironically, one measure of success in the Internet age is being in the top 10 hits for a Google search. Now, if the top ten hits are all taxonomic databases I get very, very nervous. It's a good sign the name only exists in those databases.