
Showing posts with label BHL. Show all posts

New look Biodiversity Heritage Library launched


The new look Biodiversity Heritage Library has just launched. It's a complete refresh of the old site, based on the Biodiversity Heritage Library–Australia site. If you want an overview of what's new, BHL have published a guide to the new look site. Congrats to everyone involved in the relaunch.

One of the new features draws on the work I've been doing on BioStor. The new BHL interface adds the notion of "parts" of an item, which you can see under the "Table of Contents" tab. For example, the scanned volume 109 of the Proceedings of the Entomological Society of Washington now displays a list of articles within that volume:

[Screenshot: the new BHL interface]
This means you can now jump to individual articles. Before, you had to scroll through the scan, or click through page numbers until you found what you were after. The screenshot above shows the article "Three new species of chewing lice (Phthiraptera: Ischnocera: Philopteridae) from Australian parrots (Psittaciformes: Psittacidae)". The details of this article have been extracted from BioStor, where this article appears as http://biostor.org/reference/55323. You can go directly to this article in BHL using the link http://www.biodiversitylibrary.org/part/69723. As an aside, I've chosen this article because it helps demonstrate that BHL has modern content as well as pre-1923 literature, and this article names a louse, Neopsittaconirmus vincesmithi, after a former student of mine, Vince Smith. You're nobody in this field unless you've had a louse named after you ;)

BioStor has over 90,000 articles, but this is a tiny fraction of the articles contained in BHL content, so there's a long way to go until the entire archive is indexed to article level. There will also be errors in the article metadata derived from BioStor. If we invoke Linus's Law ("given enough eyeballs, all bugs are shallow") then having this content in BHL should help expose those errors more rapidly.

As always, I have a few niggles about the site, but I'll save those for another time. For now, I'm happy to celebrate an extraordinary, open access archive of over 40 million pages. BHL represents one of the few truly indispensable biodiversity resources online.

Does the legacy biodiversity literature matter?

I've just come back from a pro-iBiosphere Workshop at Leiden where the role of "legacy literature" became the subject of some discussion. This continued on Twitter as Ross Mounce (@rmounce) and I went back and forth:
Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature was of most interest to taxonomists, whereas other biologists will be more likely to find what they want in the ever-growing recent literature. I was arguing that because many taxa are poorly studied, the chance that you will find data on your organism in the recent literature is likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fit the first two categories). My view is based on papers such as Bob May's 1988 paper:
MAY, R. M. (1988). How Many Species Are There on Earth? Science, 241(4872), 1441-1449. doi:10.1126/science.241.4872.1441
In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false).
At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature, but whether the literature was "born digital", that is, whether from its authoring to publication the document has been in digital form, so the output is in a format (e.g., HTML, XML, or PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable because the text has to be extracted by OCR, an error-prone process. Given these errors, is the effort worth it? I should say that BHL is not using the best OCR technology available (my own experience suggests that ABBYY Online is much better), and our community is not making use of research on automating OCR correction. But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:
Jones, K. E., Bielby, J., Cardillo, M., Fritz, S. A., O'Dell, J., Orme, C. D. L., & Purvis, A. (2009). PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. (W. K. Michener, Ed.) Ecology, 90(9), 2648-2648. doi:10.1890/08-1494.1
PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the year in which these publications appeared:

Decade starting   Publications
1840              1
1860              1
1890              1
1900              10
1910              4
1920              14
1930              48
1940              61
1950              114
1960              295
1970              527
1980              865
1990              1019
2000              183
The bulk of the papers came from the second half of the 20th century, and many of these are "legacy" in the sense that they are in archives like JSTOR, and hence the PDFs are based on scanned images and OCR. The oldest papers are from the 19th century, which is legacy by anyone's definition. My interpretation of this data is that even for a well-studied group such as mammals, the basic organismal-level data sought by comparative biologists is in the "legacy" literature. My suspicion is that if we attempt to build PanTHERIA-style databases for other, less well-studied taxa, the data (if it exists at all) will be found not in the modern literature (where the focus has long since moved on from the organism to genomics and systems biology) but in the corpus of taxonomic and ecological literature that is being scanned and stored in digital archives.
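
As a rough check on that interpretation, the counts in the table can be tallied directly. This is only a sketch: using the decade starting 2000 as a stand-in for "born digital" is my assumption here, not something the PanTHERIA metadata records.

```python
# Publications per decade, copied from the table above.
publications_by_decade = {
    1840: 1, 1860: 1, 1890: 1, 1900: 10, 1910: 4, 1920: 14, 1930: 48,
    1940: 61, 1950: 114, 1960: 295, 1970: 527, 1980: 865, 1990: 1019,
    2000: 183,
}

total = sum(publications_by_decade.values())


def share_before(cutoff_decade):
    """Fraction of publications from decades starting before cutoff_decade."""
    earlier = sum(
        count
        for decade, count in publications_by_decade.items()
        if decade < cutoff_decade
    )
    return earlier / total


print(total)                         # 3143, the number of sources in the metadata
print(round(share_before(2000), 3))  # 0.942
```

On these numbers roughly 94% of the sources predate 2000, whatever cutoff one prefers for "legacy".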

Update
I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.

Reading the Biodiversity Heritage Library using Readmill

tl;dr Readmill might be a great platform for shared annotation and correction of Biodiversity Heritage Library content.

Thinking about accessing the taxonomic literature I started revisiting previous ideas. One is DeepDyve (see DeepDyve - renting scientific articles). Imagine not having to pay large sums for an article, but being able to rent it. Yes, open access would be great, but ultimately it's all a question of money (who pays and when). The challenge is to find the mix of models that encourages people to digitise the relevant literature. Instead of publishers insisting we pay US$30 for an article, how about renting it for the short time we actually need to read it?

Another model is unglue.it, a Kickstarter-like company that seeks to raise funds to digitise and make freely available e-Books. unglue.it has campaigns where people pledge donations, and if sufficient pledges are made the book's rights-holder has the book digitised and released DRM-free.

Looking at unglue.it I stumbled across Readmill, "a curious community of readers, highlighting and sharing the books they love." Readmill has an iPad app where you can highlight passages of text and add your own annotation. These annotations can be shared, and multiple people can read and comment on the same book. Imagine doing this on BHL content. You could highlight parts of the text where the OCR has failed, and provide a correction. You could highlight taxonomic names that automatic parsers have missed, geographic localities, cited literature, etc. All within a nice, social app.

Even better, Readmill has an API. You can retrieve highlights and comments on those highlights. So, if someone flags a sentence as mangled OCR and provides a correction, that correction could be harvested and fed back to, say, BHL. These corrections could be used to improve searches, as well as the text delivered when generating searchable PDFs, etc.

You can even add highlights via the API, so we could upload an ePub book then add all the taxonomic names found by uBio or NetiNeti, enabling users to see which bits of text are probably names, correcting any mistakes along the way. Instead of giving readers a blank canvas they could already have annotations to start with.

Building an app from scratch to read and annotate BHL content would be a major undertaking. From my cursory initial look I wonder if Readmill might just provide the platform we need to clean up and annotate key parts of the BHL corpus?

BHL is duplicating DOIs because it doesn't know about articles

Quick note that as much as I like that the Biodiversity Heritage Library is using DOIs, they are generating them for publications that already have them (or are acquiring them from other sources). For example, here are the two DOIs for the same article (formatted using the DOI Citation Formatter), one from BHL and one from the Smithsonian:

Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes / Victor G. Springer. Smithsonian Institution. doi:10.5962/bhl.title.37141
Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes. Smithsonian Contributions to Zoology, (367), 1–182. doi:10.5479/si.00810282.367


The BHL DOI resolves to a page in BHL, the other DOI resolves to a page in the Smithsonian Digital Repository (this article also has the handle hdl:10088/5222).

Now this is a problem, because DOIs are meant to be unique: one article, one DOI. I've encountered duplicates elsewhere, but in these cases one should be an alias of the other. In the example above, the DOIs resolve to different locations. If you are just after the content this isn't a huge problem, but if, say, you were using the DOI to uniquely identify the publication (say, in a database) you have a problem: which DOI to choose? If you and I choose differently then we will make statements about the same article but be unaware of that sameness.

Much of this problem arises because BHL has no concept of articles. Most articles are likely to reside within scanned volumes of a journal, but some articles (e.g., monographs) may be treated as a single title by BHL, and each BHL title now gets a DOI.

I know that handling articles is on BHL's radar, but because it hasn't tackled it yet we are going to have cases where BHL DOIs duplicate existing DOIs. In these cases, BHL may have to make their DOI an alias of the other DOI.
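
Until aliases exist at the resolver level, a database that uses DOIs as keys could keep its own alias table. A minimal sketch, assuming we decide locally which DOI is canonical (here the Smithsonian one, which is my choice for illustration rather than anything BHL or CrossRef has declared):

```python
# Hypothetical local alias table: duplicate DOI -> DOI we treat as canonical.
DOI_ALIASES = {
    "10.5962/bhl.title.37141": "10.5479/si.00810282.367",
}


def canonical_doi(doi):
    """Normalise a DOI and follow alias links until a canonical DOI is reached."""
    doi = doi.strip().lower()
    seen = set()
    # The 'seen' set guards against accidental alias loops in the table.
    while doi in DOI_ALIASES and doi not in seen:
        seen.add(doi)
        doi = DOI_ALIASES[doi]
    return doi


# Both DOIs for the Springer (1982) monograph now map to the same key.
print(canonical_doi("10.5962/bhl.title.37141"))
print(canonical_doi("10.5479/si.00810282.367"))
```

With this in place, two people who picked different DOIs for the same article end up making statements about the same identifier.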

Building a BHL Africa: BHL in a box

Was going to post this as a comment on the BHL blog but they use Blogger's native comment system, which is horrible, and it refused to accept my comment (yes, yes, I'm sure it did that on grounds of taste). I read the recent post Building a BHL Africa and couldn't believe my eyes when I read the following:

the "BHL in a Box" concept was highly desired. This would entail creating interactive CDs of BHL content for distribution in areas where internet access is unreliable or unavailable.
CDs! Really? Surely this is crazy!? You want to use an obsolete technology that requires additional obsolete technology to ship BHL around Africa? Why not ship relevant parts of BHL on iPads? Lots more storage space than CDs, built-in interactivity (obviously need to write an app, but could use HTML + JavaScript as a starting point), long battery life, portable, comes with 3G support if needed. I'll be the first to admit that my knowledge of Africa is about zero, but given that mobile devices are common, mobile networks are fairly well developed, and tablets are making inroads (see iPad has become a big factor in African business) surely "BHL mobile" is the way to go to provide "BHL in a box", not CDs.

Why not develop an app that stores BHL content on a device like an iPad, then distribute those? Support updating the content over the network so the user isn't stuck with content they no longer need. In effect, something like Amazon's Kindle app or iBooks would do the trick. You'd need to compress BHL content to keep the size down (the images BHL currently displays on its web site could be made a lot smaller) but this is doable. Indeed, BHL Africa could be an ideal motivation to move BHL to platforms such as phones and tablets, where at the moment users have to struggle with a website that makes no concessions to those devices.

Postscript
Of course, it doesn't have to be the iPad as such. Imagine if BHL published books and articles on Amazon, then used Kindle to deliver content physically (i.e., ship Kindles), and anyone else could access it directly from Amazon using their Kindle (or Kindle app on iPad).

70,000 articles extracted from the Biodiversity Heritage Library

Just noticed that BioStor now has just over 70,000 articles extracted from the Biodiversity Heritage Library. This number is a little "soft" as there are some duplicates in the database that I need to clean out, but it's a nice sounding number. Each article has full text available, and in most cases reasonably complete metadata.

Most of the articles in BioStor have been added using semi-automated methods, but there's been rather more manual entry than I'd like to admit. One task that does have to be done manually is attaching plates to papers. This is largely an issue for older publications, where printing text and figures required different processes, resulting in text and figures often being widely separated in the publication. Technology evolved, and the more recent literature doesn't have this problem.

Future plans include adding the ability to download the articles as searchable PDFs, and to support OCR correction, amongst other things. BioStor also underpins some of my other projects, such as the EOL Challenge entry, which as of now has around 80,000 animal names linked to their original description in BioStor (and some 300,000 in total linked to some form of digital identifier). One day I may also manage to get the article locations into BHL itself, so that when you browse a scanned item in BHL you can quickly find individual articles. Oh, and it would be cool to have all this on the iPad...

BHL and text-mining: some ideas

Some quick notes on possibilities for text-mining BHL (in rough order of priority). Any text-mining would have to be robust to OCR errors. I've created a group of OCR-related papers on Mendeley:

OCR - Optical Character Recognition is a group in Computer and Information Science on Mendeley.

Improve finding taxonomic names in text in face of OCR errors

There is some published research on OCR errors that could be used to develop a tool to improve our ability to index OCR text. The outcome would be improved search in BHL (and other archives). I've touched on some of these issues earlier. One approach that looks interesting is using anagram hashing (see Reynaert, 2008), which may be a cheap way to support approximate string matching in OCR text.

Reynaert, M. (2008). Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. Lecture Notes in Computer Science, 4919:617-630. doi:10.1007/978-3-540-78135-6_53 (PDF here).
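
To make the idea concrete, here is a minimal sketch of anagram hashing as I understand it, not Reynaert's implementation. Each word gets a numeric key that ignores character order (the sum of its code points raised to a power; the exponent 5 is my recollection of the paper and should be checked), so a one-character substitution shifts the key by a predictable amount. The lexicon and the OCR token are made up for illustration.

```python
from collections import defaultdict

POWER = 5  # exponent for the anagram key; assumed, check against Reynaert (2008)


def anagram_key(word):
    """Sum of code points raised to POWER; anagrams share the same key."""
    return sum(ord(ch) ** POWER for ch in word.lower())


def build_index(lexicon):
    """Group lexicon words by anagram key."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word)
    return index


def candidates(token, index, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Lexicon words whose key is reachable by replacing one character of token.

    Replacing character a with b changes the key by ord(b)**POWER - ord(a)**POWER,
    so every candidate costs one subtraction, one addition and one lookup.
    """
    base = anagram_key(token)
    found = set(index.get(base, ()))  # exact matches and anagrams
    for a in set(token.lower()):
        for b in alphabet:
            if a != b:
                key = base - ord(a) ** POWER + ord(b) ** POWER
                found |= index.get(key, set())
    return found


index = build_index(["Thyas", "Panisopsis", "Hydrachna"])
print(candidates("Tbyas", index))  # OCR has read "h" as "b"
```

Because the key ignores character order, the candidates are only a shortlist; each one would still need checking with an ordinary edit distance before being accepted as a correction.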


Recognition and extraction of literature cited

Given an article extract all the references it cites. There's a fair amount of literature on automated citation extraction, but again we need to do this in the face of OCR errors, and enormous variability in citation styles. The outputs could help build citation indexes, and also serve as data for the "bibliography of life". The citations could also be used to help locate further articles in BHL (e.g., using BioStor's OpenURL resolver).
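
Once a citation has been parsed into fields, turning it into a resolver query is the easy part. A hedged sketch: the endpoint URL and the choice of OpenURL keys (genre, title, volume, spage, date) are assumptions on my part, and the page number in the example is invented.

```python
from urllib.parse import urlencode

# Assumed endpoint for BioStor's OpenURL resolver; verify before relying on it.
BIOSTOR_OPENURL = "http://biostor.org/openurl"


def openurl_query(journal, volume, start_page, year=None):
    """Build an OpenURL-style query string from parsed citation fields."""
    params = {
        "genre": "article",
        "title": journal,
        "volume": volume,
        "spage": start_page,
    }
    if year is not None:
        params["date"] = year
    return BIOSTOR_OPENURL + "?" + urlencode(params)


# Hypothetical citation: the start page is made up for illustration.
url = openurl_query(
    "Proceedings of the Entomological Society of Washington", 109, 1
)
print(url)
```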


Improved extraction of named entities (e.g., museum specimen codes) and localities (e.g., latitude and longitudes, place names)

This would enable better geographic searches, and help start to link literature to museum specimen databases.
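
For coordinates, even a simple pattern gets a long way. The sketch below handles only one layout (degrees and minutes with hemisphere letters, latitude first) and is an illustration rather than the code BioStor uses. Real OCR text would need many more variants and some tolerance for misread degree symbols.

```python
import re

# Matches e.g. 14°30'N, 105°45'E (degrees and minutes only, latitude first).
COORD = re.compile(
    r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*([NS])[,;\s]+"
    r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*([EW])"
)


def extract_points(text):
    """Return (latitude, longitude) pairs in decimal degrees found in text."""
    points = []
    for lat_d, lat_m, ns, lon_d, lon_m, ew in COORD.findall(text):
        lat = int(lat_d) + int(lat_m) / 60
        lon = int(lon_d) + int(lon_m) / 60
        points.append((-lat if ns == "S" else lat, -lon if ew == "W" else lon))
    return points


# Made-up sentence for illustration.
print(extract_points("Collected at 14°30'N, 105°45'E near the river."))
```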

Automated recognition of articles within scanned volumes

My own approach to finding articles has focussed on finding articles based on citation metadata, e.g. based on article title, journal, volume, and pagination, find corresponding article in BHL:

Page, R. D. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(1), 187. doi:10.1186/1471-2105-12-187

An alternative is to infer articles from just the scanned pages. There has been some limited work on this in the context of BHL:

Lu, X., Kahle, B., Wang, J. Z., & Giles, C. L. (2008). A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’08 (p. 167). Association for Computing Machinery (ACM). doi:10.1145/1378889.1378918 (PDF here)

The NLM has some cool stuff on automatically labelling the parts of a document, see Automated Labeling in Document Images and Ground truth data for document image analysis. See also Distance Measures for Layout-Based Document Image Retrieval.

Other links
Should also note that there's a relevant question on StackOverflow about OCR correction, which has links to tools like OCRspell:

Taghva, K., & Stofsky, E. (2001). OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, 3(3), 125–137. doi:10.1007/PL00013558

Code is on github.

Fictional taxa

Anyone who works with taxonomic databases is aware of the fact that they have errors. Some taxonomic databases are restricted in scope to a particular taxon in which one or more people have expertise; these then get aggregated into larger databases, which may in turn be aggregated by databases whose scope is global. One consequence of this is that errors in one database can be propagated through many other databases.

As an example (for reasons I can't remember), I came across the name "Panisopus" (in the water mite family Thyasidae) but was struggling to find any mention of the taxonomic literature associated with this name. If you Google Panisopus the first two pages are full of search results from ITIS, EOL, GBIF, ZipCodeZoo, all listing several species in the genus, and sometimes taxonomic authorities, but no links to the primary literature. If you search BHL for Panisopus you get nothing, nothing at all. It's as if the name didn't exist.

Turns out, that's exactly the point. The name doesn't exist, other than in the various databases that have consumed other databases and recycled this fictional taxon. After some Googling of author's names it became clear that "Panisopus" is probably a misspelling of "Panisopsis", which according to ION was published in:

Viets, K. (1926) Eine nomenklatorische Aenderung im Hydracarinen-Genus Thyas C. L. Koch. Zool Anz Leipzig, 66: 145--148

I can't verify this because this article is not available online. But to give one example, ITIS lists the name "Panisopus pedunculata Keonike, 1895" (TSN 83185). This name should be, as far as I can tell, Panisopsis pedunculata (Koenike, 1895), based on Mitchell, 1954 (http://biostor.org/reference/104266, http://dx.doi.org/10.5962/bhl.title.3110) who on page 36 states:

[Image: excerpt from Mitchell, 1954, p. 36]

Note that Panisopsis pedunculata was originally described in a different genus (Koenike 1895 precedes the publication of the genus name by Viets in 1926). We can locate Koenike's original publication "Nordamerikanische Hydrachniden" in BHL, which I've added to BioStor http://biostor.org/reference/104265, and the original description appears on p. 192 as Thyas pedunculata (note that ITIS misspells the author's name Koenike [o and e transposed], as well as omitting the parentheses around the name).

What I find a little alarming (if not surprising) is that the entirely fictional genus "Panisopus" and its accompanying species have ended up in numerous taxonomic databases, and these databases consistently appear in the top Google searches for this name. The good news is that it's becoming increasingly easy to discover these errors, in part because more and more taxonomic literature is coming online, making it possible for users to investigate matters for themselves, rather than rely on unsupported statements in taxonomic databases. I'm continually amazed by how little evidence most taxonomic databases provide for any of the assertions that they make. If a database includes a name, I want some evidence that the name is "real". Show me the publication, or at least give me a citation that I can follow up. I can't take these databases on blind faith, because demonstrably they are replete with errors. Ironically, one measure of success in the Internet age is being in the top 10 hits for a Google search. Now, if the top ten hits are all taxonomic databases I get very, very nervous. It's a good sign the name only exists in those databases.
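
The check I did by hand for "Panisopus" could be partly automated: flag any database name that never appears in the literature but sits very close to one that does. A minimal sketch using Python's standard difflib; the name lists and the 0.8 cutoff are assumptions for illustration, and a flagged name is only a lead to follow up, not proof of a misspelling.

```python
from difflib import get_close_matches


def suspicious_names(database_names, literature_names, cutoff=0.8):
    """Map names absent from the literature to the closest attested spelling."""
    attested = set(literature_names)
    flagged = {}
    for name in database_names:
        if name not in attested:
            close = get_close_matches(name, literature_names, n=1, cutoff=cutoff)
            if close:
                flagged[name] = close[0]
    return flagged


# Toy lists: names from an aggregated database vs. names found in scanned text.
print(suspicious_names(["Panisopus", "Thyas"], ["Panisopsis", "Thyas", "Hydrachna"]))
```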

BHL to PDF workflow

Just some random thoughts on creating searchable PDFs for articles extracted from BHL.

[Figure: workflow diagram]

BHL and GBIF as biomedical databases

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research. Again, great stuff, but aren't museums simply full of dead stuff that people have collected and forgotten about?

But BHL has a lot more post-1923 content than I suspect most people realise (several museum or society journals have 21st century issues in BHL's archives, for example). Continuing the theme of linking BHL and GBIF content, as part of a forthcoming project on taxonomic names (to be made available "real soon now") I stumbled across this 1976 paper in BHL (now in BioStor):

Monograph on "Lithoglyphopsis" aperta, the snail host of Mekong River Schistosomiasis by Davis et al..

[Image: page scan from the Davis et al. paper]

This paper has been indexed in PubMed (PMID:948206), but as far as I'm aware, BHL (and BioStor) has the only digital copy of this paper. (As a side note, wouldn't it be great if PubMed could link to BHL content?)

The article page in BioStor shows a map derived from the OCR text, showing two localities:

[Map: localities extracted from the OCR text]

Below the map are the specimen codes I've automatically extracted from the OCR text, linked to the corresponding records in GBIF, which are georeferenced (e.g., ANSP Malacology 330925).

If we joined these things up just a little more, we could do some useful things. For example, what if a researcher searching in PubMed for schistosomiasis in South East Asia could find the Davis et al. paper, and then go to BHL or BioStor to read it? What if a researcher looking at gastropod distributions in the Mekong River in the GBIF portal could see that BHL had publications on diseases associated with these organisms (as well as their taxonomy and biology). We could also traverse the link from GBIF to BHL to PubMed and provide a direct route from distribution maps to biomedical literature.

It seems there's scope for trying to connect BHL, GBIF, and PubMed, and that BHL and GBIF may have important roles to play in providing access to basic information about organisms that have a serious impact on human populations.

Linking GBIF and the Biodiversity Heritage Library

Following on from exploring links between GBIF and GenBank here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates. Given that many articles in BioStor list museum specimens I wrote some code to extract these (see Extracting museum specimen codes from text) and applied this to the OCR text for those articles.
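
To give a flavour of what that extraction involves, here is a much-simplified sketch (not the code linked above). It assumes a fixed list of institution acronyms, taken from the examples in these posts, followed by a catalogue number of at least three digits.

```python
import re

# Institution acronyms mentioned in these posts; a real list would be far longer.
ACRONYMS = ("FMNH", "USNM", "ANSP", "CAS")
SPECIMEN = re.compile(r"\b(" + "|".join(ACRONYMS) + r")\s*(\d{3,})\b")


def specimen_codes(text):
    """Return specimen codes found in text, normalised to 'ACRONYM number'."""
    return [f"{acronym} {number}" for acronym, number in SPECIMEN.findall(text)]


# Made-up sentence using codes that appear in these posts.
text = "Holotype FMNH 147942; paratypes USNM730715 and ANSP 330925."
print(specimen_codes(text))
```

Normalising the spacing matters because the same code may be printed as "USNM730715" in one article and "USNM 730715" in another.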

Having a list of specimens is nice, but in this digital age I want to be able to find out more about these specimens. An obvious solution is to try to match these specimen codes to the specimen records held by GBIF. Linking to GBIF is complicated by the fact that museum codes are not unique. For example, "FMNH 147942" could refer to a bird, an amphibian, or a mammal. To tackle the non-uniqueness I use the taxonomic names extracted from each page by BHL to work out what taxon an article is mainly "about". To do this I use the Catalogue of Life classification to get "paths" for each name (i.e., the lineage of each taxon down to the root of the classification) and then find the majority-rule path. You can see these paths in the "Taxonomic classification" displayed on a page for a BioStor article. If there are multiple GBIF specimens for the same code I test whether the taxon of rank "class" in the GBIF record is in the majority-rule path for the article. If so, I accept that specimen as the match to the code.
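
The disambiguation step can be sketched as follows. This is an illustration of the idea rather than BioStor's code: lineages are written root first, "majority" is taken to mean more than half of all names, and the GBIF records are hypothetical dictionaries, not the shape of GBIF's actual responses.

```python
from collections import Counter


def majority_rule_path(paths):
    """Longest lineage prefix (root first) shared by more than half of the paths."""
    result = []
    depth = 0
    remaining = list(paths)
    while True:
        counts = Counter(p[depth] for p in remaining if len(p) > depth)
        if not counts:
            break
        taxon, n = counts.most_common(1)[0]
        if n * 2 <= len(paths):  # no strict majority at this depth
            break
        result.append(taxon)
        remaining = [p for p in remaining if len(p) > depth and p[depth] == taxon]
        depth += 1
    return result


def matching_specimens(gbif_records, path):
    """Keep the records whose class appears in the article's majority-rule path."""
    return [r for r in gbif_records if r.get("class") in path]


# Lineages for the names found in a hypothetical article.
paths = [
    ["Animalia", "Chordata", "Aves", "Passeriformes"],
    ["Animalia", "Chordata", "Aves", "Piciformes"],
    ["Animalia", "Chordata", "Mammalia"],
]
# Hypothetical records sharing one specimen code.
records = [
    {"id": "record-a", "class": "Aves"},
    {"id": "record-b", "class": "Amphibia"},
    {"id": "record-c", "class": "Mammalia"},
]
path = majority_rule_path(paths)
print(path)
print(matching_specimens(records, path))
```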

There are also issues where the specimen codes in GBIF have been modified during input (e.g., USNM 730715 has become USNM 730715.457409). There are also the inevitable OCR errors that may cause museum codes to be missed or otherwise corrupted. Bearing all this in mind, BioStor now has specimen pages (these are still being generated as I write this). For example, the page for FMNH 147942 lists the three articles in BioStor that cite this specimen code:

[Screenshot: BioStor specimen page for FMNH 147942]

All three specimens have been mapped on to GBIF occurrence http://data.gbif.org/occurrences/61846037/. When BioStor displays the articles it now lists the specimen codes that have been extracted from the article, together with the GBIF logo if the specimen has been matched to a GBIF record. For example, here is a screenshot from Deep-water octopods (Mollusca: Cephalopoda) of the northeastern Pacific:
[Screenshot: BioStor article page with map and specimen codes]

The map has been extracted from the OCR text (an obvious next step would be to add localities associated with the specimen records). Below the map are the specimen codes. The lack of some USNM specimens is probably due to misinterpreted specimen codes, whereas the CAS specimens don't seem to be online (the California Academy of Sciences has some of its collections in GBIF, but not its molluscs).

Where next?
Once these links between BioStor (and hence, BHL) and GBIF are created then we can do some interesting things. If you visit BioStor and want to learn more about a specimen you can click on the link and view the record in GBIF. We could also envisage doing the reverse. GBIF could augment the information it displays about a specimen by displaying a link to the content in BioStor (e.g., "this specimen is cited by these articles"). Those articles may contain further information about that specimen (for example, the habitat it was collected from, how secure its identification is, and so on).

We could also start to compute the "impact" of different museum collections based on the number of citations of specimens from their collections (this idea is explored further in this paper: http://dx.doi.org/10.1093/bib/bbn022, free preprint available here: hdl:10101/npre.2008.1760.1).
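
A crude version of that "impact" measure is just a count of article-specimen citation pairs per institution. The sketch below assumes specimen codes of the form "ACRONYM number" and uses invented article identifiers; it says nothing about how the cited paper defines impact.

```python
from collections import Counter


def collection_impact(article_specimens):
    """Count, per institution acronym, the (article, specimen) citation pairs."""
    impact = Counter()
    for codes in article_specimens.values():
        for code in set(codes):  # count each specimen once per article
            impact[code.split()[0]] += 1
    return impact


# Hypothetical mapping of article -> specimen codes extracted from its text.
citations = {
    "article-1": ["FMNH 147942", "USNM 730715"],
    "article-2": ["FMNH 147942"],
    "article-3": ["FMNH 147942", "FMNH 147942"],
}
print(collection_impact(citations))
```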

All of this works because we are linking objects (in this case articles and specimens) via their identifiers. Consequently, the links are as stable as their identifiers, which is why I've been pursuing the issue of specimen identifiers recently (see here, here, and here). If GBIF maintains the URLs for the specimens I've linked to, then links I've created could persist. If these URLs are likely to change (e.g., because the metadata from the host institution has changed) then the links (and any associated value we get from them) disappear. This is why I want globally unique, resolvable, persistent identifiers for specimens.




Mendeley as CiteBank: some ideas

Here are some quick notes on how BHL could use Mendeley as a "CiteBank".

As a repository of bibliographic data

If the goal is to assemble a "bibliography of life" then there are various ways this could be done.

Taxon-specific bibliographies

Create groups that are taxon-specific (or find existing groups in Mendeley). For example, I've created groups for amphibians (Amphibian Species of the World) and reptiles (TIGR/JCVI Reptile Database) based on the Amphibian Species of the World and TIGR/JCVI Reptile Database, respectively. Taxon-specific groups are probably going to be attractive to users, but the quality of bibliographic metadata can be variable. However, a bibliography for a specific taxonomic group that is populated with links to BHL content would be very useful.

Journal-specific bibliographies

This is where I've spent most of my efforts. I've created around 300 groups for various journals (see list below, or go directly to http://dl.dropbox.com/u/639486/groups.html). In some cases I've managed to populate these with the complete set of articles published in that journal, typically harvested from the journal's own web site. Typically the metadata from journal sites is high quality, although one has to be wary of Orwellian metadata.



I use these groups in two ways. The first is as a source of metadata for extracting articles from BHL using BioStor. If you have article-level metadata, finding articles in BHL becomes easier, and can be automated so that thousands can be added in a few minutes.

The second is for the taxon-literature mapping project, where one strategy is to use approximate string matching to find equivalent citations in Mendeley and the ION database. Ultimately I'd like to link to the Mendeley citations as they tend to be higher quality than those in the original ION database.
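
As an illustration of that strategy (a sketch, not the code used in the mapping project), Python's standard difflib can score two citation strings after stripping punctuation and case. The 0.8 threshold is an arbitrary assumption, and the second rendering of the Viets citation is invented to mimic a differently formatted record.

```python
import re
from difflib import SequenceMatcher


def normalise(citation):
    """Lower-case a citation and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", citation.lower()).split())


def similarity(a, b):
    """Similarity in [0, 1] between two citation strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(citation, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing passes the threshold."""
    scored = [(similarity(citation, c), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


ion_style = ("Viets, K. (1926) Eine nomenklatorische Aenderung im "
             "Hydracarinen-Genus Thyas C. L. Koch. Zool Anz Leipzig, 66: 145--148")
mendeley_style = [
    "Viets K 1926. Eine nomenklatorische Aenderung im Hydracarinen-Genus "
    "Thyas C.L. Koch. Zool. Anz. 66:145-148",
    "Koenike, F. (1895) Nordamerikanische Hydrachniden",
]
print(best_match(ion_style, mendeley_style))
```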

BHL could create Mendeley groups for journals it has scanned, and populate those.

As an article-level index to BHL

Perhaps the most direct way BHL could use Mendeley is as follows:

  1. Create a BHL account.
  2. For each BHL title create a Mendeley group (the name would be the BHL TitleID).
  3. For each item in that title create a folder in the corresponding group (the folder name would be the ItemID).
  4. Within each folder list the articles, book chapters or other component parts. If these aren't available yet, encourage people to add them. Some of these could be pre-populated with content from BioStor.
  5. Harvest the contents of these groups to provide an article-level index to BHL (the lack of which is, for me, the single biggest impediment to using BHL). Previously I've suggested a way to easily add article data to BHL; Mendeley title/item groups and folders might be a way to facilitate this process.
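
The harvesting step at the end of that list amounts to flattening a groups-and-folders hierarchy into rows. A sketch under stated assumptions: the nested dictionary stands in for whatever Mendeley would actually return, and the TitleID, ItemID, and part details are invented.

```python
def article_index(groups):
    """Flatten {TitleID: {ItemID: [part, ...]}} into sorted index rows.

    Each row is (TitleID, ItemID, start page, part title).
    """
    rows = []
    for title_id, folders in groups.items():
        for item_id, parts in folders.items():
            for part in parts:
                rows.append((title_id, item_id, part["start_page"], part["title"]))
    return sorted(rows)


# Hypothetical harvest: one group (title) containing one folder (item).
harvested = {
    1001: {
        20001: [
            {"title": "Second article", "start_page": 45},
            {"title": "First article", "start_page": 1},
        ],
    },
}
for row in article_index(harvested):
    print(row)
```
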
PDF storage

Although Mendeley offers PDF storage, this is one feature I'd be less inclined to use. Mendeley's rules for sharing PDFs and making them publicly available are too restrictive (they often don't know whether a PDF can, in fact, be shared). Plus you want tools to visualise, index, and archive PDFs. In effect a big file store with added features. I have some ideas on how this can be implemented (and have a rough working version to support http://iphylo.org/~rpage/itaxon). Alternatively, one could use Internet Archive services.

Summary

As I've often argued, given the success of tools like Mendeley it seems pointless for anyone to try and build yet another online bibliographic database. The trick is to figure out how to leverage what Mendeley provides to support what the taxonomic (and broader biodiversity) community needs.

Journals I'd like BHL to scan

I've recently updated my database of links between animal taxonomic names and literature identifiers, which now has over 280,000 names linked to some form of identifier (127,000 of these being DOIs). You can see the current version here:

http://iphylo.org/~rpage/itaxon/

As an experiment I've added a feature to list the number of names for each journal. Based on this list (limited to journals that I've found an ISSN for), here are some journals I'd like to see digitised by the Biodiversity Heritage Library (BHL). Note that by digitised I mean beyond the 1923 cutoff applied to many journals. This will mean negotiating with the journal publishers, but in a number of cases these are scientific societies or institutions, some associated with BHL. Given that major partners in BHL have made post-1923 content available, it would be nice to extend this to other key taxonomic journals.

Revue Suisse de Zoologie

Revue Suisse de Zoologie has published nearly 10,000 taxonomic names but has essentially zero digital presence, which is extraordinary. Another Swiss journal, Entomologica Basiliensia, is also an obvious candidate.

Revue de Zoologie et de Botanique Africaines

Revue de Zoologie et de Botanique Africaines has published over 5,000 names, and given the interest in providing information resources for Africa (e.g., http://www.mendeley.com/groups/1681811/bhl-africa/) this seems an obvious journal to scan completely.

Bulletin of the British Museum (Natural History) journals and books

The Natural History Museum [formerly British Museum (Natural History)] is a member of BHL, so I'd expect it to have better coverage of its own publications in BHL. There are gaps in journals such as Bulletin of the British Museum (Natural History) Entomology, which means there is a significant chunk of research published by Museum staff that simply doesn't exist digitally. At one point The Natural History Museum renamed the journals and moved them to Cambridge University Press, resulting in further gaps in digitisation. It's interesting that museums that haven't changed the title of their publications (such as the American Museum of Natural History and the Australian Museum) have better digital coverage than the NHM, which has flirted with various title changes in the last few decades. The Museum also published a series of monographs in the 20th century, many of which aren't in BHL.

Memoirs of the Queensland Museum

The Memoirs of the Queensland Museum is an important journal (> 3,000 names) but has only early issues scanned in BHL and recent issues as PDFs on the Museum web site (vulnerable to link rot when the site gets redesigned, as I've discovered to my cost).

Russian journals

Russian journals contain large numbers of taxonomic descriptions, but their digital presence is patchy. Springer has started to publish translations online (e.g., http://dx.doi.org/10.1134/S0013873810050155 in Entomological Review, which is a translation of an article in Zoologicheskii Zhurnal), but much of the Russian literature seems unavailable in digital form. BHL has spread from its US-UK origins to BHL-Europe, BHL-China, and BHL-Australia. Maybe it's time for BHL-Russia?

Summary

There are huge holes in the availability of taxonomic literature (where I equate "availability" with being digitised and online, free or otherwise). But on the other hand I've been pleasantly surprised by just how much taxonomic literature is online. It looks quite feasible to link at least 300,000 animal names to digital publications.

The journals I've highlighted are just a few obvious candidates for scanning. I suspect that as one goes down the list of taxonomic journals the rate of return will decline, to the point where scanning entire journals will be less efficient than scanning targeted articles.



Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxon

Alpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to freely available PDFs where they are available. Lots still to do as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked by "microcitations" to names, and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNii URLs all serve RDF (albeit not without a few problems).

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if its original description has been digitised.

Towards the bibliography of life

David King et al.'s paper "Towards the bibliography of life" http://dx.doi.org/10.3897/zookeys.150.2167 has just appeared in a special issue of ZooKeys. I've written a number of posts on this topic, so I've a few comments.

King et al. survey some of the issues, but don't really tackle the big issue of how we're going to build this. If we define the "bibliography of life" somewhat narrowly as the list of all papers that have published a scientific name (or a new combination, such as moving a species from one genus to another), then this is a large, but measurable undertaking. According to ION's metrics page, these are the numbers involved (for animals and protozoa):

Total New Names: 1,510,402
Total New Genera / Subgenera: 215,242
Total New Species / Subspecies: 1,192,366
Total Other New Names: 102,794
Total New Combinations: 241,296
Total New Synonyms: 260,544


Even in the worst-case scenario of one name per publication (clearly not the case) this is a big, but not insurmountable, task.

Publications not taxa
Part of the challenge is figuring out the best way to tackle the problem. In the past, most efforts at building taxonomic bibliographies have focussed on specific taxa, which is natural — the bibliographies are being built by taxonomists and they specialise in particular groups. But I'd argue that this is not the most efficient way to tackle the problem. Because the taxonomic literature is so widely dispersed, after the obvious "low hanging fruit" have been collected, considerable effort must be spent tracking down the harder to find citations. There are few economies of scale in this approach. In contrast, if we focus on publications at, say, the level of journal, then we can build a bibliography much more quickly. Once we've found the source, say, for one article, often we could use that information to harvest many articles from the same source (e.g., write scripts to harvest from a digital repository such as a DSpace server, or a digital library such as Gallica). But if we are focussed on a particular taxon, we will ignore the other articles in that journal ("what do I care about fish, I like turtles").

Put another way, if we imagine a taxa × publication matrix, then we can either go after rows (i.e., a bibliography for a specific taxonomic group), or columns (a list of articles in a specific journal). The article-based approach will be faster, albeit at the cost of finding articles that aren't necessarily relevant to taxonomy. This is why I'm spending what feels like far too much time harvesting article lists and uploading these to Mendeley. It is also one reason BHL has been so successful. They've simply gone after scanning the literature wholesale, rather than focussing on particular taxonomic groups.
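
A toy version of the taxa × publication matrix shows why working by column has better economies of scale. The records below are invented placeholders (the journal names and article identifiers are not real data), and the sketch only illustrates the row-versus-column bookkeeping.

```python
from collections import defaultdict

# Invented records: (taxonomic group, journal, article identifier)
records = [
    ("fish", "Journal A", "a1"),
    ("turtles", "Journal A", "a2"),
    ("fish", "Journal B", "b1"),
    ("turtles", "Journal C", "c1"),
]

# Row view: for each taxon, which journals must be tracked down?
by_taxon = defaultdict(lambda: defaultdict(list))
# Column view: for each journal, everything one harvest of that source yields.
by_journal = defaultdict(list)

for taxon, journal, article in records:
    by_taxon[taxon][journal].append(article)
    by_journal[journal].append(article)


def sources_for_taxon(taxon: str) -> list:
    """Journals that have to be located to complete one row of the matrix."""
    return sorted(by_taxon[taxon])
```

Completing the "turtles" row means locating two separate journals for two articles, whereas a single harvest of "Journal A" yields both of its articles at once, including the one a turtle specialist would have ignored.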

Taxapublicationmatrix
Wikispecies logo en
Crowd sourcing and Wikispecies
Crowd sourcing often strikes me as a euphemism for "we can't be bothered doing the tedious stuff, let's get the public to do it for us (plus it will look like we're engaged with the public)." I'm not denying it can work, but I suspect it's not a magic bullet. Perhaps the best crowd sourcing is not to try and bring the crowd to a project, but to go where the crowd has already gathered. In this case, an obvious crowd is the Wikispecies community. Working with the ION database for my Sherborn presentation, it's clear that the quality of bibliographic data in ION is variable, and rather poor for older references. In contrast, the reference lists on Wikispecies can be very good (e.g., the bibliography for George Boulenger). There are some issues with Wikispecies, notably the lack of a decent bibliographic template (unlike Wikipedia), so parsing references can be *cough* interesting, but there is scope here to use it to improve other databases. Citation matching can be a challenge, but in this case we have citations indexed by taxonomic name (in both ION and Wikispecies), which greatly reduces the scope of possible matches.

Summary
I think building the "bibliography of life" needs a combination of aggressive data gathering, and avoiding building additional tools unless absolutely needed. There are great tools and communities that can already be leveraged (e.g., Mendeley, Wikispecies), let's make use of them.

BHL needs to engage with publishers (and EOL needs to link to primary literature)

Browsing EOL I stumbled upon the recently described fish Protoanguilla palau, shown below in an image by rairaiken2011:
Palauan Primitive Cave Eel

Two things struck me. The first is that the EOL page for this fish gives absolutely no clue as to where you would go to find out more about it (apart from an unclickable link to the Wikipedia page http://en.wikipedia.org/wiki/Protoanguilla - seriously, a link that isn't clickable?), despite the fact that this fish has recently been described in an Open Access publication ("A 'living fossil' eel (Anguilliformes: Protanguillidae, fam. nov.) from an undersea cave in Palau", http://dx.doi.org/10.1098/rspb.2011.1289).

Now that I've got my customary grumble about EOL out of the way, let's look at the article itself. On the first page of the PDF it states:
This article cites 29 articles, 7 of which can be accessed free
http://rspb.royalsocietypublishing.org/content/early/2011/09/16/rspb.2011.1289.full.html#ref-list-1

So 22 of the articles or books cited in this paper are, apparently, not freely available. However, looking at the list of literature cited it becomes obvious that rather more of these citations are available online than we might think. For example, there are articles that are in the Biodiversity Heritage Library (BHL), e.g.


Then there are articles that are available in other digitising projects

  • Hay O. P. 1903 On a collection of Upper Cretaceous fishes from Mount Lebanon, Syria, with descriptions of four new genera and nineteen new species. Bull. Am. Mus. Nat. Hist. N. Y. 19, 395–452. http://hdl.handle.net/2246/1500
  • Nelson G. J. 1966 Gill arches of fishes of the order Anguilliformes. Pac. Sci. 20, 391–408. http://hdl.handle.net/10125/7805

Furthermore, there are articles that aren't necessarily free, but which have been digitised and have DOIs that have been missed by the publisher, such as the Regan paper above, and


So, the Proceedings of the Royal Society has underestimated just how many citations the reader can view online. The problem, of course, is how does a publisher discover these additional citations? Some have been missed because of sloppy bibliographic data. The missing DOIs are probably because the Regan citation lacks a volume number, and the Trewavas paper uses a different volume number to that used by Wiley (who digitised Proc. Zool. Soc. Lond.). But the content in BHL and other digital archives will be missed because finding these is not part of a publisher's normal workflow. Typically citations are matched by using services ultimately provided by CrossRef, and the bulk of BHL content is not in CrossRef.

So it seems there's an opportunity here for someone to provide a service for publishers that adds value to their content in at least three ways:
  1. Add missing DOIs due to problematic citations for older literature
  2. Add links to BHL content
  3. Add links to content in additional digitisation projects, such as journal archives in DSpace repositories

For readers this would enhance their experience (more of the literature becomes accessible to them), and for BHL and the repositories it will drive more readers to those repositories (how many people reading the paper on Protoanguilla palau have even heard of BHL?). I've said most of this before, but I really think there's an opportunity here to provide services to the publishing industry, and we don't seem to be grasping it yet.

Adding article-level metadata to BHL

Recently I've been thinking about the best ways to make article-level metadata from BioStor more widely available. For example, for someone visiting the BHL site there is no easy way to find articles, which are the basic unit for much of the scientific literature. How hard would it be to add articles to BHL? In the past I've wanted an all-singing, all-dancing article-level interface to BHL content (sort of BioStor on steroids), but that's a way off, and ideally would have a broader scope than BHL. So instead I've been thinking of ways to add articles to BHL without requiring a lot of re-engineering of BHL itself.

Looking at other digital archive projects like Gallica and Google Books it strikes me that if the BHL interface to a scanned item had a "Contents" drop down menu then users would be able to go to individual articles very easily. Below is a screen shot of how Gallica does this (see http://gallica.bnf.fr/ark:/12148/bpt6k61331684/f57).

Gallica

There's also a screen shot of something similar in Google Books (see http://books.google.co.uk/books?id=PkvoRnAM6WUC)

Contents

The idea would be that if BioStor had found articles within a scanned item, they would be listed in the contents menu (title, author, starting page), and if the user clicked on the article title then the BHL viewer would jump to that page. If there were no known articles, but the scanned item had a table of contents flagged (e.g., http://www.biodiversitylibrary.org/item/25703) then the menu could function as a button that takes you to that page. If there are no articles or contents, then the menu could be grayed out, or simply not displayed. This way the interface would work for books, monographs, and journal volumes.
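
The three menu states described above can be sketched as a small decision function. This is only an illustration of the logic, not BHL or BioStor code: the `Article` and `ScannedItem` classes, their field names, and the page identifiers used in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    title: str
    authors: str
    start_page_id: int  # page the viewer should jump to


@dataclass
class ScannedItem:
    item_id: int
    articles: List[Article] = field(default_factory=list)
    toc_page_id: Optional[int] = None  # page flagged as a table of contents


def contents_menu(item: ScannedItem) -> dict:
    """Decide what the "Contents" control should do for a scanned item."""
    if item.articles:
        # Known articles: list them, each entry jumping to its first page.
        entries = [
            {"title": a.title, "authors": a.authors, "page_id": a.start_page_id}
            for a in item.articles
        ]
        return {"mode": "articles", "entries": entries}
    if item.toc_page_id is not None:
        # No articles, but a flagged contents page: act as a button.
        return {"mode": "toc_button", "page_id": item.toc_page_id}
    # Nothing to show: gray out or hide the menu.
    return {"mode": "hidden"}
```

For instance, `contents_menu(ScannedItem(item_id=25703, toc_page_id=1))` would return the button state, with the page identifier `1` standing in for whichever page is actually flagged as the table of contents.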

Now, admittedly this is not the most elegant interface, and it treats articles as fragments of books rather than individual units, but it would be a start. It would also require minimal effort both on the part of BHL (who need to add the contents button), and myself (it would be easy to create a dump of the article titles indexed by scanned item).

More BHL app ideas

Hero rosellas
Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

N2 w1150

which gives you no idea that it contains images like this:

n24_w1150
Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
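
As a rough sketch of the second option, the snippet below builds an OpenURL that carries the URL of the linking page, and shows how the receiving side could read it back out. It assumes the BHL resolver lives at the address shown and that the referring page travels in the `rfr_id` (referrer identifier) key; whether BHL would record that key is an assumption, and the Wikispecies page URL in the usage note is a made-up example.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed address of the BHL OpenURL resolver.
BHL_RESOLVER = "http://www.biodiversitylibrary.org/openurl"


def make_openurl(referring_page: str, **citation: str) -> str:
    """Build an OpenURL for a citation, tagging it with the linking page."""
    params = dict(citation)
    params["rfr_id"] = referring_page  # assumption: referrer carried in rfr_id
    return BHL_RESOLVER + "?" + urlencode(params)


def referring_page(openurl: str) -> Optional[str]:
    """Recover the linking page from an incoming OpenURL, if present."""
    values = parse_qs(urlparse(openurl).query).get("rfr_id")
    return values[0] if values else None
```

A wiki template could then emit something like `make_openurl("http://species.wikimedia.org/wiki/Example", title="Proceedings of the Entomological Society of Washington", volume="109")`, and BHL could call `referring_page` on each incoming request to accumulate the reverse links for an item.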





Suggested apps for BHL's Life and Literature Code Challenge


Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17) so having ideas now isn't terribly helpful, I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by its absence is a prize for winning the challenge. PLoS and Mendeley have an "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the latter, and the images were tagged with what they depict, you could create a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine: harvest images, tag them with PageID and taxon names, upload to Flickr).
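
A first pass at finding captions in OCR text could be as simple as a regular expression run over each page. The pattern below is a guess at common caption styles ("Fig. 1", "Figure 2", "Plate IV", "Pl. 3") rather than anything tuned on real BHL OCR, and the function names are made up for this sketch.

```python
import re

# Line starting with a figure or plate label, a number (optionally with a
# letter, e.g. "1a") or a Roman numeral, then the caption text.
CAPTION = re.compile(
    r"^\s*(Fig\.|Figure|Plate|Pl\.)\s*(\d+[a-z]?|[IVXLC]+)\b[.:,]?\s*(.*)$"
)


def find_captions(pages: dict) -> list:
    """Return (page_id, label, caption_text) for every caption-like line.

    `pages` maps a page identifier to that page's OCR text.
    """
    hits = []
    for page_id, text in pages.items():
        for line in text.splitlines():
            match = CAPTION.match(line)
            if match:
                label = f"{match.group(1)} {match.group(2)}"
                hits.append((page_id, label, match.group(3).strip()))
    return hits


def neighbouring_pages(page_order: list, page_id, window: int = 1) -> list:
    """Pages around a caption hit, for plates printed on a facing page."""
    i = page_order.index(page_id)
    return page_order[max(0, i - window): i + window + 1]
```

Running a taxonomic name finder over the caption text returned by `find_captions` would give the name-indexed image search, and `neighbouring_pages` covers the case where the plate sits next to, rather than on, the page with the caption.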

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites, for example Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content poses some challenges, but is fundamentally the same as viewing PDFs: you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple user interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would be great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent of the Google Books Ngram Viewer.
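
The data behind such a timeline is just a count of name occurrences per year. The sketch below assumes you already have (name, year) pairs, for example one pair per BHL page on which a name was found; how those pairs are obtained, and the placeholder names in the usage note, are outside what this post describes.

```python
from collections import Counter


def name_timeline(occurrences: list, name: str) -> dict:
    """Count occurrences of one name per year, ordered by year."""
    counts = Counter(year for n, year in occurrences if n == name)
    return dict(sorted(counts.items()))


def compare_names(occurrences: list, names: list) -> dict:
    """Side-by-side yearly counts for several names (e.g., synonyms)."""
    timelines = {name: name_timeline(occurrences, name) for name in names}
    years = sorted({year for t in timelines.values() for year in t})
    return {
        year: {name: timelines[name].get(year, 0) for name in names}
        for year in years
    }
```

Given pairs for two placeholder names such as "Aus bus" and "Aus cus", `compare_names` returns one row per year with a count for each name, which is all a charting library needs to draw the two curves together.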

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.


Correcting OCR using hOCR in Firefox

Quick post on a little tool I came across, moz-hocr-edit. This Firefox add-on lets you proofread Optical Character Recognition (OCR) output. Given my interest in OCR and the Biodiversity Heritage Library I decided to take it for a spin.

moz-hocr-edit uses hOCR, which is a format for representing the output of OCR software, and is used by tools such as OCRopus (you can see the public specification for hOCR here). Basically it's a microformat, that is, it's HTML with some additional class and title attributes. Given some hOCR, moz-hocr-edit enables you to edit the OCR output line-by-line.
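
To give a feel for the format, here is a minimal sketch that pulls the text lines and their bounding boxes out of an hOCR fragment using only Python's standard library. It relies on the hOCR conventions of marking lines with the `ocr_line` class and storing `bbox x0 y0 x1 y1` in the `title` attribute; the class and function names are my own, and the sample markup in the usage note is invented rather than taken from the demo file.

```python
from html.parser import HTMLParser


def parse_bbox(title: str):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLines(HTMLParser):
    """Collect (bbox, text) for every span with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0    # span nesting depth inside the current line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span, e.g. a word inside the line
        elif tag == "span" and "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def hocr_lines(html: str) -> list:
    """Parse an hOCR document and return its lines in document order."""
    parser = HocrLines()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Feeding it a fragment such as `<span class='ocr_line' title='bbox 100 50 900 90'>Case 3368</span>` yields the line text together with its coordinates, which is the same information an editor like moz-hocr-edit needs in order to show each line next to the corresponding part of the page image.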

Demo
I've created a simple demo based upon Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation. For the demo to work you will need to use the Firefox web browser with the moz-hocr-edit add-on installed.

  1. Go to http://dl.dropbox.com/u/639486/hocr/80780.html
  2. You will see a simple HTML representation of the OCR text from "Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation". I created this HTML from the original ABBYY FineReader XML from the Internet Archive.
  3. On the bottom right-hand of the Firefox browser window you should see hOCR. Click on it and select "Edit this hOCR document":
    Statusbar
  4. Firefox will open a new tab that will look something like this:
    Screenshot
  5. You can now edit individual lines of text, and see your edits applied to the HTML below.
moz-hocr-edit is a neat little tool. With appropriate web server settings (and, as the tool's author Jim Garrison suggests, autoversioning) it could be the basis of a great tool for correcting OCR errors in BHL.