
Biodiversity Heritage Library, Google books, and metadata quality

I've been playing recently with the Biodiversity Heritage Library (BHL), and am starting to get a sense for the complexities (and limitations) of the metadata BHL stores about publications. The more I look at BHL the more I think the resource is (a) wonderfully useful and (b) hampered by some dodgy metadata.

The BHL data model has three kinds of entities: "Titles", "Items", and "Pages". Pages are individual pages in an item, and an item corresponds to a physical object that has been scanned (such as a book or a bound volume of a journal). A title may comprise a single item, such as a book, or many items, such as volumes of a journal. Most of the metadata BHL has relates to physical items (books and bound volumes), as opposed to article-level metadata, which is basically absent (see But where are the articles?).

bhl_model.png


This model reflects the sources of the BHL metadata (library catalogues) and the mode of operation (bulk scanning of bound volumes). But it can make working out dates of publication somewhat challenging.

To give an example, I did a search on the frog name Hyla rivularis Taylor, 1952 (NameBankID 27357), currently known as Isthmohyla rivularis. I wanted to find the original description of this frog. A BHL search returns 34 pages containing the name Hyla rivularis, distributed over 5 titles (a title in BHL may be a book, or a journal). Given that the name was published in 1952, it would be nice if I could sort these results by date, and then look at items from 1952. Unfortunately I can't. BHL has limited information on dates, especially at the level I would need to find a document published in 1952.

For the five titles returned in the search, I have dates for four of them, albeit two are ranges (University of Kansas publications, Museum of Natural History, 1946-1971, and The University of Kansas science bulletin, 1902-1996). At the level of individual items, only item 25858 (University of Kansas publications, Museum of Natural History) has dates (1961-1966). If I look at the VolumeInfo field for an item (you can get this from the database dump, or using the JSON web service) I sometimes get strings like this "v.35:pt.1 (1952)". This item (25857) is the one I'm after, but the date is buried in the VolumeInfo string. So, the information I need is there, but it's going to need some parsing.
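
As a minimal sketch of the parsing involved, the function below pulls a year (or year range) out of a VolumeInfo string. The only format taken from BHL itself is "v.35:pt.1 (1952)"; the assumption that years always sit in parentheses, and that ranges look like "(1961-1966)", is mine, and the function name `parse_volume_years` is hypothetical. Real VolumeInfo strings are likely to be messier than this.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: a four-digit year in parentheses, optionally followed by
# a second year to form a range, e.g. "(1952)" or "(1961-1966)".
YEAR_PATTERN = re.compile(r"\((\d{4})(?:\s*-\s*(\d{4}))?\)")


def parse_volume_years(volume_info: str) -> Optional[Tuple[int, int]]:
    """Return (start_year, end_year) found in a VolumeInfo string, or None."""
    match = YEAR_PATTERN.search(volume_info)
    if match is None:
        return None  # no parenthesised year found
    start = int(match.group(1))
    # A single year is reported as a range of length one.
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)


if __name__ == "__main__":
    print(parse_volume_years("v.35:pt.1 (1952)"))
```

With something like this, items could be filtered to those whose range includes 1952, which is the sort I wanted in the first place.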

84203b53fc75bd5bbc7f6a62fe8500f1.jpeg

Another issue is that of duplicates. Searching for publications on Rana grahamii, I found items 41040 and 45847. Although one item is treated as a book, and the other as a volume of the journal Records of the Indian Museum, these are the same thing. Having duplicates is a complication, but it might also be useful for quality control and testing (for example, do taxon name extraction algorithms return the same names from OCR text from both copies?). Nor is having duplicate copies and/or identifiers unique to BHL. The Records of the Indian Museum has a series-level identifier (ISSN 0537-0744), and this article ("A monograph of the South Asian, Papuan, Melanesian and Australian frogs of the genus Rana") also has the ISBN 8121104327.
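
The quality-control idea could be sketched roughly as follows: take the sets of names extracted from each copy and measure how much they agree. This is only an illustration, using Jaccard similarity as one possible measure; the function names are hypothetical, and the name lists in the example are made up rather than real output from items 41040 and 45847.

```python
from typing import Iterable, Set


def normalise(names: Iterable[str]) -> Set[str]:
    """Lower-case and trim names so trivial differences don't count."""
    return {name.strip().lower() for name in names if name.strip()}


def name_overlap(names_a: Iterable[str], names_b: Iterable[str]) -> float:
    """Jaccard similarity between two sets of extracted taxon names."""
    set_a, set_b = normalise(names_a), normalise(names_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sets agree trivially
    return len(set_a & set_b) / len(union)


def names_only_in_first(names_a: Iterable[str], names_b: Iterable[str]) -> Set[str]:
    """Names found in the first copy but missing from the second."""
    return normalise(names_a) - normalise(names_b)


if __name__ == "__main__":
    # Illustrative input only, not real extraction results.
    copy_one = ["Rana grahamii", "Rana tigrina"]
    copy_two = ["rana grahamii"]
    print(name_overlap(copy_one, copy_two))
    print(names_only_in_first(copy_one, copy_two))
```

A low overlap between two scans of the same physical text would point to OCR or name-finding problems rather than real differences in content.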

There are parallels with the Google books scanning project, which has been the subject of criticism on several fronts, including the quality of the metadata it holds for each book. Geoff Nunberg has an entertaining post entitled Google Books: A Metadata Train Wreck which lists many examples of errors. This blog post also contains a detailed response from Jon Orwant of Google books. In essence, Google books is riddled with metadata errors (such as books on the Internet with publication dates predating the birth of their authors), but most of these errors have come from library catalogues rather than from Google itself, which is not unexpected given the scale of the task.

What could BHL do about its metadata? One thing is crowdsourcing. BHL does a little of this already, for example capturing user-provided metadata when PDFs are created, but I wonder if we could do more. For example, imagine dumping metadata for all 39,000 items into a semantic wiki and inviting people to edit and annotate the metadata. This could be extended to adding article boundaries (i.e., identifying which page corresponds to the start of an article). There is also considerable scope for trying to find article boundaries using existing metadata from bibliographies assembled by individual scientists.

But we should watch closely what Google does with its book project. Eric Hellman has argued that, far from creating the metadata mess, Google is ideally positioned to sort it out. He writes:
What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so?

Towards a wiki of phylogenies

At the start of this week I took part in a biodiversity informatics workshop at the Naturhistoriska riksmuseet, organised by Kevin Holston. It was a fun experience, and Kevin was a great host, going out of his way to make sure that I and the other contributors were looked after. I gave my usual pitch along the lines of "if you're not online you don't exist", and talked about iSpecies, identifiers, and wikis.

I also ran a short, not terribly successful exercise using iTaxon to demo what semantic wikis can do. As is often the case with something that hasn't been polished yet, the students found the mechanics of doing things less than intuitive. I need to do a lot of work making data input easier (to date I've focussed on automated adding of data, and forms to edit existing data). Adding data is easy if you know how, but the user needs to know more than they really should have to.

The exercise was to take some frog taxa from the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2) and link them to GenBank sequences and museum specimens. The hope was that by making these links new information would emerge. You could think of it as an editable version of this. With a bit of post-exercise tidying, we got some way there. The wiki page for the Frost et al. paper now shows a list of sequences from that paper (not all, I hasten to add), and a map for those sequences that the students added to the wiki:

frost.png


Although much remains to be done, I can't help thinking that this approach would work well for a database like TreeBASE, where one really needs to add a lot of annotation to make it useful (for example, mapping OTUs to taxon names, linking data to sequences and specimens). So, one of the things I'm going to look at is dumping a copy of TreeBASE (complete with trees) into the wiki and seeing what can be done with it. Oh, and I need to make it much, much easier for people to add data.

When taxonomists wage war in Wikipedia

Stumbled across Alex Wild's post Pyramica vs Strumigenys: why does it matter?, which takes as its starting point a minor edit war on the Wikipedia page for Pyramica.

Alex gives the background to the argument about whether Pyramica is a synonym of Strumigenys, and investigates the issue using the surprisingly small amount of data available in GenBank. The tree he found (shown below) suggests this issue will require some work to resolve:

phylogeny1.jpg


For fun I constructed a history flow diagram for the edits to the Pyramica page in Wikipedia:

5.png


The diagram shows the two occasions when the page has been stripped of content (and subsequently restored) as contributors dispute whether Pyramica is a synonym of Strumigenys. It would be useful to have one or more metrics of how controversial a page (and/or a contributor) is, both to identify controversial pages and to see how controversial taxonomic pages are compared to other Wikipedia topics. The paper On Ranking Controversies in Wikipedia: Models and Evaluation by Ba-Quy Vuong et al. (doi:10.1145/1341531.1341556) would be a good place to start (a video of the presentation of this paper is available here).
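
As a very crude first pass (and emphatically not one of the models from the Vuong et al. paper), one could simply count how often a page loses most of its content between successive revisions. The sketch below assumes we already have the size of each revision in bytes, oldest first; the function name, the 50% threshold, and the sizes in the example are all invented for illustration.

```python
from typing import Sequence


def count_blanking_events(sizes: Sequence[int], drop_fraction: float = 0.5) -> int:
    """Count revisions that shrink the page by more than drop_fraction.

    sizes: page size in bytes after each revision, oldest first.
    """
    events = 0
    for previous, current in zip(sizes, sizes[1:]):
        # A revision counts as a "blanking" if it removes more than the
        # given fraction of the preceding revision's content.
        if previous > 0 and current < previous * (1 - drop_fraction):
            events += 1
    return events


if __name__ == "__main__":
    # Made-up revision sizes with two strip-and-restore episodes.
    history = [1000, 1100, 200, 1100, 1150, 150, 1150]
    print(count_blanking_events(history))
```

Dividing the count by the total number of revisions would give a rate that could be compared across pages, although a serious metric would need to look at who is reverting whom.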