Does the legacy biodiversity literature matter?

I've just come back from a pro-iBiosphere Workshop at Leiden where the role of "legacy literature" became the subject of some discussion. This continued on Twitter as Ross Mounce (@rmounce) and I went back and forth.
Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature is of most interest to taxonomists, whereas other biologists are more likely to find what they want in the ever-growing recent literature. I was arguing that because many taxa are poorly studied, the chances that you will find data on your organism in the recent literature are likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fit the first two categories). My view is based on papers such as Bob May's 1988 paper:
MAY, R. M. (1988). How Many Species Are There on Earth? Science, 241(4872), 1441-1449. doi:10.1126/science.241.4872.1441
In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false).

At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature as whether it was "born digital", that is, whether the document has been in digital form from authoring through to publication, so that the output is in a format (e.g., HTML, XML, or a PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable because the text has to be extracted by OCR, an error-prone process.

Given these errors, is the effort worth it? At this point I should say that BHL is not using the best OCR technology available (my own experience suggests that ABBYY Online is much better), and our community is not making use of research on automating OCR correction.
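To give a flavour of what automated correction can look like, here is a minimal sketch that fuzzily matches OCR output against a list of known names using Python's standard library. The OCR tokens and the name list are invented for illustration, and real correction pipelines are considerably more sophisticated:

```python
import difflib

# Hypothetical examples of OCR output of taxon names, with typical
# character-recognition errors (e.g., "m" read as "rn", "i" as "1").
ocr_tokens = ["Praomys hartwigi", "Praornys bartwigi", "Pra0mys hartw1gi"]

# A tiny, invented reference vocabulary of known taxon names.
known_names = ["Praomys hartwigi", "Praomys morio", "Praomys tullbergi"]

for token in ocr_tokens:
    # get_close_matches does fuzzy string matching; cutoff tunes how
    # similar a candidate must be to count as a plausible correction.
    matches = difflib.get_close_matches(token, known_names, n=1, cutoff=0.8)
    corrected = matches[0] if matches else "(no confident match)"
    print(f"{token!r} -> {corrected!r}")
```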
But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:

Jones, K. E., Bielby, J., Cardillo, M., Fritz, S. A., O'Dell, J., Orme, C. D. L., & Purvis, A. (2009). PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. (W. K. Michener, Ed.) Ecology, 90(9), 2648-2648. doi:10.1890/08-1494.1
PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best-studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the years in which these publications appeared (a sketch of how such a tally can be computed follows the table):

Decade starting    Publications
1840               1
1860               1
1890               1
1900               10
1910               4
1920               14
1930               48
1940               61
1950               114
1960               295
1970               527
1980               865
1990               1019
2000               183
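For anyone wanting to reproduce this kind of tally, here is a minimal sketch, assuming the citations have been exported one per line to a plain text file (the filename "pantheria_references.txt" and the year-extraction regex are my assumptions):

```python
import re
from collections import Counter

# Hypothetical input: one citation per line, e.g. exported from the
# PanTHERIA metadata ("pantheria_references.txt" is an assumed filename).
with open("pantheria_references.txt") as f:
    citations = f.read().splitlines()

decades = Counter()
for citation in citations:
    # Crude assumption: the first plausible 4-digit year (1800-2099)
    # appearing in the citation string is the publication year.
    m = re.search(r"\b(1[89]\d{2}|20\d{2})\b", citation)
    if m:
        year = int(m.group(1))
        decades[year - year % 10] += 1  # bin by decade

for decade in sorted(decades):
    print(decade, decades[decade])
```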
The bulk of the papers came from the second half of the 20th century, and many of these are "legacy" in the sense that they are in archives like JSTOR, and hence the PDFs are based on scanned images and OCR. The oldest papers are from the 19th century, which is legacy by anyone's definition. My interpretation of this data is that even for a well-studied group such as mammals, the basic organismal-level data sought by comparative biologists is in the "legacy" literature. My suspicion is that if we attempt to build PanTHERIA-style databases for other, less well-studied taxa, the data (if it exists at all) will be found not in the modern literature (where the focus has long since moved on from the organism to genomics and systems biology) but in the corpus of taxonomic and ecological literature that is being scanned and stored in digital archives.

Update
I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.

More GBIF specimen identifier strangeness

Continuing the theme of trying to map specimens cited in the literature to the equivalent GBIF records, consider the GBIF record http://data.gbif.org/occurrences/685591320, which according to GBIF is specimen "ZFMK 188762" (a [sic] holotype of Praomys hartwigi).

This is odd, because the original publication of this name (Eisentraut, M. 1968. Beitrag zur Saugetierfauna von Kamerun. Bonner Zoologische Beitraege, 19:1-14, see PDF below) gives the type (p. 11) as "Museum A. Koenig, Kat. Nr. 68.7".



The GBIF record includes links to images of ZFMK 188762, such as http://www.biologie.uni-ulm.de/cgi-bin/imgobj.pl?sid=T&lang=e&id=102323.


If we open this link we see that the specimen is listed as "ZFMK-68.7", which matches the original description. "ZFMK-68.7" is a link to http://www.biologie.uni-ulm.de/cgi-bin/herbar.pl?herbid=188762&sid=T&lang=e, which is the record for this specimen in the SysTax database.

Note that this URL includes the number 188762, which is treated as the catalogue number by GBIF (i.e., "ZFMK 188762"). So, it seems that in the data provided by SysTax the primary key in that database (188762) has become the catalogue number in GBIF (I tried to verify this by clicking on the original provider message on the GBIF page, but it failed to produce anything). This means any naive attempt to locate the specimen "ZFMK-68.7" in GBIF is going to fail, because the harvesting and indexing has conflated a local primary key with the catalogue number that appears in publications that refer to this specimen.
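For what it's worth, here is a minimal sketch of how one might check what GBIF serves as the catalogue number for a record, using the GBIF v1 occurrence API (which post-dates the data.gbif.org URL above; whether the old record id 685591320 resolves as an occurrence key in that API is an assumption on my part):

```python
import json
import urllib.request

# Assumed: the GBIF v1 REST API, and that the old data.gbif.org record
# id also works as an occurrence key there (this may not hold).
key = 685591320
url = f"https://api.gbif.org/v1/occurrence/{key}"

with urllib.request.urlopen(url) as resp:
    record = json.load(resp)

# If the provider exported its internal primary key as the catalogue
# number, this will print something like "188762" rather than "68.7".
print("institutionCode:", record.get("institutionCode"))
print("catalogNumber:", record.get("catalogNumber"))
```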

Sometimes I think we are doing our level best to make retrieving data as hard as possible...

Thoughts on Mendeley and Elsevier



The rumour that Elsevier is buying Mendeley has been greeted with a mixture of horror and anger, peppered with a few congratulations, "I told you so"s, and touting for new customers.



Here's some probably worthless speculation to add to the mix. Disclosure: I use Mendeley to manage hundreds of thousands of references, and use the API for various projects. I'm not a paying customer (but I do pay for some Internet services, such as DropBox, BackBlaze, and Spotify, so it's not that I won't pay, it's just that the service Mendeley charge for doesn't interest me). I've published in Elsevier journals (most recently a couple of papers that, thanks to the efforts of Paul Craze, editor of TREE, are "free" in the sense that you can download the PDF for free), and I took part in the Elsevier Grand Challenge.

So, given that I'm suitably compromised, here are some thoughts.

Elsevier suck


Elsevier are big, ugly, and at the corporate level are doing things that actively make researchers angry (see The Cost of Knowledge).

Elsevier rocks


Elsevier are one of the most innovative science publishers around. They fund challenges, are investing heavily in interactive and semantic markup of papers (for example, interactive phylogenies), and have built an app ecosystem on their publishing platform.

Mendeley sucks


Mendeley is suffering from some serious failings, most of which could be addressed with sufficient resources. The API sucks, mostly because Mendeley themselves don't actually use it. The Desktop client communicates with Mendeley's database using a different protocol, hence the API lacks the functionality needed to make truly great apps on the platform. The algorithms Mendeley use to de-duplicate their catalogue are flawed, occasionally creating entirely fictional entries.

Mendeley rocks


The way Mendeley engineered the creation of a bibliographic database in the cloud is genius, as is their recognition that the object around which scientists will cluster is the article, not the author. They helped foster the altmetrics movement, and have a great presence on Twitter and at conferences (i.e., you can talk to actual people who write code).

What happens next?


Let's assume that Elsevier does, indeed, buy Mendeley, wants to do interesting things with it, and that Mendeley doesn't become one of the many startups that have a successful "exit" for the founders but end up dying in the bosom of a larger company. Here are some possibilities.

Mendeley becomes iTunes for papers


Forget the "Last.fm" of papers, what about the "iTunes of papers"? Big publishers are facing a revolt over the cost of institutional subscriptions, and journals are increasingly irrelevant as aggregations. The literature that people read is widely scattered across different outlets. Journals are archaic in the same way that music albums are: albums are mostly a thing of the past, and people mix and match singles.

In the recent fight between UC Davis and Nature, Nature estimated that "CDL will be paying roughly $0.56 per download". So, why not charge a buck a paper? Mendeley's web interface is practically crying out for a "BUY THIS PAPER" button. Under this model, Elsevier has an outlet for its content that doesn't force people to subscribe to large amounts of stuff they don't want. Mendeley could be used to establish a relationship directly with paying customers, rather than institutions.

Mendeley becomes the de facto measure of research impact


By combining Mendeley's readership data with citations, Elsevier could construct powerful measures of research impact, bringing altmetrics into the mainstream. Couple this with links to institutions, and Elsevier could provide universities with all the data they need to evaluate academic performance (gulp).
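At its simplest, such a measure could be a weighted combination of readership and citation counts. The sketch below is purely illustrative: the papers, field names, weights, and log damping are all my own invention, not any real Elsevier or Mendeley metric:

```python
import math

# Invented example records: per-paper readership and citation counts.
papers = [
    {"doi": "10.1000/example.1", "readers": 540, "citations": 12},
    {"doi": "10.1000/example.2", "readers": 35,  "citations": 90},
]

def impact(paper, w_readers=0.4, w_citations=0.6):
    # log1p damps the effect of very large counts; the weights are
    # arbitrary choices for illustration only.
    return (w_readers * math.log1p(paper["readers"])
            + w_citations * math.log1p(paper["citations"]))

# Rank papers by the combined score.
for p in sorted(papers, key=impact, reverse=True):
    print(p["doi"], round(impact(p), 2))
```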

Mendeley becomes an authoring tool


Managing references and inserting citations into manuscripts is one of the basic tasks facing an academic author. Authoring tools are evolving in the direction of being online, and embedding more semantic markup (e.g., these are taxon names, this is a chemical compound, this is a statement of causality). In a sense reference lists are the one form of structured markup we are already familiar with. Why not build on that and create an authoring platform?
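As a toy illustration of the kind of inline semantic markup I mean, here is a sketch that tags taxon names in manuscript text; the name list and the HTML class are assumptions, and a real authoring tool would use curated vocabularies rather than a hard-coded list:

```python
import re

# Hypothetical: a known list of taxon names to tag in manuscript text.
taxon_names = ["Praomys hartwigi", "Praomys morio"]

text = "The type of Praomys hartwigi was collected in Cameroon."

# Wrap each recognised name in a span carrying a machine-readable class,
# the sort of markup an authoring tool could insert as you type.
for name in taxon_names:
    text = re.sub(re.escape(name),
                  f'<span class="taxon-name">{name}</span>',
                  text)

print(text)
```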

Mendeley becomes the focus of post-publication review


Publishers have failed to crack the problem of post-publication review. Several provide the ability for readers to comment on an article online, but this has failed to take off. I think this is because the sociology is wrong: if you want a conversation you need to go where the people are, not expect them to come to you. Given that people are bookmarking papers in Mendeley, the next step is to get them to comment, or to aggregate their annotations (in the same way that Amazon's Kindle can show you passages that others have highlighted).

Interesting times...