Elsevier Grand Challenge paper out

At long last the peer-reviewed version of the paper "Enhanced display of scientific articles using extended metadata" (doi:10.1016/j.websem.2010.03.004), in which I describe my entry in the Elsevier Grand Challenge, has appeared in the journal Web Semantics: Science, Services and Agents on the World Wide Web. The pre-print version of this paper had been online (hdl:10101/npre.2009.3173.1) for a year before the published version appeared (24 April 2009 versus 3 April 2010), and the Challenge entry itself went online in December 2008. Unfortunately the published version has an awful typo in the title (one that was in neither the manuscript nor the proofs).

Given this typo, the time lag between doing the work, writing the manuscript, and seeing it published, and the fact that I've already been to meetings where my invitation was based on the entry and the pre-print, I do wonder why on Earth I would bother with traditional publication (which is somewhat ironic, given the topic of the paper).

Biodiversity informatics = #fail (and what to do about it)

The context for this post is the PLoS markup meeting held at the California Academy of Sciences over the weekend (many thanks to Brian Fisher for the invitation). PLoS are launching a "biodiversity hub" and were looking for ideas on how to implement this. The fact that nobody -- least of all those attending from PLoS -- could adequately explain what a hub was made things a tad tricky, but that didn't matter, because PLoS did know when the first iteration of the hub was going live (later this summer). So, once we got past the fact that PLoS operates with a timeline that says "cool stuff will happen here" and then sets about figuring out what that cool stuff will actually be (in retrospect you gotta admire this approach), we tried to figure out what PLoS needed from us.

That's when things got messy. It became very clear that PLoS wanted basic things like, you know, information on names, being able to link to specimens, etc., and our community can't do this, at least not yet. Nor can we provide simple answers to simple questions. For example, Rich Pyle gave an overview of taxonomic names, nomenclature, concepts, and the horrendous alphabet soup of databases (uBio, ZooBank, IPNI, IndexFungorum, GNA, GNUB, GNI, CoL, etc.) that have a stake in this. You could see the look of horror in the eyes of the PLoS developers who were tasked with making the hub happen ("run away, run away now"). And this was after the simple version of things. In a week where taxonomy was in the news because of the possibility that Drosophila melanogaster would have to, *cough*, change its name (doi:10.1038/464825a) [1], this was not a great start.

At each step when we outlined some of the stuff that would be cool, it became clear we couldn't deliver what we were actually arguing PLoS should do. For example, we have millions of digitised specimen records, and lots of papers refer to these specimens by name, but because individual specimens don't have URIs we can't refer to them (instead we have horrific query interfaces like TAPIR, see Accessing specimens using TAPIR or, why do we make this so hard?). We're digitising the taxonomic literature, but don't provide a way to link this to modern literature at the level of granularity publishers use (i.e., articles).

Readers of this blog will have heard this all before, but what made this meeting different was we actually had a "customer" rock up and ask for our help to enhance their content and create something useful for the community...and the best we could do was um and er and confess we couldn't really give them what they wanted [2].

Think of the children
It's time biodiversity informatics stopped playing "let's make an acronym", stopped trying to keep taxonomists happy (face it, that's never going to happen, and frankly, they'll be extinct soon anyway), and stopped obsessing with who owns the data, and instead focus on delivering some simple, solid, services that address the needs of people who, you know, will actually do something useful with them. Otherwise we'll be like digital librarians, who thought people would search the way librarians do, then got their nose out of joint when Google ate their lunch.

It's time to make some simple services, and stop the endless cycle of inward-looking meetings where we talk to each other. We need to learn to hide what people don't need (or want) to see. We need to be able to:

  1. Extract entities from text, e.g. scientific names, specimen codes, localities, GenBank accession numbers.

  2. Lookup a taxonomic name and return basic information about that name (rather like iSpecies but as a service).

  3. Make specimen codes resolvable.

  4. Make taxonomic literature accessible using identifiers and tools publishers know about (that means DOIs and OpenURL).
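
To make service (1) concrete, here's a minimal sketch of entity extraction using regular expressions. The patterns and the function name are my own illustrative placeholders, deliberately naive (real name-finding tools such as uBio's taxonomic name finder are far more sophisticated), but they show the kind of service publishers could call:

```python
import re

# Illustrative patterns only -- a production service would use proper
# name-finding algorithms and dictionaries, not bare regexes.
GENBANK = re.compile(r'\b[A-Z]{1,2}\d{5,6}\b')        # e.g. AY123456
BINOMIAL = re.compile(r'\b[A-Z][a-z]+ [a-z]{3,}\b')   # e.g. Drosophila melanogaster

def extract_entities(text):
    """Return candidate GenBank accessions and scientific names found in text."""
    return {
        "accessions": GENBANK.findall(text),
        "names": BINOMIAL.findall(text),
    }
```

A publisher could run this over article text at ingest time and hand the candidate entities to lookup services (service 2) for validation.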


We're close to a lot of this already, but we're still far enough away to make some of this non-trivial. And we keep having meetings about this stuff, and fail to actually get it done. Something is wrong somewhere when E O Wilson has his name on yet another call for megabucks for a biodiversity project (the "Barometer of Life", doi:10.1126/science.1188606). At what point will someone ask "um, we've given you guys a lot of money already, why can't you tell me the stuff we need to know?"

Let me just say that I'm a short term pessimist, but a long term optimist. The things I complain about will get fixed, one day. It's just that I see little evidence they'll get fixed by us. Prove me wrong, go on, I dare you...

  1. Personally I'm intensely relaxed about Drosophila melanogaster remaining Drosophila melanogaster, even if it ends up in a clade surrounded by flies with other generic names. Having (a) a stable name and (b) knowing where it fits in the tree of life is all we need to do science.

  2. At the meeting I couldn't stop thinking of the scene in The West Wing where President Bartlett walks up to the Capitol for an impromptu meeting with the Speaker of the House to sort out the budget, and is left waiting outside while the Speaker sorts out his game plan. By the time the Speaker is ready, the President has turned on his heels and left, making the Speaker look a tad foolish.


BioStor gets PDFs with XMP metadata - bibliographies made easy with Mendeley and Papers

The ability to create PDFs for the articles BioStor extracts from the Biodiversity Heritage Library has been the single most requested feature for BioStor. I've taken a while to get around to this -- for a bunch of reasons -- but I've finally added it today. You can get a PDF of an article by either clicking on the PDF link on the page for an article, or by appending ".pdf" to the article URL (e.g., http://biostor.org/reference/570.pdf). In some ways the BioStor PDFs are pretty basic - they contain page images, not the OCR text, so they tend to be quite large and you can't search for text within the article. But what they do have is XMP metadata.

XMP metadata
One of the great bugbears about organising bibliographies is the lack of embedded metadata in PDFs, in other words Why can't I manage academic papers like MP3s? (see my earlier post for some background). Music files and digital photos contain embedded metadata that store information such as song title and artist in the case of music, or date, exposure, camera model, and GPS co-ordinates in the case of digital images. This means software (and web sites such as Flickr) can automatically organise your collection of media based on this embedded metadata.

Wouldn't it be great if there was an equivalent for PDFs of papers, whereby the PDF contains all the relevant bibliographic details (article title, authorship, journal, volume, pages, etc.), and reference managing software could read this and automatically put the PDF into whatever categories you chose (e.g., by author, journal, or date)? Well, at least two software programs can do this, namely the cross-platform Mendeley, and Papers, which supports Apple's Macintosh, iPhone, and iPad platforms. Both programs can read bibliographic metadata in Adobe's Extensible Metadata Platform (XMP), which has been adopted by journals such as Nature, and CrossRef has recently been experimenting in providing services to add XMP to PDFs.

One reason I put off adding PDFs to BioStor was the issue of simply generating dumb PDFs for which users would then have to retype the corresponding bibliographic metadata if they wanted to store the PDF in a reference manager. However, given that both Papers and Mendeley support XMP, you can simply drag the PDF on to either program and they will extract the details for you (including a list of up to 10 taxonomic names found in the article). Both Papers and Mendeley support the notion of a "watched folder" where you can dump PDFs and they will "automagically" appear in your reference manager's library. Hence, if you use either program you should be able to simply download PDFs from BioStor and add them to your library without having to retype anything at all.

Technical details
This post is being written as I'm waiting to catch a plane, so I haven't time to go into all the gory details. The basic tools I used to construct the PDFs were FPDF and ExifTool, which supports injecting XMP into PDFs (I couldn't find another free tool that could insert XMP into a PDF that didn't already have any XMP metadata). I store basic Dublin Core and PRISM metadata in the PDF. The ten most common taxonomic names found in the pages of the article are stored as subject tags.
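
To give a flavour of what gets embedded (this is a sketch in the spirit of what BioStor stores, not its actual code), an XMP packet combining Dublin Core and PRISM fields looks something like the output of this function:

```python
# Sketch: assemble an XMP packet with Dublin Core (dc:) and PRISM (prism:)
# bibliographic metadata, the two vocabularies mentioned above.
from xml.sax.saxutils import escape

def build_xmp(title, authors, journal, volume, start_page, end_page, subjects):
    """Return an XMP packet string for embedding in a PDF."""
    creators = "".join(f"<rdf:li>{escape(a)}</rdf:li>" for a in authors)
    keywords = "".join(f"<rdf:li>{escape(s)}</rdf:li>" for s in subjects)
    return f"""<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">{escape(title)}</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq>{creators}</rdf:Seq></dc:creator>
   <dc:subject><rdf:Bag>{keywords}</rdf:Bag></dc:subject>
   <prism:publicationName>{escape(journal)}</prism:publicationName>
   <prism:volume>{volume}</prism:volume>
   <prism:startingPage>{start_page}</prism:startingPage>
   <prism:endingPage>{end_page}</prism:endingPage>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""
```

A tool like ExifTool can then write a packet like this into a PDF that lacks XMP; the taxonomic names go into the dc:subject bag.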

Initially it appeared that only Papers could extract XMP; Mendeley failed completely (somewhat confirming my prejudices about Mendeley). However, I sent an example PDF to Mendeley support, and they helpfully diagnosed the problem. Because XMP metadata can't always be trusted, Mendeley compares title and author values in the XMP metadata with text on the first couple of pages of the PDF. If they match, then the program accepts the XMP metadata. Because my initial efforts at creating PDFs just contained the BHL page images and no text, they wouldn't pass Mendeley's tests. Hence, I added a cover page containing the basic bibliographic metadata for the article, and now Mendeley is happy (the program itself is growing on me, but if you're an Apple fanboy like me, Papers has native look and feel, and syncing your library with your iPhone is a killer feature). There are a few minor differences in how Papers and Mendeley handle tags. Papers will take text in the Dublin Core "Subject" tag and use those as keywords, whereas to get Mendeley to extract tags I had to store them using the "Keywords" tag (implemented using FPDF's SetKeywords function). But after a bit of fussing I think the BioStor PDFs should play nice in both programs.
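
My reading of Mendeley support's explanation (a paraphrase of the heuristic, not their actual code) is that the sanity check amounts to something like this:

```python
# Sketch of the XMP trust check as I understand it: only accept embedded
# metadata if the title and an author also appear in the text extracted
# from the PDF's opening pages. Function name is my own invention.
def xmp_looks_trustworthy(xmp_title, xmp_first_author, first_pages_text):
    """Crude cross-check of XMP fields against the PDF's visible text."""
    text = first_pages_text.lower()
    return xmp_title.lower() in text and xmp_first_author.lower() in text
```

An image-only PDF yields no extractable text, so this check can never pass, which is exactly why adding a text cover page fixed the problem.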