
BioStor gets PDFs with XMP metadata - bibliographies made easy with Mendeley and Papers

The ability to create PDFs for the articles BioStor extracts from the Biodiversity Heritage Library has been the single most requested feature for BioStor. I've taken a while to get around to this -- for a bunch of reasons -- but I've finally added it today. You can get a PDF of an article by either clicking on the PDF link on the page for an article, or by appending ".pdf" to the article URL (e.g., http://biostor.org/reference/570.pdf). In some ways the BioStor PDFs are pretty basic - they contain page images, not the OCR text, so they tend to be quite large and you can't search for text within the article. But what they do have is XMP metadata.

XMP metadata
One of the great bugbears of organising bibliographies is the lack of embedded metadata in PDFs, in other words Why can't I manage academic papers like MP3s? (see my earlier post for some background). Music files and digital photos contain embedded metadata that stores information such as song title and artist in the case of music, or date, exposure, camera model, and GPS co-ordinates in the case of digital images. This means software (and web sites such as Flickr) can automatically organise your collection of media based on this embedded metadata.

Wouldn't it be great if there were an equivalent for PDFs of papers, whereby the PDF contains all the relevant bibliographic details (article title, authorship, journal, volume, pages, etc.), and reference managing software could read these and automatically put the PDF into whatever categories you chose (e.g., by author, journal, or date)? Well, at least two software programs can do this, namely the cross-platform Mendeley, and Papers, which supports Apple's Macintosh, iPhone, and iPad platforms. Both programs can read bibliographic metadata in Adobe's Extensible Metadata Platform (XMP), which has been adopted by journals such as Nature, and CrossRef has recently been experimenting with providing services to add XMP to PDFs.

One reason I put off adding PDFs to BioStor was the issue of simply generating dumb PDFs for which users would then have to retype the corresponding bibliographic metadata if they wanted to store the PDF in a reference manager. However, given that both Papers and Mendeley support XMP, you can simply drag the PDF onto either program and they will extract the details for you (including a list of up to 10 taxonomic names found in the article). Both Papers and Mendeley support the notion of a "watched folder" where you can dump PDFs and they will "automagically" appear in your reference manager's library. Hence, if you use either program you should be able to simply download PDFs from BioStor and add them to your library without having to retype anything at all.

Technical details
This post is being written as I'm waiting to catch a plane, so I haven't time to go into all the gory details. The basic tools I used to construct the PDFs were FPDF and ExifTool, which supports injecting XMP into PDFs (I couldn't find another free tool that could insert XMP into a PDF that didn't already have any XMP metadata). I store basic Dublin Core and PRISM metadata in the PDF. The ten most common taxonomic names found in the pages of the article are stored as subject tags.
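To give a flavour of what this involves, here is a minimal sketch in Python of building an XMP packet with Dublin Core and PRISM fields. The helper name and field choices are mine for illustration, not BioStor's actual code (which uses FPDF and ExifTool):

```python
def build_xmp(title, authors, journal, volume, start_page, end_page, subjects):
    """Return a minimal XMP packet with basic Dublin Core and PRISM metadata.

    A sketch only: field selection is illustrative, not BioStor's exact schema.
    """
    author_items = "".join(f"<rdf:li>{a}</rdf:li>" for a in authors)
    subject_items = "".join(f"<rdf:li>{s}</rdf:li>" for s in subjects)
    return f"""<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq>{author_items}</rdf:Seq></dc:creator>
   <dc:subject><rdf:Bag>{subject_items}</rdf:Bag></dc:subject>
   <prism:publicationName>{journal}</prism:publicationName>
   <prism:volume>{volume}</prism:volume>
   <prism:startingPage>{start_page}</prism:startingPage>
   <prism:endingPage>{end_page}</prism:endingPage>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

# The packet could then be written to a sidecar .xmp file and injected
# into the PDF with ExifTool (check the ExifTool docs for the exact
# incantation), e.g. something like:
#   exiftool -tagsfromfile metadata.xmp -xmp article.pdf
```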

Initially it appeared that only Papers could extract XMP; Mendeley failed completely (somewhat confirming my prejudices about Mendeley). However, I sent an example PDF to Mendeley support, and they helpfully diagnosed the problem. Because XMP metadata can't always be trusted, Mendeley compares title and author values in the XMP metadata with text on the first couple of pages of the PDF. If they match, then the program accepts the XMP metadata. Because my initial efforts at creating PDFs just contained the BHL page images and no text, they wouldn't pass Mendeley's tests. Hence, I added a cover page containing the basic bibliographic metadata for the article, and now Mendeley is happy (the program itself is growing on me, but if you're an Apple fanboy like me, Papers has native look and feel, and syncing your library with your iPhone is a killer feature). There are a few minor differences in how Papers and Mendeley handle tags. Papers will take text in the Dublin Core "Subject" tag and use it as keywords, whereas to get Mendeley to extract tags I had to store them using the "Keywords" tag (implemented using FPDF's SetKeywords function). But after a bit of fussing I think the BioStor PDFs should play nicely in both programs.
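The sanity check Mendeley described could be sketched along the following lines. This is purely illustrative (I have no idea what their actual implementation looks like): normalise the XMP title and author surnames, then require both to appear in the text of the first page or two:

```python
import re

def xmp_matches_page_text(xmp_title, xmp_authors, first_pages_text):
    """Toy version of the check described above: accept the XMP metadata
    only if the title and at least one author surname also appear in the
    text extracted from the first page(s) of the PDF."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower())
    text = norm(first_pages_text)
    if norm(xmp_title) not in text:
        return False
    surnames = [norm(a.split(",")[0]) for a in xmp_authors]
    return any(s and s in text for s in surnames)
```

A PDF containing only page images yields no extractable text, so a check like this fails regardless of how good the XMP is, which is why the text-bearing cover page fixes the problem.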

TreeBASE II has been released


The TreeBASE team have announced that TreeBASE II has been released. I've put part of the announcement on the SSB web site. Given that TreeBASE and I have history, I think it best to keep quiet and see what others think before blogging about it in detail. Plus, there are a lot of new features to explore. Take it for a spin and see what you think.

Wiki frustration

Yesterday I fired off a stream of tweets, starting with:
OK, I'm back to hatin' on the wikis. It's just way to hard to do useful queries (e.g., anything that requires a path linking some entities) [10991434609]

Various people commented on this, either on twitter or in emails, e.g.:
@rdmpage hatin' wikis is irrational - all methods have pros and cons, so wikis are an essential resource among others [@stho002]

So, to clarify, I'm not abandoning wikis. I'm just frustrated with the limitations of Semantic Mediawiki (SMW). Now, SMW is a great piece of software with some cool features. For example,

  1. Storing data in Mediawiki templates (i.e., key-value pairs) makes it rather like a JSON database with version control (shades of CouchDB et al.).

  2. Having Mediawiki underlying SMW makes lots of nice editing and user management features available for free.

  3. The template language makes it relatively easy to format pages.

  4. It is easy to create forms for data entry, so users aren't confronted with editing Mediawiki templates unless they want to.

  5. It supports basic queries.


It's the last item on the list that is causing me grief. The queries are, well, basic. What SMW excels at are queries connecting one page to another. For example, if I create wiki pages for publications, and list the author of each publication, the page for an author can contain a simple query (of the form {{ #ask: [[maker::Author 1]] | }}) that lists all the publications of that author:
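For example, fleshing out that inline query with printout statements (the category and property names here are hypothetical) gives a table of an author's publications:

```wiki
{{#ask: [[Category:Publication]] [[maker::Author 1]]
 | ?Published in
 | ?Year
 | sort=Year
 | format=table
}}
```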



That's great, and many of the things I want to do can be expressed in this simple way (i.e., find all pages of a certain kind that link to the current page). It's when I want to go more than one page away that things start to go pear-shaped. For example, for a given taxon I want to display a map of where it is found, based on things like georeferenced GenBank sequences, or sequences from georeferenced museum specimens.

This can involve a path between several entities ("pages" in SMW), and this just doesn't seem to be possible. I've managed to get some queries working via baroque Mediawiki templates, but it's getting to the point where hours seem to be wasted trying to figure this stuff out.

So, what tends to happen is I develop something, hit this brick wall, then go and do something else. I suspect that one way forward is to use SMW as the tool to edit data and display basic links, then use another tool to do the deeper queries. This is a bit like what I'm exploring with BioStor, where I've written my own interface and queries over a simple MySQL database, but I'm looking into SMW + Wikisource to handle annotation.

This leaves the question of how to move forward with http://iphylo.org/treebase/. One approach would be to harvest the SMW pages regularly (e.g., by consuming the RSS feed and either pulling off the SMW RDF or parsing the template source code for the pages), and use this to populate a database (say, a key-value store or a triple store) where more sophisticated queries can be developed. I guess one could either make this a separate interface, or develop it as an SMW extension, so results could be viewed within the wiki pages. Both approaches have merits. Having a completely separate tool that harvests the phylogeny wiki seems attractive, and in many ways is an obvious direction for iSpecies to take. Imagine iSpecies rebuilt as an RDF aggregator, where all manner of data about a taxon (or sequence, or specimen, or publication, or person) could be displayed in one place, but the task of data cleaning took place elsewhere.
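As a rough sketch of that harvesting architecture (the feed contents, page names, and in-memory store are all made up for illustration), the flow might look like:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy sketch: poll the wiki's RSS feed for changed pages, then file simple
# (subject, predicate, object) triples into a store where multi-hop path
# queries are easy. Property names and feed structure are hypothetical.

SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Sequence:AB123456</title></item>
  <item><title>Specimen:FMNH 12345</title></item>
</channel></rss>"""

def changed_pages(rss_xml):
    """Return the titles of pages mentioned in the feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

class TripleStore:
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)

    def objects(self, s, p):
        return self.spo[s][p]

store = TripleStore()
for page in changed_pages(SAMPLE_RSS):
    # A real harvester would fetch each page's SMW RDF export here;
    # for the sketch we just record that the page exists.
    store.add(page, "rdf:type", page.split(":")[0])

# The payoff: a two-hop "path" query of the kind SMW makes hard,
# e.g. finding localities via sequence -> specimen -> locality links.
store.add("Sequence:AB123456", "fromSpecimen", "Specimen:FMNH 12345")
store.add("Specimen:FMNH 12345", "locality", "-17.5,145.9")
localities = {loc
              for spec in store.objects("Sequence:AB123456", "fromSpecimen")
              for loc in store.objects(spec, "locality")}
```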

Food for thought. And, given that it seems some people are wondering why on earth I bother with this stuff, and can't I just finish TreeView X, I could always go back to fussing with C++…