In defence of OpenURL: making bibliographic metadata hackable

This is not a post I thought I'd write, because OpenURL is an awful spec. But last week I ended up in a vigorous debate on Twitter after I posted what I thought was a casual remark:

If you publish bibliographic data and don't use COinS you are doing it wrong

This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS).

This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users.

Ed wrote:

I prefer to encourage publishers to use HTML's metadata facilities using the <meta> tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done.

That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines:
[Figure: Metadata1]
Tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy).
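To make this concrete, here is a minimal sketch of emitting the Highwire Press-style meta tags that Google Scholar indexes (tag names such as citation_title and citation_author are the ones Google Scholar documents; the function itself and its input dictionary are hypothetical):

```python
from html import escape

def citation_meta_tags(article):
    """Render Highwire Press-style <meta> tags for a bibliographic record."""
    tags = []

    def tag(name, value):
        tags.append('<meta name="{}" content="{}">'.format(
            name, escape(str(value), quote=True)))

    tag("citation_title", article["title"])
    for author in article.get("authors", []):
        tag("citation_author", author)  # one tag per author
    if "journal" in article:
        tag("citation_journal_title", article["journal"])
    if "year" in article:
        tag("citation_publication_date", article["year"])
    if "doi" in article:
        tag("citation_doi", article["doi"])
    return "\n".join(tags)
```

Dropping the resulting tags into a page's <head> is all a repository needs to do to become legible to Scholar-style crawlers.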

But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helped me find those links.

If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do. Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself.
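The hack itself is simple. A COinS is just a <span> with class "Z3988" whose title attribute holds a URL-encoded OpenURL context object, so a client-side tool only has to collect those spans and prepend a resolver's base URL. A sketch (the resolver URL here is a placeholder; point it at your own link server):

```python
from html.parser import HTMLParser

class COinSParser(HTMLParser):
    """Collect the title attribute of every <span class="Z3988">."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "Z3988" in a.get("class", "").split():
            self.context_objects.append(a.get("title", ""))

def openurl_links(html_text, resolver="https://example.org/openurl"):
    # The title attribute is already a URL-encoded key/value string
    # (ctx_ver=Z39.88-2004&rft.atitle=...), so building a resolver
    # link is just concatenation.
    p = COinSParser()
    p.feed(html_text)
    return [resolver + "?" + co for co in p.context_objects]
```

This is essentially what a Greasemonkey-style script or browser extension does: it never needs the publisher's cooperation beyond the embedded span.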

[Figure: Metadata2]
This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI.

So, what I found frustrating about the conversation with Ed was that his insistence on following certain standards seemed to come at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user, and accept that users may want to do stuff that you don't (indeed can't) anticipate, then COinS are useful. This is the "genius of and": why not support both approaches?

Now, COinS are not the only way to implement what I want to do; we could imagine other approaches. But to support the functionality they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now, and they work. I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet:

If you publish bibliographic data and don't use COinS you are doing it wrong

Bibliographic metadata pollution

I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:

[Figure: Google search results]

If I search for Die cestoden der Vogel in Google I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, other booksellers, etc.). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself.

One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications).

As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But in the meantime, if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff".
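The clustering Alan describes need not be exotic. A crude first pass is to group records whose accent-stripped, punctuation-free titles match exactly; real systems would layer identifiers (ISBN, DOI) and fuzzy matching on top. A hypothetical sketch:

```python
import re
import unicodedata
from collections import defaultdict

def normalise(title):
    # Strip accents, lower-case, and collapse punctuation/whitespace,
    # so "Die Cestoden der Vögel" and "Die cestoden der Vogel." agree.
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", t.lower()).strip()

def cluster_by_title(records):
    # Group records whose normalised titles match exactly.
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalise(rec["title"])].append(rec)
    return list(clusters.values())
```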

On Names Attribution, Rights, and Licensing of taxonomic names

Few things have annoyed me as much as the following post on TAXACOM:

The Global Names project will host a workshop to explore options and to make recommendations as to issues that relate to Attribution, Rights and Licensing of names and compilations of names. The aim of the workshop is a report that clarifies if and how we share names.

We seek submissions from all interested parties - nomenclaturalists, taxonomists, aggregators, and users of names. Let us know what (you think) intellectual property rights apply or what rights should be associated with names and compilations of names. How can those who compile names get useful attribution for names, and what responsibilities do they have to ensure that information is authoritative. If there are rights, what kind of licensing is appropriate.

Contributions can be submitted http://names-attribution-rights-and-licensing.wikia.com/wiki/Main_Page, where you will find more information about this event.

I'm trying to work out why this seemingly innocuous post made me so mad. I think it's because it fundamentally frames the question the wrong way. Surely the goal is to have a list of names that is global in scope, well documented, and freely usable by all without restriction? Surely we want open and free access to fundamental biodiversity data? In which case, can we please stop having meetings and get on with making this so?

If you frame the discussion as one of "Attribution, Rights and Licensing of names and compilations of names" then you've already lost sight of the prize. You've focussed on the presumed "rights" of name compilers instead.

I would argue that name compilations are somewhat overvalued. They are basically lists of names, sometimes (all too rarely) with some degree of provenance (e.g., a citation to the original use of the name). As I've documented before (e.g., More fictional taxa and the myth of the expert taxonomic database and Fictional taxa), entirely fictional taxa can end up in taxonomic databases with alarming ease. So any claims that these are expert-curated lists should be taken with a pinch of salt.

Furthermore, it is increasingly easy to automate building these lists, given that we have tools for finding names in text, and an ever expanding volume of digitised text becoming available. Indeed, in an ideal world where all taxonomic literature was digitised much of the rationale for taxonomic name databases would disappear (in the same way that library card catalogues are irrelevant in the age of Google). We are fast approaching the point where we can do better than experts. To give just one example, in a recent BHL interview with Gary Poore it was stated that:

For example, the widely used name Pentastomida itself was attributed to Diesing, 1836, but the word did not appear in the literature until 1905.


A quick check of Google Ngrams shows this to be simply false:

[Figure: Google Ngram for "Pentastomida"]

I don't need taxonomic expertise to see this, I simply need decent text indexing. So, if you have a list of names, you have something that it will soon be largely possible to recreate using automated methods (i.e., text mining). With a little sophistication we could mine the literature for further details, such as synonymy, etc. Annotation and clarification of a few "edge cases" where things get tricky will always be needed, but if you want to argue that your list deserves "Attribution, Rights and Licensing" then you fail to realise that your list is going to be increasingly easy to recreate simply by crawling the web.
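To give a flavour of how little machinery name-finding needs, here is a deliberately naive sketch: a regex for Latin binomials (capitalised genus, lower-case epithet) filtered against a genus whitelist to cut false positives. Real name-finding tools (such as those in the Global Names project) handle far more: abbreviated genera, authorities, OCR errors. Everything here is illustrative:

```python
import re

# Capitalised word followed by a lower-case word of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_candidate_names(text, known_genera):
    """Return binomials whose genus appears in a whitelist.

    The whitelist step filters out sentence-initial words like
    'The pentastomid', which also match the pattern.
    """
    hits = set()
    for genus, epithet in BINOMIAL.findall(text):
        if genus in known_genera:
            hits.add("{} {}".format(genus, epithet))
    return sorted(hits)
```

Run over an ever-growing corpus of digitised literature, even a toy like this starts to reconstruct the raw material of a names list.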

It seems to me that most taxonomic databases are little more than digitised 5x3 index cards, and lack any details on the provenance of the names they contain. They often don't have links to the primary literature, and if they do cite that literature they typically do so in a way that makes it hard to find the actual publication. I once gave a talk which included the slide below showing taxonomic databases as being "in the way" between taxonomists and users of taxonomic information:

[Slide: taxonomic databases "in the way" between taxonomists and users]

In the old days building taxonomic databases required expertise and access to obscure, hard to find, physical literature. A catalogue of names was a way to summarise that information (since we couldn't share access). Now we are in an age where more and more primary taxonomic information is available to all, which removes most of the rationale for taxonomic databases. Users can go directly to taxonomic information themselves, which means they can get the "good stuff", and maybe even cite it (giving us provenance and credit, which I regard as basically the same thing). In many ways taxonomic databases are transitional phenomena (like phone directories; remember those?), and one could argue they are now in the way of the taxonomists' Holy Grail, getting their work cited.

Lastly, any discussion of "Attribution, Rights and Licensing of names and compilations of names" reflects one of the great self-inflicted wounds of biodiversity informatics, namely the reluctance to freely share data. As we speak, terabytes of genomics data are whizzing around the planet, and people are downloading entire copies of GenBank and creating new databases. All of this without people fussing over "Attribution, Rights and Licensing." It's time for taxonomic databases to get over themselves and focus on making biodiversity data as accessible and available as genomics data.