
Would you give me a grant? An experiment in Open Science

I would like to know what you think of a grant proposal I plan to submit to the UK Natural Environment Research Council at the end of the month. The proposal takes the notion of "dark taxa" explored in an earlier blog post and outlines three things I'd like to do:
  1. Quantify the extent of dark taxa (taxa in GenBank that don't have scientific names)
  2. Determine how many dark taxa are genuinely new species (as opposed to taxa that are known to science but simply haven't been labelled with their proper names)
  3. Explore what we can learn about a taxon's biology even if it lacks a scientific name (e.g., the "symbiome")

Given that I discuss most of my ideas on this blog, and deposit preprints in Nature Precedings before the corresponding manuscript is published, it seems a logical extension to make grant proposals open as well. So you can view the proposal on Google Docs, and add comments if you wish.



Any feedback or suggestions are welcome. Do you think this is fundable? Have I made a good case for the proposed research? Is it interesting, or is it obvious, or has it already been done? Let me know what you think.

ZooBank on CouchDB: UUIDs, replication, and embedding the literature in taxonomic databases

Last December I released a web site called Australian Faunal Directory on CouchDB, which was part of my ongoing exploration of how to build a simple yet useful database of taxonomic names. In particular, I want to link names directly to the primary taxonomic literature. No longer is it adequate to simply list names, or list names with mangled bibliographic details (I'm looking at you, Catalogue of Life). This is the 21st century, so I expect one click from name to literature, or at the most two (via, say, a DOI). Nothing else will cut it.

The Australian Faunal Directory (AFD) was an eye opener as it was the first serious use I'd made of CouchDB (now CouchBase). I'd played with replicating and forking data in 2010: Catalogue of Life and CouchDB, but the AFD project was bigger, and also inspired me to use web hooks to make the database editable. Suddenly this stuff started to look easy: no schema, simple web services, and tiny amounts of code.

ZooBank
So then my attention turned to ZooBank, which is "the official registry of Zoological Nomenclature, according to the International Commission on Zoological Nomenclature (ICZN)." ZooBank was proposed by Polaszek et al. (2005) in a short piece in Nature ("A universal register for animal names", doi:10.1038/437477a). By providing a registry of names for animals, it ultimately aims to help avoid embarrassing situations such as the example I recount in my paper on BioStor (doi:10.1186/1471-2105-12-187): a recent paper in Nature published the name Leviathan for an extinct sperm whale with a giant bite (doi:10.1038/nature09067), only for the authors to have to publish an erratum with a new name (doi:10.1038/nature09381) when it was discovered that Leviathan had already been used for an extinct mammoth.

ZooBank is developed and run by Rich Pyle, and has some nice features, such as RDF export (via LSIDs), but like most taxonomic databases it doesn't link directly to the literature. Where are the DOIs? Where are links to BHL? Where is the ability to add these links? And why is it almost entirely about fish? (OK, I know the answer to that one).

CouchDB
But the thing which really got me thinking about using CouchDB to create a version of ZooBank was Rich Pyle's vision of having a distributed ZooBank, and his insistence on using ugly UUIDs in ZooBank identifiers (e.g., urn:lsid:zoobank.org:act:6BBEF50E-76B4-42EF-97B1-7029DBCD8257). As much as they are ugly, Rich has always argued that they make distributed systems easy because you don't need a centralised system to assign unique identifiers.
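To make the argument concrete, here is a minimal Python sketch of minting such an identifier locally. It assumes a random (version 4) UUID and reuses the prefix from the example identifier above; how ZooBank itself mints UUIDs is not stated here, and the function names are made up for illustration.

```python
import uuid

# Prefix copied from the example ZooBank identifier quoted above.
LSID_PREFIX = "urn:lsid:zoobank.org:act:"


def new_lsid() -> str:
    """Mint an identifier locally from a random (version 4) UUID.

    No central registry is consulted; uniqueness rests on the
    vanishingly small chance of two random UUIDs colliding.
    """
    return LSID_PREFIX + str(uuid.uuid4()).upper()


def uuid_from_lsid(lsid: str) -> uuid.UUID:
    """Extract the UUID, i.e. everything after the last colon."""
    return uuid.UUID(lsid.rsplit(":", 1)[1])


example = "urn:lsid:zoobank.org:act:6BBEF50E-76B4-42EF-97B1-7029DBCD8257"
print(uuid_from_lsid(example))  # canonical lower-case form of the same UUID
print(new_lsid())               # a fresh identifier, different on every call
```

Two machines running this independently, with no connection to each other, would still be overwhelmingly unlikely to produce the same identifier, which is the property Rich is relying on.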

Anybody who has played with CouchDB will know that CouchDB uses UUIDs by default to create identifiers for database documents. It also excels at data synchronisation, and can run on platforms large and small (including mobile platforms such as Android and iOS). This means a database could be updated on an iPhone or iPad without an Internet connection, then the data could be synchronised with other databases. Indeed, I developed this CouchDB clone of ZooBank on my MacBook, then pointed it at CouchDB running on my server and within minutes had an exact copy of the database running on the server. This ease of replication, together with the joy of schema-less design, makes CouchDB seem an obvious fit for ZooBank.
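For readers who haven't tried it, replication is triggered by a single POST to CouchDB's `_replicate` endpoint. The sketch below only builds that request rather than sending it. The host names and the database name `zoobank` are placeholders, not the actual servers used for the demo.

```python
import json
import urllib.request


def build_replication_request(local_couch: str, source_db: str,
                              target_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST to CouchDB's /_replicate endpoint."""
    body = {
        "source": source_db,    # database on the local CouchDB instance
        "target": target_url,   # full URL of the remote database
        "create_target": True,  # ask CouchDB to create the target if missing
    }
    return urllib.request.Request(
        url=local_couch.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder hosts: a laptop instance pushing to a hypothetical server.
req = build_replication_request(
    "http://localhost:5984", "zoobank", "http://example.org:5984/zoobank"
)
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen() would actually start the replication.
```

Depending on how the server is configured, the request may also need credentials, particularly when `create_target` is set.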

Demo
You can see the ZooBank on CouchDB demo here. It's not a complete copy of ZooBank, but has most of it. I reuse the UUIDs issued by ZooBank, so that

http://zoobank.org:80/?uuid=6bbef50e-76b4-42ef-97b1-7029dbcd8257

becomes

http://iphylo.org/~rpage/zoobank/6bbef50e-76b4-42ef-97b1-7029dbcd8257
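The mapping between the two URLs is mechanical. As a rough illustration (the function name is made up, and this is not the code behind the demo), it could be written as:

```python
from urllib.parse import parse_qs, urlparse

# Base URL of the demo, taken from the example above.
DEMO_BASE = "http://iphylo.org/~rpage/zoobank/"


def demo_url(zoobank_url: str) -> str:
    """Map a ZooBank '?uuid=...' URL to the matching demo URL."""
    query = parse_qs(urlparse(zoobank_url).query)
    # parse_qs returns a list per key; take the first uuid and lower-case it.
    return DEMO_BASE + query["uuid"][0].lower()


print(demo_url("http://zoobank.org:80/?uuid=6bbef50e-76b4-42ef-97b1-7029dbcd8257"))
```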

As usual it's all a bit crude, but has some nice features, such as links to BHL content with a built-in article viewer I wrote for the AFD project.

What's next?
At present only a fraction of the ZooBank references have external links. I hope to add more in the next few days, using both automatic scripts and the web hook interface. The search interface needs work, and given that ZooBank is about nomenclature and not taxonomy, it might be useful to add a classification (say from the Catalogue of Life) so that users can navigate around the names (and get a sense of how many are *cough* fish).

At present to display a reference I do one of four things:
  1. If the reference is in BHL I use my article viewer
  2. If there is a freely available PDF online I display that using the Google Docs PDF viewer
  3. If 1 and 2 don't apply, but there is a DOI, then I resolve the DOI and display the result in an IFRAME (yuck)
  4. If none of 1-3 apply I display a blank rectangle
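The fallback order above can be sketched in a few lines of Python. The field names (`bhl_page_id`, `pdf_url`, `doi`) and the returned labels are hypothetical, chosen for illustration, and are not the names used in the actual database documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    # Hypothetical fields; the real CouchDB documents may be shaped differently.
    bhl_page_id: Optional[int] = None
    pdf_url: Optional[str] = None
    doi: Optional[str] = None


def choose_display(ref: Reference) -> str:
    """Return which viewer to use, trying the options in the order listed."""
    if ref.bhl_page_id is not None:
        return "bhl_viewer"              # 1. reference is in BHL
    if ref.pdf_url:
        return "google_docs_pdf_viewer"  # 2. freely available PDF
    if ref.doi:
        return "doi_iframe"              # 3. resolve the DOI in an IFRAME
    return "blank"                       # 4. nothing to show


print(choose_display(Reference(doi="10.1038/437477a")))
```

Putting the checks in a fixed order means a reference that has both a BHL link and a DOI always gets the richer BHL viewer.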

There are a couple of ways we could improve this. The first is to enhance the display of BHL content by making use of the structure of the source DjVu files. Another is to make use of the XML now being made available by the journal ZooKeys (see my blog post, and Pensoft's announcement that ZooKeys is now being archived by PubMed Central, complete with taxonomic markup). There are a lot of ZooKeys articles in ZooBank, so there's a lot of potential for embedding an article viewer that takes ZooKeys XML and redisplays it with taxonomic names and references as clickable links to other ZooBank content. That way we approach the point where taxonomic literature becomes a first class citizen of a taxonomic database.

What is the best way to measure academic outputs that aren't publications?

My institute is going through various reviews of staff performance and, frankly, I'm feeling somewhat vulnerable given my somewhat unorthodox (at least amongst my colleagues) approach to doing science. I spend way more time writing code, building databases and web sites, and blogging than writing papers and getting grants (although I have been known to do both).

So the issue becomes, how to demonstrate that coding, building websites, and ranting on my blog is a worthwhile thing to do? Now, I'm happy that what I do has value, but my happiness isn't the issue. It's convincing people who want to see papers in high impact journals and bums on seats in labs that there are other ways to generate scientific output, and that output can have value. I'm also concerned that a simplistic view of what constitutes valid outputs will stifle innovation, just at the time when traditional science publishing is undergoing a revolution.

So, I posted a question on Quora, "What is the best way to measure academic outputs that aren't publications?", where I wrote:
Usually we assess the quality of academic output using measures based on citations, either directly (how many papers have cited the paper?) or indirectly (is the paper published in a journal like Nature or Science that contains papers that on average get lots of citations, i.e. "impact factor"). But what of other outputs, such as web sites, databases, and software? These outputs often require considerable work, and can be widely used. What is the best way to measure those outputs?


There have been various approaches to measuring the impact of an article other than using citations, such as the number of article downloads, or the number of times an article has been bookmarked on a site such as Mendeley or CiteULike. But what of the coding, the database development, the web sites, and the blog posts? How can I show that these have value?

I guess there are two things here. One is the need to be able to compare across outputs, which is tricky (comparing citations across different disciplines is already hard); the other is the need to be able to compare within broadly similar outputs. Here are some quick thoughts:

Web sites
An obvious approach is to use Google Analytics to harvest information about page views and visitor numbers. The geographic origin of those visitors could be used to make a case for whether the research/data on that site is internationally relevant, although I suspect "internationally relevant" is a somewhat suspect notion. Most academic specialities are narrow, such that the person most interested in your research is likely living in a different country, hence by definition most research will be internationally "relevant".

The advantage of Google Analytics is that it is widely used, hence you could get comparative data and be able to show that your web site is more (or less) used than another site.

Code
The value of code is tricky, but tools like ohloh provide estimates of the effort and expense required to generate code for a project. For example, for my bioGUID code repository (which includes code for bioGUID and BioStor, as well as some third party code) ohloh's estimated cost is 87 person-years and $US 4,784,203. OK, silly numbers, but at least I can compare these with other projects (Drupal, for example, represents 153 years and $US 8,438,417 of investment).
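One quick sanity check on these numbers is to divide cost by effort. The sketch below uses only the figures quoted above. The conclusion it points to, that a flat salary is being multiplied by estimated effort, is my inference from the arithmetic rather than something taken from ohloh's documentation.

```python
# (person-years, estimated cost in US dollars), as quoted above.
estimates = {
    "bioGUID": (87, 4_784_203),
    "Drupal": (153, 8_438_417),
}


def cost_per_person_year(person_years: float, cost_usd: float) -> float:
    """Implied cost of one person-year for a given estimate."""
    return cost_usd / person_years


for name, (person_years, cost_usd) in estimates.items():
    rate = cost_per_person_year(person_years, cost_usd)
    print(f"{name}: about ${rate:,.0f} per person-year")

# Relative effort: how many times more person-years Drupal represents.
effort_ratio = estimates["Drupal"][0] / estimates["bioGUID"][0]
print(f"Drupal / bioGUID effort ratio: {effort_ratio:.2f}")
```

Both projects come out at roughly $55,000 per person-year, so the dollar figures add nothing that the person-years don't already say. The comparison between projects is really a comparison of estimated effort.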


Comparing across output categories will be challenging, especially as there is no obvious equivalent for citation (one reason why, if you develop software or a web site, it makes good sense to write a paper describing it; that worked for me). But perhaps download or article access statistics could provide a way to say "my web site is worth x publications". Note also that I'm not arguing that any of these measures is actually a good thing, just that if I'm going to be measured, and I have some say in how I'm measured, I'd like to suggest something sensible that others might actually buy.

So, please feel free to comment either here or on Quora. I need to put together some notes to make the case that people like me aren't just sitting drinking coffee, playing loud music, and tweeting without, you know, actually making stuff.