Data matters but do data sets?

Interest in archiving data and data publication is growing, as evidenced by projects such as Dryad, and earlier tools such as TreeBASE. But I can't help wondering whether this is a little misguided. I think the issues are granularity and reuse.

Taking the second issue first, how much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses.

Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much").

But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would need to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

To me, citing data sets makes almost as much sense as citing journal volumes - the level of granularity is wrong. Journal volumes are largely arbitrary collections of articles, it's the articles that are the typical unit of citation. Likewise I think sequences will be cited more often than alignments.

It might be argued that there are disciplines where the dataset is the sensible unit, such as an ecological study of a particular species. Such a data set may lack obvious subsets, and hence it makes sense to cite it as a unit. But my expectation here is that such datasets will see limited re-use, for the very reason that they can't be easily partitioned and mashed up. Data sets such as alignments are built from smaller, reusable units of data (i.e., sequences) that can be recombined, trimmed, or merged, and hence can readily be re-used. Monolithic datasets with largely unique content can't be easily mashed up with other data.

Hence, my suspicion is that many data sets in digital archives will gather digital dust, and anyone submitting a data set in the expectation that it will be cited may turn out to be disappointed.

Mendeley and Web Hooks

Quick, poorly thought out idea. I've argued before that Mendeley seems the obvious tool to build a "bibliography of life." It has pretty much all the features we need: nice editing tools, support for DOIs, PubMed identifiers, social networking, etc.

But there's one thing it lacks. There's not an easy way to transmit updates from Mendeley to another database. There are RSS feeds for groups, such as this one for the "Museum Type Catalogues" group, but that just lists recently added articles. What if I edit an article, say by correcting the authorship, or adding a DOI? How can I get those edits into databases downstream?

One way would be if Mendeley provided RSS feeds for each article, and these feeds would list the edits made to that article. But polling thousands of individual RSS feeds would be a hassle. Perhaps we could have a user-level RSS feed of edits made?
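To make the idea concrete, here's a sketch of what consuming such a user-level feed of edits might look like, using only the Python standard library. The feed structure below is entirely an assumption (Mendeley doesn't publish a feed like this), so the item fields are hypothetical:

```python
# Sketch: parsing a hypothetical per-user "edits" RSS feed.
# The feed format below is an assumption, not a real Mendeley feature.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>My Mendeley edits</title>
    <item>
      <title>Edited: Smith 2009</title>
      <guid>mendeley:doc:1234</guid>
      <description>Added DOI 10.1000/example</description>
    </item>
  </channel>
</rss>"""

def list_edits(feed_xml):
    """Return (guid, description) pairs, one per edit item in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("guid"), item.findtext("description"))
            for item in root.iter("item")]

for guid, desc in list_edits(SAMPLE_FEED):
    print(guid, desc)
```

A downstream database could poll one such feed per user and pick up every edit, rather than polling thousands of per-article feeds.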

But another way to do this would be with web hooks, which I explored earlier in connection with updating literature within a taxonomic database. The idea is as follows:
  1. I have a taxonomic database that contains literature. It also has a web hook where I can tell the database that a record has been edited elsewhere.
  2. I edit my Mendeley library using the desktop client.
  3. When I've finished all the edits I've made (e.g., DOIs added, etc.), the web hook is automatically called and the taxonomic database notified of the edits.
  4. The taxonomic database processes the edits, and if it accepts them it updates its own records.

Several things are needed to make this work. We need to be able to talk about the same record in the taxonomic database and in Mendeley, which means either the database stores the Mendeley identifier, or vice versa, or both. We also need a way to find all the recent edits made in Mendeley. Given that the Mendeley database is stored locally as a SQLite database, one simple hack would be to write a script that was called at a set time, determined which records had been changed (records in the Mendeley SQLite database are timestamped) and sent those to the web hook. If we're clever, we may even be able to automate this by calling the script when Mendeley quits (depending on how scriptable the operating system and application are).
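The polling script itself could be quite small. The sketch below finds rows modified since the last run and posts them to a web hook; note that the table and column names ("Documents", "modified") are guesses at the local schema, and the web hook URL is a placeholder — check the actual Mendeley SQLite database before relying on any of this:

```python
# Sketch: poll a local SQLite database for recently edited records
# and send them to a web hook. Schema names are assumptions.
import json
import sqlite3
import urllib.request

def recent_edits(conn, since):
    """Return (id, title, modified) rows changed after `since` (epoch seconds)."""
    cur = conn.execute(
        "SELECT id, title, modified FROM Documents WHERE modified > ?",
        (since,))
    return cur.fetchall()

def notify_webhook(url, rows):
    """POST the edited rows to the web hook as JSON."""
    body = json.dumps({"edits": [
        {"id": r[0], "title": r[1], "modified": r[2]} for r in rows
    ]}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

# Demo with an in-memory database standing in for Mendeley's:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Documents (id INTEGER, title TEXT, modified INTEGER)")
conn.execute("INSERT INTO Documents VALUES (1, 'Old record', 100)")
conn.execute("INSERT INTO Documents VALUES (2, 'Freshly edited', 200)")
print(recent_edits(conn, 150))  # only the record edited since the last run
```

The script would remember the timestamp of its last run (in a small state file, say) and pass that as `since` on each invocation.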

Of course, what would be even better is if the Mendeley application had this feature built in. You supply one or more web hook URLs that Mendeley will call, say after any edits have been synchronised with your Mendeley database in the cloud. More and more I think we need to focus on how we join all these tools and databases together, and web hooks look like being the obvious candidate.

Paper on NCBI and Wikipedia published in PLoS Currents: Tree of Life

My paper describing the mapping between NCBI and Wikipedia has been published in PLoS Currents: Tree of Life. You can see the paper here. It's only just gone live, so it's yet to get a PubMed Central number (one of the nice features of PLoS Currents is that the articles get archived in PMC).

Publishing in PLoS Currents: Tree of Life was a pleasant experience. The Google Knol editing environment was easy to use, and the reviewing process quick. It's obviously a new and rather experimental journal, and there are a few things that could be improved. Automatically looking up articles by PubMed identifier is nice, but it would be great to do this for DOIs as well. Furthermore, the PubMed identifiers aren't displayed as clickable links, which rather defeats the point of having references on the web (I've added DOI links to the articles wherever possible). But, minor grumbles aside, as a way to get an Open Access article published for free, and have it archived in PubMed Central, PLoS Currents is hard to beat. What will be interesting is whether the article receives any comments. This seems to be one area online journals haven't really cracked — providing an environment where people want to engage in discussion.