Search this keyword

GBIF and Handles: admitting that "distributed" begets "centralized"

The problem with this ... is that my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.).
--Geoffrey Bilder interviewed by Martin Fenner

Thinking about the GUID mess in biodiversity informatics, stumbling across some documents about the PILIN (Persistent Identifier Linking INfrastructure) project, and still smarting from problems getting hold of specimen data, I thought I'd try and articulate one solution.

Firstly, I think biodiversity informatics has made the same mistake as digital librarians in thinking that people care where the get information from. We don't, in the sense that I don't care whether I get the information from Google or my local library, I just want the information. In this context local is irrelevant. Nor do I care about individual collections. I care about particular taxa, or particular areas, but not collections (likewise, I may care about philosophy, but not philosophy books at Glasgow University Library). I think the concern for local has lead to an emphasis on providing complex software to each data provider that supports operations (such as search) that don't scale (live federated search simply doesn't work), at the expense of focussing on simple solutions that are easy to use.

In a (no doubt unsuccessful) attempt to think beyond what I want, let's imagine we have several people/organisations with interests in this area. For example:

Imagine I am an occasional user. I see a specimen referred to, say a holotype, I want to learn more about that specimen. Is there some identifier I can use to find out more. I'm used to using DOIs to retrieve papers, what about specimens. So, I want:
  1. identifiers for specimens so I can retrieve more information

Imagine I am a publisher (which can be anything from a major commercial publisher to a blogger). I want to make my content more useful to my readers, and I've noticed that other's are doing this so I better get onboard. But I don't want to clutter my content with fragile links -- and if a link breaks I want it fixed, or I want a cached copy (hence the use of WebCite by some publishers). If I want a link fixed I don't want to have to chase up individual providers, I want one place to go (as I do for references if a DOI breaks). So, I want:
  1. stable links with some guarantee of persistence
  2. somebody who will take responsibility to fix the broken ones

Imagine I am a data provider. I want to make my data available, but I want something simple to put in place (I have better things to do with my time, and my IT department keep a tight grip on the servers). I would also like to be able to show my masters that this is a good thing to do, for example by being able to present statistics on how many times my data has been accessed. I'd like identifiers that are meaningful to me (maybe carry some local "branding"). I might not be so keen on some central agency serving all my data as if it was theirs. So, I want
  1. simplicity
  2. option to serve my own data with my own identifiers

Imagine I am an power user. I want lots of data, maybe grouped in ways that the data providers hadn't anticipated. I'm in a hurry, so I want to get this stuff quickly. So I want:
  1. convenient, fast APIs to fetch data
  2. flexible search interfaces would be nice, but I may just download it myself because it's probably quicker if I do it myself

Imagine I am an aggregator. I want data providers to have a simple harvesting interface so that I can grab the data. I don't need a search interface to their data because I can do it much faster if I have the data locally (federated search sucks). So I want:
  1. the ability to harvest all the data ("all your data are belong to me")
  2. a simple way to update my copy of provider's data when it changes

It's too late in the evening for me to do this justice, but I think a reasonable solution is this:
  1. Individual data providers serve their data via URLs, ideally serving a combination of HTML and RDF (i.e., linked data), but XML would be OK
  2. Each record (e.g., specimen) has an identifier that is locally unique, and the identifier is resolvable (for example, by simply appending it to a URL)
  3. Each data provider is encouraged to reuse existing GUIDs wherever possible, (e.g., for literature (DOIs) and taxonomic names) to make their data "meshable"
  4. Data provider can be harvested, either completey, or for records modified after a given date
  5. A central aggregator (e.g., GBIF) aggregates all specimen/observation data. It uses Handles (or DOIs) to create GUIDs, comprising a naming authority (one for each data provider), and an identifier (supplied by the data provider, may carry branding, e.g. "antweb:casent0100367"), so an example would be "hdl:1234567/antweb:casent0100367" or "doi:10.1234/antweb:casent0100367". Note that this avoids labeling these GUIDs as, say,
  6. Handles resolve to data provider URL, but cached aggregator copy of metadata may be used if data provide is offline
  7. Publishers use "hdl:1234567/antweb:casent0100367" (i.e., authors use this when writing manuscripts), as they can harass central aggregator if they break
  8. Central aggregator is reponsible for generating reports to providers of how there data has been used, e.g. how many times "cited" in literaure

So, GBIF (for whoever steps up to the plate) would use handles (or DOIs). This gives them the tools to manage the identifiers, plus tells the world that we are serious about this. Publishers can trust that the links to millions of specimen records won't disappear. Providers don't have complex software to install, removing one barrier to making more data available.

I think it's time we made a serious effort to address these issues.