ChrisFreeland.com: #ebio09, silverbacks, & haiku
Chris Freeland has written a thoughtful summary of his experiences of the two-day closed session to create a road map for biodiversity informatics, entitled #ebio09, silverbacks, & haiku.
Labels: e-Biosphere, ebio09, twitter
Taxonomy on a hard disk
This post is likely to seem somewhat off the wall, given the rush to get everything into the cloud, but it's Friday, so let's give it a whirl.
One idea I've been toying with is dispensing with relational databases, wikis, etc. and just storing taxonomic data using files and folders on a disk. There are several reasons for this:
- The file system naturally enforces hierarchy
- There are existing systems for putting files and folders under version control (e.g., CVS, Subversion, git)
- Native text and image editors handily beat web-based ones
- Some file systems have great tools for searching on metadata (e.g., "smart folders" and Spotlight on Mac OS X)
- Some of the visualisations that we would like for classifications (such as treemaps) already exist in very polished form for viewing file systems
So, in some ways this probably sounds silly. It closely resembles the naive way many of us started making digital versions of taxonomies, and it will have many database people rolling their eyes and muttering about "data consistency" and "queries". But a key thing to remember is that the file system is a database that resides under a graphical user interface, and it maintains some forms of consistency that classical relational databases are poor at handling. For example, file systems enforce hierarchical consistency (if I move a folder to another folder, all the files and folders below that folder move as well). Of course, we can program this with a relational database, but our track record in doing this is pretty miserable. I've found inconsistencies in versions of ITIS (haven't checked recently), and last year's Catalogue of Life database had all sorts of orphans lurking in the tree table.
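To make the "orphans" point concrete, here is a minimal sketch of the check that a parent-pointer table needs and a folder tree does not. The table layout (a mapping from node id to parent id, with None for the root) is an assumption for illustration, not the actual ITIS or Catalogue of Life schema.

```python
from typing import Dict, List, Optional


def find_orphans(parent_of: Dict[int, Optional[int]]) -> List[int]:
    """Return ids of rows whose parent id is not itself a row in the table."""
    return sorted(
        node
        for node, parent in parent_of.items()
        if parent is not None and parent not in parent_of
    )


# Hypothetical tree table: node 4 points at a parent (99) that no longer exists.
tree_table = {1: None, 2: 1, 3: 2, 4: 99}
print(find_orphans(tree_table))  # → [4]
```

On disk this state does not normally arise, because a folder can only exist inside another folder that exists.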
Then there's the GUI. If I write a taxonomic database in the classical way, I need to write code to talk to the database, edit records, support user authentication, data versioning, etc. If I use the file system, I get this pretty much for free. Authentication? It's called the login screen. Versioning? I put it in a public repository like Google Code or github, and that takes care of that (plus I get online authentication for free). Editing? Well, I can drag and drop items onto folders, and I can open them in native editors.
What I envisage is replicating a taxonomic hierarchy on disk, and representing key-value pairs of attributes (such as taxon name authorship, bibliographic details) as text files where the name of the file is the key (e.g., publishedIn) and the content of the file is the value (e.g., doi:10.1590/S0101-81752005000300004). I could add images and PDFs, and the neat thing is that they have lots of useful metadata embedded inside (where, arguably, it belongs).
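As a rough sketch of that layout, the Python below writes and reads attributes where the file name is the key and the file content is the value. The function names and the folder names ExampleGenus/examplespecies are hypothetical; only the publishedIn key and the DOI come from the text above.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_attribute(taxon_dir: Path, key: str, value: str) -> None:
    """Store one attribute: the file name is the key, the content is the value."""
    taxon_dir.mkdir(parents=True, exist_ok=True)
    (taxon_dir / key).write_text(value + "\n", encoding="utf-8")


def read_attributes(taxon_dir: Path) -> Dict[str, str]:
    """Read every regular file in a taxon folder back into a key-value dict."""
    return {
        entry.name: entry.read_text(encoding="utf-8").strip()
        for entry in sorted(taxon_dir.iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical taxon folder; a real tree would use actual names.
        taxon = Path(root) / "ExampleGenus" / "examplespecies"
        write_attribute(taxon, "publishedIn", "doi:10.1590/S0101-81752005000300004")
        print(read_attributes(taxon))
```

This sketch assumes every file in the folder is UTF-8 text; images and PDFs would need to be skipped or handled separately.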
I'm also toying with the idea of using symbolic links (Windows users, look away now) to represent relationships such as basionym links to original names.
This is all a bit half-baked at present, but it seems worth pursuing. One could argue that having a full taxonomic hierarchy is overkill (and raises the issue of which one to use), but binomial names are themselves hierarchical (species epithet nested inside genus name), so we need some degree of hierarchy anyway. I like the idea that copying a folder called "behreae" in the folder "Pinnixa" and placing the copy under "Austinixa", then within Austinixa/behreae adding a symbolic link to Pinnixa/behreae, pretty much takes care of synonymy. I also like the idea that one could download an entire taxonomy and, using just the native tools on your computer, edit and annotate it, then merge changes with a remote copy. It makes managing the data little different from writing code.
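The copy-then-link step could be scripted along these lines. This is a sketch under assumptions: the function names and the link name "basionym" are my own choices, it targets POSIX-style symbolic links, and it assumes the original folder does not already contain an entry called "basionym".

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def transfer_species(root: Path, epithet: str, old_genus: str, new_genus: str) -> Path:
    """Copy old_genus/epithet to new_genus/epithet and link back to the original."""
    old = root / old_genus / epithet
    new = root / new_genus / epithet
    shutil.copytree(old, new)  # fails if the destination already exists
    # Use a relative target so the link survives moving or cloning the whole tree.
    target = os.path.relpath(old, start=new)
    (new / "basionym").symlink_to(target, target_is_directory=True)
    return new


def basionym_of(taxon_dir: Path) -> Optional[Path]:
    """Follow the 'basionym' link, if the taxon folder has one."""
    link = taxon_dir / "basionym"
    return link.resolve() if link.is_symlink() else None
```

Calling transfer_species(root, "behreae", "Pinnixa", "Austinixa") would leave Pinnixa/behreae in place and create Austinixa/behreae/basionym pointing at ../../Pinnixa/behreae.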
In practice we'll want to add some things. It would be nice to have a web interface for browsing, but this could be as trivial as having a script that reads the contents of a folder, displays folders as HTML links, and lists the files (keys) and their contents (values) in the web page.
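A minimal sketch of such a script, assuming Python and nothing beyond the standard library, might look like this. The function names and the HTML layout are illustrative choices; it only builds the page for one folder and leaves serving it to whatever web server is at hand.

```python
import html
from pathlib import Path
from urllib.parse import quote


def read_value(path: Path) -> str:
    """Return a file's text, or a placeholder for images, PDFs and other binaries."""
    try:
        return path.read_text(encoding="utf-8").strip()
    except UnicodeDecodeError:
        return "(binary file)"


def render_folder(folder: Path) -> str:
    """Render subfolders as links and files as key-value pairs."""
    entries = sorted(folder.iterdir(), key=lambda p: p.name)
    links = "".join(
        '<li><a href="{}/">{}</a></li>'.format(quote(e.name), html.escape(e.name))
        for e in entries
        if e.is_dir()
    )
    pairs = "".join(
        "<dt>{}</dt><dd>{}</dd>".format(html.escape(e.name), html.escape(read_value(e)))
        for e in entries
        if e.is_file()
    )
    title = html.escape(folder.name)
    return "<h1>{}</h1><ul>{}</ul><dl>{}</dl>".format(title, links, pairs)
```

Escaping names and values keeps odd characters in file names or file contents from breaking the page.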
Perhaps this is a little silly, but I like the idea of having data on my machine that is trivially easy to edit. I also like the idea of getting functionality for free, rather than having to invent it from scratch.
Labels: crazy, database, filesystem, taxonomy
e-Biosphere '09: Twitter rules, and all that

So, e-Biosphere '09 is over (at least for the plebs like me; the grown-ups get to spend two days charting the future of biodiversity informatics). It was an interesting event on several levels. It's late, and I'm shattered, so this post will cover only a few things.
This was the first conference I'd attended where some of the participants twittered during proceedings. A bunch of us settled on the hashtag #ebio09 (you can also see the tweets at search.twitter.com). For the uninitiated, a "hashtag" is a string preceded by a hash symbol (#), to indicate that it is a tag, such as #fail. It provides a simple way to tag tweets so that others interested in that topic can find them.
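For anyone who wants to pull the tags out of a pile of tweets, a rough Python sketch follows. The pattern is a simplification (a hash followed by letters, digits or underscores), not Twitter's actual hashtag rules, and the function name is made up for the example.

```python
import re
from typing import List

# Simplified pattern: '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def hashtags(tweet: str) -> List[str]:
    """Return the hashtags in a tweet, in order of appearance."""
    return HASHTAG.findall(tweet)


tweet = "lime banana chocolote cheescake = worst dessert ever #ebio09 #dessert #fail"
print(hashtags(tweet))  # → ['#ebio09', '#dessert', '#fail']
```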
Twittering created a whole additional layer to the conference. We were able to:
- Moan about the appallingly bad wifi @Acronema: Using wifi here at #ebio09 is like wading through treacle!
- Moan about the food (which in general was good) @peetucket: lime banana chocolote cheescake = worst dessert ever #ebio09 #dessert #fail
- Embellish the presentations with links to related material @rdmpage: Vishwas Chavan ZooKeys example of publishing data with paper, see http://dx.doi.org/10.3897/zookeys.11.210 #ebio09
- Engage with people outside the room @jatorre @rdmpage thanks Rod.For those of us who are on the booths is great to get your live reports :) #ebio09, or indeed on the other side of the planet @kejames: @Jim_Croft Re. blogger bioblitz google group's 'new members': I noticed too and have emailed Joel Sachs who promoted the site 2day. #ebio09
Twitter greatly enhanced the conversation, noticeably when a speaker said something controversial (all too rare, sadly), or when a group rapporteur's summary didn't reflect all the views in that group. It also helped document what was going on, and this can be further exploited. For fun, I grabbed tweets from days 2 and 3 and made a wordle:
As @edwbaker noted: @rdmpage The size of 'together', 'people' & 'visionary' is somewhat telling...... In case you're wondering about the prominence of "Knowlton", it's because Nancy Knowlton gave a nice talk highlighting the ever-increasing number of cases where we have no names for the things we are encountering (for example, when barcoding fresh samples from poorly studied environments). This is just one example of the huge disconnect between the obsession with taxonomic names in biodiversity informatics and the reality of metagenomics and DNA barcoding. Just as worrying is the lack of resemblance between the taxonomic classification used by the Encyclopedia of Life and our notion of the evolutionary tree of those organisms. A systematist would find much of EOL's classification laughable. I don't want to bash EOL, but it's worrying that they can continue to crank out press releases, yet fail to provide something like a modern classification.
But I digress. In many ways this was less of a scientific conference and more of an event to birth a discipline, namely "biodiversity informatics" (which I'm sure some would claim has been around for quite a while). So, the event was to attract attention to the topic, and assure the outside world (and those attending) that the field exists and has something to say. It was also billed as a forum to discuss strategies for its future. Sadly, much of this discussion will take place behind closed doors, and will feature the major players who bring money and influence (but not much innovation) to the table.
Symptomatic of this lack of innovation, in a sense, was the contrast between the official "Online Conference Community" and the twitter feed. When I asked if anybody on twitter had used the official forum, @fak3r replied tellingly: @rdmpage thought we were on it ;) #ebio09. As fun as it is to use the new hotness to conduct a parallel (and slightly subversive) discussion at a conference, it's worrying that, in a field that calls itself "informatics", the big beasts probably had little idea what was going on. If we are going to exploit the tools the web provides, we need people who "get it", and I'm unconvinced that the big players in this area truly grasp the web (in all its forms). There's also a worrying degree of physics envy, which might be cured by reading The Unreasonable Effectiveness of Data (doi:10.1109/mis.2009.36).
I tried to stir things up a little (almost literally, as captured in this photo by Chris Freeland) with a couple of questions, but to not much effect (other than apparently driving to despair the poor chap behind me).

But enough grumbling. It was great to see lots of people attending the event, there were lots of interesting posters and booths (creating a market for this field may go some way towards providing an incentive to provide better, more reliable services), and my challenge entry won joint first prize, so perhaps I should sit back, enjoy the wine Joel Sachs chose as the prize (many thanks for his efforts in putting the challenge event together), and let others say what they thought of the meeting.
Labels: Challenge, conference, e-Biosphere, twitter