I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotide sequences and other genomics data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?
It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.
In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.
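The "is this in Wikipedia?" test is a judgement call, but the lookup itself can be automated. Here is a minimal sketch in Python, assuming the standard MediaWiki `action=query` API on the English Wikipedia. The function names and the User-Agent string are made up for illustration and are not part of any existing project code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: we ask the English Wikipedia; any MediaWiki api.php endpoint would do.
API = "https://en.wikipedia.org/w/api.php"


def query_url(title):
    """Build an action=query URL asking whether a page with this title exists."""
    params = {"action": "query", "titles": title, "redirects": 1, "format": "json"}
    return API + "?" + urlencode(params)


def page_exists(api_response):
    """Interpret a decoded action=query response.

    Missing pages are reported with a "missing" key, and malformed
    titles with an "invalid" key.
    """
    pages = api_response.get("query", {}).get("pages", {})
    return any("missing" not in p and "invalid" not in p for p in pages.values())


def in_wikipedia(title):
    """Hypothetical helper: True if Wikipedia already has a page for this topic."""
    request = Request(query_url(title), headers={"User-Agent": "wiki-scope-check/0.1 (example)"})
    with urlopen(request, timeout=10) as response:
        return page_exists(json.load(response))
```

A check like `in_wikipedia("Cytochrome b")` only says that a page exists, not that it is any good, so it is a first filter rather than a replacement for looking at the page.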
Setting up a local Wikisource
A little while ago I came across Wikisource, and it dawned on me that this is a model for BHL. To quote from the Wikisource web site:
Wikisource is an online library of free content publications, collected and maintained by our community. We now have 140,596 texts in the English language library. See our inclusion policy and help pages for information on getting started, and the community portal for ways you can contribute. Feel free to ask questions on the community discussion page, and to experiment in the sandbox.
Much of their content comes from the Internet Archive (as does BHL's), and Wikisource have developed extensions for Mediawiki to do some cool things, such as extract text and images from DjVu files. If you haven't come across DjVu before, it's a format designed to store scanned documents, and comes with some powerful open source tools for extracting images and OCR text. Wikisource can take a DjVu file, extract images, thumbnails and text, and create side-by-side displays where users can edit and correct OCR text:

So, like a fool, I decided to try and install some of these tools locally and see if I could do the same for some BHL content. That's when the "fun" started. Normally Mediawiki is pretty easy to set up. There are a few wrinkles because my servers live behind an institutional HTTP proxy, so I often need to tweak some code (such as the OpenID extension, which also needs a fix for PHP 5.3), but installing the extensions that underlie Wikisource wasn't quite so straightforward.
DjVu

The first step is supporting DjVu files in Mediawiki. This seems straightforward (see How to use DjVu with MediaWiki). First off you need the DjVu tools. I use Mac OS X, so I get these automatically if I install DjView. The tools reside in Applications/DjView.app/Contents/bin (you can see this folder if you Control+click on the DjView icon and choose "Show Package Contents"), so adding this path to the name of each DjVu tool Mediawiki needs takes care of that.
But I also need NetPbm, and now the pain starts. NetPbm won't build on Mac OS X, at least not out of the box on Snow Leopard. It makes assumptions about Unix that Mac OS X doesn't satisfy. After some compiler error messages concerning missing variables that I eventually traced to signal.h, I gave up and installed MacPorts, which has a working version of NetPbm. MacPorts installed fine, but it's a pain having multiple copies of the same tools, one in /usr/local, and one in /opt/local.
OK, now we can display DjVu files in Mediawiki. It's small victories like this which lead to overconfidence...
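To make the division of labour between the DjVu tools and NetPbm concrete, here is a rough Python sketch of the kind of command lines involved: `ddjvu` renders a page as PPM, NetPbm's `pnmtojpeg` converts it to JPEG, and `djvutxt` pulls out the OCR text. The two directory constants and the function names are assumptions for illustration. The sketch only builds the commands and does not run them, and it is not a copy of what Mediawiki does internally.

```python
import shlex
from pathlib import PurePosixPath

# Assumed install locations; adjust to wherever the tools live on your machine.
DJVU_BIN = PurePosixPath("/Applications/DjView.app/Contents/bin")  # DjVu tools bundled with DjView
NETPBM_BIN = PurePosixPath("/opt/local/bin")                       # NetPbm installed via MacPorts


def render_page_command(djvu_file, page, width, height, out_jpeg):
    """Shell pipeline that renders one DjVu page to a JPEG of the given size."""
    ddjvu = [
        str(DJVU_BIN / "ddjvu"),
        "-format=ppm",              # NetPbm-readable output, written to stdout
        f"-page={page}",
        f"-size={width}x{height}",
        djvu_file,
    ]
    pnmtojpeg = [str(NETPBM_BIN / "pnmtojpeg")]
    return f"{shlex.join(ddjvu)} | {shlex.join(pnmtojpeg)} > {shlex.quote(out_jpeg)}"


def page_text_command(djvu_file, page):
    """Command that prints the hidden OCR text layer of one page."""
    return shlex.join([str(DJVU_BIN / "djvutxt"), f"-page={page}", djvu_file])


if __name__ == "__main__":
    print(render_page_command("scan.djvu", 3, 800, 1200, "page3.jpg"))
    print(page_text_command("scan.djvu", 3))
```

Running the printed pipeline by hand in a terminal is a quick way to check that both sets of tools are installed and on the expected paths before involving Mediawiki at all.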
Proofread Page
Next comes the Proofread Page extension, which provides the editing functionality. This seemed fairly straightforward, although the documentation referred to a SQL file (ProofreadPage.sql) that doesn't seem to exist. More worryingly, the documentation also says:
If you want to install it on your own wiki, you will need to install a 404 handler for generating thumbnails, such as WebStore.
This seems fine, except the page for WebStore states:
The WebStore extension is needed by the ProofreadPage extension. Unfortunately, documentation seems to be missing completely. Please add anything you know about this extension here.
Then there are the numerous statements "doesn't work" scattered through the page. So, I installed the extension and hoped for the best. It didn't work. As in, really, really didn't work. It took an hour or so of potentially fatal levels of blood pressure-inducing frustration to get to the bottom of this.
WebStore
Now, WebStore is a clever idea. Basically, the Proofread Page extension will need thumbnails of images in potentially varying sizes, and creates a link to the image it wants. Since that image doesn't exist on the web site, the web server returns 404 Not Found, which normally results in a page like this. Instead, we tell the web server (Apache) that WebStore will handle 404s. If the request is for an image, WebStore creates the image file, streams it to the web browser, then deletes the file from disk. Essentially WebStore creates a web server for images (Webdot uses much the same trick, but without the 404 handler).
Debugging a web server called by another web server is tricky (at least for a clumsy programmer like me), but by hacking the WebStore code (and switching on Mediawiki debug logging) I managed to figure out that WebStore seemed to be fetching and streaming the images fine, but they simply didn't appear in the wiki page (I got the little broken image icon instead). I tried alternative ways of dumping the image file to output, adding HTTP headers, all manner of things. Eventually (by accident, no idea how it happened) I managed to get an image URL to display in the Chrome web browser, but it wasn't an image(!): instead I got a PHP warning about two methods in the class DjVuHandler (mustRender and isMultiPage) not being consistent with the class they inherit from. WTF?! Eventually I found the relevant file (DjVu.php in includes/media in the Mediawiki folder), added the parameter $file to both methods, and suddenly everything worked. At this point I didn't know whether to laugh or cry.
OCR text
There are some issues with OCR text from Internet Archive DjVu files. There are some extraneous characters (new lines, etc.) that I need to filter, and I'll probably have to deal with hyphenation. It looks fairly straightforward to edit the proofing extension code to handle these situations.
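As a rough idea of the filtering involved, here is a minimal Python sketch (the proofing extension itself is PHP, so this is an illustration of the logic rather than a patch). The function name is hypothetical. The hyphenation rule is deliberately naive: it joins any word split by a hyphen at a line end when the next line starts with a lowercase letter, so it will occasionally merge a genuine compound.

```python
import re


def clean_ocr_text(text):
    """Tidy raw OCR text: strip stray control characters, rejoin
    end-of-line hyphenation, and reflow lines into paragraphs."""
    # Normalise line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters other than tab and newline (form feeds and so on).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Rejoin words hyphenated across a line break: "phylo-\ngeny" -> "phylogeny".
    text = re.sub(r"(\w)-\n[ \t]*([a-z])", r"\1\2", text)
    # Blank lines separate paragraphs; within a paragraph, collapse all whitespace.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    raw = "The phylo-\ngeny of\x0c\nbirds\n\n\nNext  para\r\n"
    print(clean_ocr_text(raw))
```

Keeping paragraph breaks while collapsing single newlines means the cleaned text still lines up roughly with the page image in a side-by-side view.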
Semantic Mediawiki
Having got the proofing extension working, I then wanted to install the Semantic Mediawiki extensions so that I could support basic inference on the wiki. I approached this with some trepidation as there are issues with Mediawiki namespaces, but everything played nice and so far things seem to be working. Now I can explore whether I can combine the proofing tools from Wikisource with the code I've developed for iTaxon.
BioStor
So, having got something working, the plan is to integrate this with BioStor. One model I like is the video site Metacafe. For each video Metacafe has a custom web page (e.g., http://www.metacafe.com/watch/4137093) with an Edit Video Details link that takes you to a Semantic Mediawiki page where you can edit metadata for the video. I envisage doing something similar for BioStor, where my existing code provides a simple view of an article (perhaps with some nice visualisations), with a link to the corresponding wiki page where you can edit the metadata and correct the OCR text.
Lessons
In the end I got there, although it was a struggle. Mediawiki is a huge, complicated bit of software, and is also part of a larger ecosystem of extensions, so it has enormous power. But there are lots of times when I think it would be easier if I wrote something to replicate the bit of functionality that I want. For example, side-by-side display of text and images would be straightforward to do. But once you start to think about supporting mark-up, user authentication, recording edit history, etc., the idea of using tools others have developed becomes more attractive. And the code is open source, which means if it doesn't work there's a fighting chance I can figure out why, and maybe fix it. It often feels harder than it should be, but I'll find out in the next few days whether yesterday's exertions were worth it.
Wikipedia manuscript
In his 2003 essay E O Wilson outlined his vision for an "encyclopaedia of life" comprising "an electronic page for each species of organism on Earth", each page containing "the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits." Although the "quiet revolution" in biodiversity informatics has generated numerous online resources, including some directly inspired by Wilson's essay (e.g., http://ispecies.org, http://www.eol.org), we are still some way from the goal of having available online all relevant information about a species, such as its taxonomy, evolutionary history, genomics, morphology, ecology, and behaviour. While the biodiversity community has been developing a plethora of databases, some with overlapping goals and duplicated content, Wikipedia has been slowly growing to the point where it now has over 100,000 pages on biological taxa. My goal in this essay is to explore the idea that, largely independent of the efforts of biodiversity informatics and well-funded international efforts, Wikipedia (http://en.wikipedia.org/wiki/Main_Page) has emerged as potentially the best platform for fulfilling E O Wilson's vision.
The content will be familiar to readers of this blog, although the essay is perhaps a slightly more sober assessment of Wikipedia than some of my blog posts would suggest. It was also the first manuscript I'd written in MS Word for a while (not a fun experience), and the first ever for which I'd used Zotero to manage the bibliography (which worked surprisingly well).