Wiki frustration

Yesterday I fired off a stream of tweets, starting with:
OK, I'm back to hatin' on the wikis. It's just way to hard to do useful queries (e.g., anything that requires a path linking some entities) [10991434609]

Various people commented on this, either on twitter or in emails, e.g.:
@rdmpage hatin' wikis is irrational - all methods have pros and cons, so wikis are an essential resource among others [@stho002]

So, to clarify, I'm not abandoning wikis. I'm just frustrated with the limitations of Semantic Mediawiki (SMW). Now, SMW is a great piece of software with some cool features. For example,

  1. Storing data in Mediawiki templates (i.e., key-value pairs) makes it rather like a JSON database with version control (shades of CouchDB et al.).

  2. Having Mediawiki underlying SMW makes lots of nice editing and user management features available for free.

  3. The template language makes it relatively easy to format pages.

  4. It is easy to create forms for data entry, so users aren't confronted with editing Mediawiki templates unless they want to.

  5. Supports basic queries.


It's the last item on the list that is causing me grief. The queries are, well, basic. What SMW excels at are queries connecting one page to another. For example, if I create wiki pages for publications, and list the author of each publication, the page for an author can contain a simple query (of the form {{ #ask: [[maker::Author 1]] | }}) that lists all the publications of that author:

q1.png


That's great, and many of the things I want to do can be expressed in this simple way (i.e., find all pages of a certain kind that link to the current page). It's when I want to go more than one page away that things start to go pear-shaped. For example, for a given taxon I want to display a map of where it is found, based on things like georeferenced GenBank sequences, or sequences from georeferenced museum specimens.

q2.png
This can involve a path between several entities ("pages" in SMW), and this just doesn't seem to be possible. I've managed to get some queries working via baroque Mediawiki templates, but it's getting to the point where hours seem to be wasted trying to figure this stuff out.
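
To make the problem concrete, here is a minimal sketch in Python of the kind of multi-step query I mean, run over a handful of made-up triples. The page names, property names, accession numbers and coordinates are all invented for illustration; they are not taken from the wiki, and this is not how SMW stores anything.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples. Every name and value
# here is invented for illustration only.
TRIPLES = [
    ("Sequence:EU000001", "taxon", "Taxon:Aus bus"),
    ("Sequence:EU000001", "voucher", "Specimen:MVZ 12345"),
    ("Specimen:MVZ 12345", "locality", "Locality:Site A"),
    ("Locality:Site A", "latlon", "-17.5,145.6"),
    ("Sequence:EU000002", "taxon", "Taxon:Aus bus"),
    ("Sequence:EU000002", "latlon", "-16.9,145.7"),
]


def build_indexes(triples):
    """Index triples so links can be followed in either direction."""
    forward = defaultdict(list)   # (subject, predicate) -> objects
    backward = defaultdict(list)  # (predicate, object) -> subjects
    for s, p, o in triples:
        forward[(s, p)].append(o)
        backward[(p, o)].append(s)
    return forward, backward


def follow(forward, starts, path):
    """Follow a chain of predicates outwards from each start node."""
    nodes = list(starts)
    for predicate in path:
        nodes = [o for n in nodes for o in forward.get((n, predicate), [])]
    return nodes


def points_for_taxon(triples, taxon):
    """Collect coordinates for a taxon, whether they sit on the sequence
    itself or two pages further away on the voucher specimen's locality."""
    forward, backward = build_indexes(triples)
    sequences = backward.get(("taxon", taxon), [])
    direct = follow(forward, sequences, ["latlon"])
    via_specimen = follow(forward, sequences, ["voucher", "locality", "latlon"])
    return sorted(set(direct + via_specimen))


print(points_for_taxon(TRIPLES, "Taxon:Aus bus"))
```

The first step (sequences that point at the taxon) is the sort of thing a single #ask query handles well. It is the second step, following voucher, then locality, then coordinates, that needs a path.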

So, what tends to happen is I develop something, hit this brick wall, then go and do something else. I suspect that one way forward is to use SMW as the tool to edit data and display basic links, then use another tool to do the deeper queries. This is a bit like what I'm exploring with BioStor, where I've written my own interface and queries over a simple MySQL database, but I'm looking into SMW + Wikisource to handle annotation.

This leaves the question of how to move forward with http://iphylo.org/treebase/? One approach would be to harvest the SMW pages regularly (e.g., by consuming the RSS feed, and either pulling off SMW RDF or parsing the template source code for the pages), and use this to populate a database (say, a key-value store or a triple store) where more sophisticated queries can be developed. I guess one could either make this a separate interface, or develop it as an SMW extension, so results could be viewed within the wiki pages. Both approaches have merits. Having a completely separate tool that harvests the phylogeny wiki seems attractive, and in many ways is an obvious direction for iSpecies to take. Imagine iSpecies rebuilt as an RDF aggregator, where all manner of data about a taxon (or sequence, or specimen, or publication, or person) could be displayed in one place, but the task of data cleaning took place elsewhere.
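
As a rough sketch of the "parse the template source" option, the Python below turns one flat template call into key-value pairs and then into triples. The template name, field names and page title are hypothetical (apart from maker, which is the property used in the query above), and the parser deliberately ignores nested templates and wiki links that contain a pipe, so treat it as a starting point rather than a real harvester.

```python
import re

# Hypothetical page source: one flat template call with key=value fields.
EXAMPLE = """{{Publication
|title=An example paper
|maker=Author 1
|year=2010
}}"""


def parse_template(wikitext):
    """Parse the first flat {{Name|key=value|...}} call found in wikitext.

    Returns (template_name, fields). Nested templates and links containing
    '|' are not handled; positional (unnamed) parameters are skipped.
    """
    match = re.search(r"\{\{(.*?)\}\}", wikitext, re.DOTALL)
    if match is None:
        return None, {}
    parts = match.group(1).split("|")
    name = parts[0].strip()
    fields = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return name, fields


def page_to_triples(title, wikitext):
    """Turn a page's template fields into (page, property, value) triples."""
    name, fields = parse_template(wikitext)
    triples = []
    if name is not None:
        triples.append((title, "template", name))
    for key, value in fields.items():
        triples.append((title, key, value))
    return triples


for triple in page_to_triples("Pub 1", EXAMPLE):
    print(triple)
```

Triples in this shape could then be loaded into whatever store ends up doing the deeper queries.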

Food for thought. And, given that it seems some people are wondering why on earth I bother with this stuff, and can't I just finish TreeView X, I could always go back to fussing with C++…

DjVu XML to HTML

This post is simply a quick note on some experiments with DjVu that I haven't finished. Much of BHL's content is available as DjVu files, which contain both the scanned images and OCR text, complete with co-ordinates of each piece of text. This means that it would, in principle, be trivial to lay out the bounding boxes of each text element on a web page. Reasons for doing this include:

  1. To support Chris Freeland's Holy Grail of Digital Legacy Taxonomic Literature, where a user can select text overlaid on a BHL scan image.

  2. Developing a DjVu viewer along the lines of Google's very clever Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?).

  3. Highlighting search results on a BHL page image (by highlighting the boxes containing terms the user was searching for).
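
For the third item, a minimal Python sketch might look like the following. It assumes a cut-down DjVu XML document in which each WORD element has a coords attribute with four comma-separated values in the order min x, max y, max x, min y (the order the stylesheet later in this post relies on). The sample page, its dimensions and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified sample; real DjVu XML has more elements
# and attributes than this.
SAMPLE = """<DjVuXML><BODY>
<OBJECT width="2000" height="3000">
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>
<WORD coords="100,250,400,200">Rana</WORD>
<WORD coords="420,250,800,200">temporaria</WORD>
</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT></BODY></DjVuXML>"""


def matching_boxes(xml_text, term, scale=1.0):
    """Return scaled boxes (in pixels) for words equal to the search term.

    The match is case-insensitive and on whole words only. Coordinates are
    assumed to be exactly four values: min x, max y, max x, min y.
    """
    root = ET.fromstring(xml_text)
    wanted = term.lower()
    boxes = []
    for word in root.iter("WORD"):
        text = (word.text or "").strip()
        if text.lower() != wanted:
            continue
        minx, maxy, maxx, miny = (float(v) for v in word.get("coords").split(","))
        boxes.append({
            "text": text,
            "left": minx * scale,
            "top": miny * scale,
            "width": (maxx - minx) * scale,
            "height": (maxy - miny) * scale,
        })
    return boxes


print(matching_boxes(SAMPLE, "rana", scale=0.5))
```

Each box could then be drawn as an absolutely positioned element over the page image, with a highlight colour instead of a plain border.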


As an example, here is a BHL page image:



and here are the bounding boxes of the text recognised by OCR overlain on the page image:

and here are the bounding boxes of the text recognised by OCR without the page image:

The HTML is generated using an XSL transformation that takes two parameters, an image name and a scale factor, where 1.0 generates HTML at the same size as the original image (which may be rather large). The views above were generated with a scale of 0.1. The XSL is here:


<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" version="1.0" encoding="utf-8" indent="yes"/>

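<!-- scale: multiplier applied to all coordinates (defaults to 1.0, i.e. original size) -->
<xsl:param name="scale" select="1.0"/>
<!-- image: URL or file name of the page image -->
<xsl:param name="image"/>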

<xsl:template match="/">
<xsl:apply-templates select="//OBJECT"/>
</xsl:template>

<xsl:template match="//OBJECT">
<div>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>position:relative;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>

<img>
<xsl:attribute name="src">
<xsl:value-of select="$image"/>
</xsl:attribute>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>margin:0px;padding:0px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>
</img>

<!-- only the words inside this OBJECT (page), not every WORD in the document -->
<xsl:apply-templates select=".//WORD"/>

</div>
</xsl:template>

<xsl:template match="//WORD">
<div>
<xsl:attribute name="style">
<xsl:text>position:absolute;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<!-- coords is read as "minx,maxy,maxx,miny"; peel off one value at a time -->
<xsl:variable name="coords" select="@coords"/>
<xsl:variable name="minx" select="substring-before($coords,',')"/>
<xsl:variable name="afterminx" select="substring-after($coords,',')"/>
<xsl:variable name="maxy" select="substring-before($afterminx,',')"/>
<xsl:variable name="aftermaxy" select="substring-after($afterminx,',')"/>
<xsl:variable name="maxx" select="substring-before($aftermaxy,',')"/>
<xsl:variable name="aftermaxx" select="substring-after($aftermaxy,',')"/>
<xsl:variable name="miny" select="$aftermaxx"/>

<xsl:text>left:</xsl:text>
<xsl:value-of select="$minx * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="($maxx - $minx) * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>top:</xsl:text>
<xsl:value-of select="$miny * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="($maxy - $miny) * $scale"/>
<xsl:text>px;</xsl:text>

</xsl:attribute>

<!-- actual text -->
<!-- <xsl:value-of select="." /> -->
</div>
</xsl:template>

</xsl:stylesheet>
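
If you don't have an XSLT processor to hand, the same arithmetic can be sketched in Python using only the standard library. This is a rough port for illustration, not the code used to make the views above. The sample XML is invented and much simpler than real DjVu XML, the function names are my own, and it assumes coords holds exactly four values in the order min x, max y, max x, min y.

```python
import html
import xml.etree.ElementTree as ET

# Invented, simplified sample page.
SAMPLE = """<DjVuXML><BODY>
<OBJECT width="2000" height="3000">
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>
<WORD coords="100,250,400,200">Rana</WORD>
<WORD coords="420,250,800,200">temporaria</WORD>
</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT></BODY></DjVuXML>"""

BORDER = "border:1px solid rgb(128,128,128);"


def px(value):
    """Format a number as a CSS pixel length, rounded to 2 decimals."""
    return f"{round(value, 2)}px"


def page_to_html(xml_text, image, scale):
    """Build one positioned <div> per OBJECT, with a box for each WORD."""
    root = ET.fromstring(xml_text)
    parts = []
    for obj in root.iter("OBJECT"):
        size = (f"width:{px(float(obj.get('width')) * scale)};"
                f"height:{px(float(obj.get('height')) * scale)};")
        parts.append(f'<div style="position:relative;{BORDER}{size}">')
        parts.append(f'<img src="{html.escape(image, quote=True)}" '
                     f'style="margin:0px;padding:0px;{size}"/>')
        # Only the words inside this OBJECT are laid out on this page.
        for word in obj.iter("WORD"):
            minx, maxy, maxx, miny = (
                float(v) for v in word.get("coords").split(","))
            parts.append(
                f'<div style="position:absolute;{BORDER}'
                f'left:{px(minx * scale)};width:{px((maxx - minx) * scale)};'
                f'top:{px(miny * scale)};height:{px((maxy - miny) * scale)};">'
                '</div>')
        parts.append("</div>")
    return "\n".join(parts)


print(page_to_html(SAMPLE, "page.png", 0.5))
```

As in the stylesheet, the word divs are left empty, so only the boxes are drawn.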


TreeView and Windows Vista


Continuing the theme of ancient programs of mine still being used, I've been getting reports that the Windows version of TreeView won't install on Windows Vista and Windows 7. As with NDE, it's the installer that seems to be causing the problem. I've put a new installer on the TreeView web page (direct link here).

TreeView still seems to be used quite a bit, judging from responses to my question on BioStar, even though there are more modern alternatives. I made a brief attempt to create a replacement, namely TreeView X, but it lacks much of the functionality of the original. I keep meaning to revisit TreeView development, as it's been very good to me.