Search this keyword

DjVu XML to HTML

This post is simply a quick note on some experiments with DjVu that I haven't finished. Much of BHL's content is available as DjVu files, which contain both the scanned images and OCR text, complete with co-ordinates of each piece of text. This means that it would, in principle, be trivial to lay out the bounding boxes of each text element on a web page. Reasons for doing this include:

  1. To support Chris Freeland's Holy Grail of Digital Legacy Taxonomic Literature, where user can select text overlaid on BHL scan image.

  2. Developing a DjVu viewer along the lines of Google's very clever Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?).

  3. Highlighting search results on a BHL page image (by highlighting the boxes containing terms the user was searching for).


As an example, here is a BHL page image:



and here's the bounding boxes of the text recognised by OCR overlain on the page image:





























































































































































































































































and here's the bounding boxes of the text recognised by OCR without the page image:































































































































































































































































The HTML is generated using a XSL transformation that take two parameters, an image name and a scale factor, where 1.0 generates HTML at the same size as the original image (which may be rather large). The view above were generated with a scale of 0.1. The XSL is here:


<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:param name="scale"/>
<xsl:param name="image"/>

<xsl:template match="/">
<xsl:apply-templates select="//OBJECT"/>
</xsl:template>

<xsl:template match="//OBJECT">
<div>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>position:relative;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>

<img>
<xsl:attribute name="src">
<xsl:value-of select="$image"/>
</xsl:attribute>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>margin:0px;padding:0px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>
</img>

<xsl:apply-templates select="//WORD"/>

</div>
</xsl:template>

<xsl:template match="//WORD">
<div>
<xsl:attribute name="style">
<xsl:text>position:absolute;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:variable name="coords" select="@coords"/>
<xsl:variable name="minx" select="substring-before($coords,',')"/>
<xsl:variable name="afterminx" select="substring-after($coords,',')"/>
<xsl:variable name="maxy" select="substring-before($afterminx,',')"/>
<xsl:variable name="aftermaxy" select="substring-after($afterminx,',')"/>
<xsl:variable name="maxx" select="substring-before($aftermaxy,',')"/>
<xsl:variable name="aftermaxx" select="substring-after($aftermaxy,',')"/>
<xsl:variable name="miny" select="substring-after($aftermaxy,',')"/>

<xsl:text>left:</xsl:text>
<xsl:value-of select="$minx * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="($maxx - $minx) * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>top:</xsl:text>
<xsl:value-of select="$miny * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="($maxy - $miny) * $scale"/>
<xsl:text>px;</xsl:text>

</xsl:attribute>

<!-- actual text -->
<!-- <xsl:value-of select="." /> -->
</div>
</xsl:template>

</xsl:stylesheet>


TreeView and Windows Vista


Continuing the theme of ancient programs of mine still being used, I've been getting reports that the Windows version of TreeView won't install on Windows Vista and Windows 7. As with NDE, it's the installer that seems to be causing the problem. I've put a new installer on the TreeView web page (direct link here).

TreeView still seems to be used quite a bit, judging from responses to my question on BioStar, even through there are more modern alternatives. I made a brief attempt to create a replacement, namely TreeView X, but it lacks much of the functionality of the original. I keep meaning to revisit TreeView development, as it's been very good to me.

BHL Tech Report

Chris Freeland tweeted about the presentation he gave to a recent BHL meeting, the slides for which are reproduced below:


It makes interesting viewing. From my own (highly biased) perspective slide 10 is especially interesting, in that BioStor, my tool for locating articles in BHL is the 8th largest source of traffic for BHL - 2.46% of visitors to BHL come via BioStor. By way of comparison, the largest single referrer for the same period is Tropics (11.14%), with EOL third at 6.45% and Wikipedia fourth at 5.56%. Obviously this is a small sample, but it is vaguely encouraging. It is nice to see Wikipedia being a significant source of links. EOL has dropped significantly from being the largest referrer (22.51%) in 2008-2009 to third (6.45%) for 2010 to date. In one sense this may reflect BHL achieving greater visibility independently of EOL. I also wonder whether it might reflect the unsatisfactory way EOL displays BHL results? If the BHL results were grouped (for example, by article), then I suspect they may provide more enticing links for EOL users to click on (especially, for example, if the articles were flagged by whether they included the original description of a taxon). It would be fun to know what EOL users actually do (e.g., what fraction of EOL visitors to a given page click on BHL links?).