Search this keyword

Linking Bulletin of Zoological Nomenclature to BHL

5839_200px.1254992426.png

After some fussing and hair pulling I've constructed a demo of linking a journal to the Biodiversity Heritage Library and displaying the results in Zotero (see my earlier post for rationale).

After some searching I managed to retrieve metadata for several hundred article from the Bulletin of Zoological Nomenclature. Using a local copy of the BHL metadata, I wrote a script that looked up each article in BHL and found the URL of the first page in the article. I then created a Zotero group for this journal and uploaded the linked references.

You can browse the group, and if you belong to Zotero you can join the group and get a local copy of the references (which you can edit and correct, the harvesting won't be perfect).

I've also added these references to my
bioGUID OpenURL resolver, making it easy to find a given article. For example, the OpenURL link http://bioguid.info/openurl/?genre=article&issn=0007-5167&volume=51&spage=7 displays a page for the article "Doris grandiflora Rapp, 1827 (currently Dendrodoris grandiflora; Mollusca, Gastropoda): proposed conservation of the specific name", together with a link to the article in BHL.





Zotero group for Biodiversity Heritage Library content

One thing I find myself doing (probably more often than I should) is adding a reference to my Zotero library for an item in the Biodiversity Heritage Library (BHL). BHL doesn't have article-level metadata (see But where are the articles?), so when I discover a page of interest (e.g., one that contains the original description of a taxon) I store metadata for the article containing that page in my Zotero library. Typically this involves manually step back through scanned pages until I find the start of the article, then store that URL as well as the page number as a Zotero item. As an example, here is the record for Ogilby , J. Douglas (1907). A new tree frog from Brisbane. Proceedings of the Royal Society of Queensland 20:31-32. The URL in the Zotero record http://www.biodiversitylibrary.org/page/13861218 take you to the first page of this article.

One reason for storing metadata in Zotero is so that these reference are made available through the bioGUID OpenURL resolver. This is achieved by regularly harvesting the RSS feed for my Zotero account, and adding items in that feed to the bioGUID database of articles. This makes Zotero highly attractive, as I don't have to write a graphical user interface to create bibliographic records. BHL have their own citation editor in the works ("CiteBank"), based on the ubiquitous Drupal, but I wonder whether Zotero is the better bet -- it has a bunch of nice features, including being able to sync local copies across multiple machines, and store local copies of PDFs (synced using WebDAV).

For fun I've created a Zotero group called Biodiversity Heritage Library which will list articles that I've extracted from BHL. At some point I may investigate automating this process of extracting articles (using existing blbiliographic metadata mapped to BHL page numbers), but for now there a mere 27 manually acquired items listed in the BHL group.




Wikipedia and Gregg's paradox

Continuing the theme of taxonomic classification in Wikipedia, I'm perversely delighted that Wikipedia demonstrates Gregg's paradox so nicely.

1s2mges1n3b5q1bnvf5a3i4u8y_2009-05-31.jpgThe late John R. Gregg wrote several papers and a book exploring the logical structure of taxonomy. His 1954 book The language of taxonomy stimulated a debate a decade later in Systematic Zoology concerning what Buck and Hull (1966) (doi:10.2307/2411628) termed "Gregg's Paradox".

Gregg showed that if we (a) treat taxa as sets defined by extension (i.e., by listing all members), and (b) accept that two sets with exactly the same content must be the same set, then many biological classifications violate these premises because the same taxon may be assigned to multiple levels in the Linnean hierarchy. For example, the aardvark, Orycteropus afer, is the only extant species of the genus Orycteropus, which is the only extant member of the family Orycteropodidae, which in turn is the sole extant representative of the order Tubulidentata. Under Gregg's model, Tubulidentata, Orycteropodidae, and Orycteropus are all the same thing as they have exactly the same content (i.e., Orycteropus afer). Put another way, monotypic taxa are redundant and violate basic set theory. Gregg would argue that they should be eliminated.

aardvark.pngWikipedia illustrates this nicely. Wikipedia conforms to Gregg's model in that taxa are defined by extension (each taxon comprises one or more wiki pages), and if taxa have the same content only one taxon (typically that with the lowest taxonomic rank) has a page in Wikipedia. Put another way, if the aardvark is the sole representative of the Tubulidentata, then there is nothing that could be put on the Tubulidentata page that shouldn't also belong on the page for the aardvark. As a result, the page for the aardvark gives a full classification of this animal, but most taxa in the hierarchy don't have their own pages.

Responses

There are several possible responses to Gregg's paradox. One is to argue that taxa should be defined intentionally (i.e., on the basis of their characters), which was Buck and Hull's approach. Essentially, they were arguing that we could (somewhat arbitrarily) specify properties of Orycteropodidae that weren't shared by all Tubulidentata, and hence we are justified in keeping these taxa separate. Gregg himself was less than impressed by this argument (doi:10.2307/2412017).

Another approach is to suggest that we may discover taxa in the future that will, say, be members of Orycteropus but which aren't O. afer, and that the taxa between the rank suborder and species are placeholders for these discoveries. Indeed, in the case of the Tubulidentata there are extinct aardvarks (doi:10.1163/002829675x00137, doi:10.1016/j.crpv.2005.12.016, and doi:10.1111/j.1096-3642.2008.00460.x) that could be added to Wikipedia, thus justifying the creation of pages for the taxa that Gregg would have us eliminate.

Of course, Gregg's paradox is a consequence of having ranks and requiring each rank (or at least a reasonable subset of them) to exist in a classification. If we ignore ranks, then there's no reason to put any taxa between Afrotheria and Orycteropus afer. So, we could drop this requirement for having taxa at each rank or, of course, drop ranks altogether, which is one of the motivations behind phylogenetic classifications (e.g., the phylocode).

Implications for parsing Wikipedia

From a practical point of view, Gregg's paradox means that one has to be careful parsing Wikipedia Taxoboxes. As I've argued earlier, the simplest way to ensure that a classification is a tree is for each taxon to include a unique parent taxon. The simplest way to extract this for a taxon in a Wikipedia page would be to retrieve the taxon immediately above it in the classification (i.e., for Orycteropus afer this would be Orycteropus). But Orycteropus doesn't have a page in Wikipedia (OK, it does, but it's a redirect to the page for the aardvark). So, we have to go up the classification until we hit Afrotheria before we get a taxon page.

Personally I quite like the fact that a largely forgotten argument from the middle of the last century concerning logic and Linnean taxonomy seems relevant again.