
Wikipedia and Gregg's paradox

Continuing the theme of taxonomic classification in Wikipedia, I'm perversely delighted that Wikipedia demonstrates Gregg's paradox so nicely.

The late John R. Gregg wrote several papers and a book exploring the logical structure of taxonomy. His 1954 book The language of taxonomy stimulated a debate a decade later in Systematic Zoology concerning what Buck and Hull (1966) (doi:10.2307/2411628) termed "Gregg's Paradox".

Gregg showed that if we (a) treat taxa as sets defined by extension (i.e., by listing all members), and (b) accept that two sets with exactly the same content must be the same set, then many biological classifications violate these premises because the same taxon may be assigned to multiple levels in the Linnean hierarchy. For example, the aardvark, Orycteropus afer, is the only extant species of the genus Orycteropus, which is the only extant member of the family Orycteropodidae, which in turn is the sole extant representative of the order Tubulidentata. Under Gregg's model, Tubulidentata, Orycteropodidae, and Orycteropus are all the same thing as they have exactly the same content (i.e., Orycteropus afer). Put another way, monotypic taxa are redundant and violate basic set theory. Gregg would argue that they should be eliminated.
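The argument can be made concrete with a small Python sketch. It treats each taxon as a set defined by extension and groups together taxon names whose extensions are identical. The function name `redundant_groups` is made up for illustration, and the Afrotheria entry is a deliberately incomplete membership list added only for contrast.

```python
# Taxa defined by extension: each taxon is simply the set of its (extant) species.
# "Afrotheria" is given an intentionally partial membership, just to show a
# taxon whose extension differs from the monotypic ones.
taxa = {
    "Afrotheria": frozenset({"Orycteropus afer", "Loxodonta africana"}),
    "Tubulidentata": frozenset({"Orycteropus afer"}),
    "Orycteropodidae": frozenset({"Orycteropus afer"}),
    "Orycteropus": frozenset({"Orycteropus afer"}),
}

def redundant_groups(taxa):
    """Return lists of taxon names that share exactly the same extension."""
    by_extension = {}
    for name, members in taxa.items():
        by_extension.setdefault(members, []).append(name)
    return [sorted(names) for names in by_extension.values() if len(names) > 1]

groups = redundant_groups(taxa)
print(groups)
```

Under premise (b), every name in a returned group denotes the same set, which is exactly the redundancy Gregg objected to.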

Wikipedia illustrates this nicely. Wikipedia conforms to Gregg's model in that taxa are defined by extension (each taxon comprises one or more wiki pages), and if taxa have the same content only one taxon (typically that with the lowest taxonomic rank) has a page in Wikipedia. Put another way, if the aardvark is the sole representative of the Tubulidentata, then there is nothing that could be put on the Tubulidentata page that shouldn't also belong on the page for the aardvark. As a result, the page for the aardvark gives a full classification of this animal, but most taxa in the hierarchy don't have their own pages.

Responses

There are several possible responses to Gregg's paradox. One is to argue that taxa should be defined intensionally (i.e., on the basis of their characters), which was Buck and Hull's approach. Essentially, they were arguing that we could (somewhat arbitrarily) specify properties of Orycteropodidae that weren't shared by all Tubulidentata, and hence we are justified in keeping these taxa separate. Gregg himself was less than impressed by this argument (doi:10.2307/2412017).

Another approach is to suggest that we may discover taxa in the future that will, say, be members of Orycteropus but which aren't O. afer, and that the taxa between the rank suborder and species are placeholders for these discoveries. Indeed, in the case of the Tubulidentata there are extinct aardvarks (doi:10.1163/002829675x00137, doi:10.1016/j.crpv.2005.12.016, and doi:10.1111/j.1096-3642.2008.00460.x) that could be added to Wikipedia, thus justifying the creation of pages for the taxa that Gregg would have us eliminate.

Of course, Gregg's paradox is a consequence of having ranks and requiring each rank (or at least a reasonable subset of them) to exist in a classification. If we ignore ranks, then there's no reason to put any taxa between Afrotheria and Orycteropus afer. So, we could drop this requirement for having taxa at each rank or, of course, drop ranks altogether, which is one of the motivations behind phylogenetic classifications (e.g., the PhyloCode).

Implications for parsing Wikipedia

From a practical point of view, Gregg's paradox means that one has to be careful parsing Wikipedia Taxoboxes. As I've argued earlier, the simplest way to ensure that a classification is a tree is for each taxon to include a unique parent taxon. The simplest way to extract this for a taxon in a Wikipedia page would be to retrieve the taxon immediately above it in the classification (i.e., for Orycteropus afer this would be Orycteropus). But Orycteropus doesn't have a page in Wikipedia (OK, it does, but it's a redirect to the page for the aardvark). So, we have to go up the classification until we hit Afrotheria before we get a taxon page.
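Here is a minimal sketch of that walk up the classification, in Python. It assumes the Taxobox lineage has already been extracted as a list ordered from root to tip, and that we know which names have a real page rather than a redirect. Both `lineage` and `has_own_page` are hand-written stand-ins reflecting the situation described above, not the output of any Wikipedia API.

```python
def nearest_parent_with_page(lineage, has_own_page):
    """Walk up from the taxon (last item) and return the first ancestor
    that has its own page, or None if no ancestor does."""
    for ancestor in reversed(lineage[:-1]):
        if has_own_page.get(ancestor, False):
            return ancestor
    return None

# Lineage ordered root -> tip; the last entry is the taxon of interest.
lineage = ["Afrotheria", "Tubulidentata", "Orycteropodidae",
           "Orycteropus", "Orycteropus afer"]

# Assumed page status: the intermediate names redirect to the aardvark page.
has_own_page = {
    "Afrotheria": True,
    "Tubulidentata": False,
    "Orycteropodidae": False,
    "Orycteropus": False,
}

parent = nearest_parent_with_page(lineage, has_own_page)
print(parent)
```

Names missing from `has_own_page` are treated as having no page, which errs on the side of continuing up the lineage.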

Personally I quite like the fact that a largely forgotten argument from the middle of the last century concerning logic and Linnean taxonomy seems relevant again.

Wikipedia's taxonomic classification is badly broken

Wikipedia is wonderful, but parts of it are horribly broken. Take, for example, taxonomic classifications. A classification is a rooted tree, which means that each node in the tree (other than the root) has a single parent. We can store trees in databases in a variety of ways. For example, for each node we could store a list of its children, or we could store the single unique parent of each node. Ideally we'd choose to store one or the other, but not both. If we store both sets of statements (i.e., that node A has node B as one of its children, and that node B's parent is node A) then there is enormous potential for these two statements to get out of sync.
tree.png
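As a rough illustration of storing only the parent of each node, the Python sketch below keeps a single `parent_of` mapping and derives everything else (the lineage and the child list) from it on demand, so there is no second statement to drift out of sync. The node names and helper functions are invented for the example.

```python
def lineage(node, parent_of):
    """Return the path from node up to the root by following parent links."""
    path = [node]
    seen = {node}
    while parent_of.get(path[-1]) is not None:
        nxt = parent_of[path[-1]]
        if nxt in seen:
            # A cycle means the data is not a tree at all.
            raise ValueError(f"cycle detected at {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path

def children_of(node, parent_of):
    """Derive the children of a node from the parent links."""
    return sorted(child for child, parent in parent_of.items() if parent == node)

# One statement per node: its unique parent (None marks the root).
parent_of = {"A": None, "B": "A", "C": "A", "D": "B"}

print(lineage("D", parent_of))
print(children_of("A", parent_of))
```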


This is what has happened in Wikipedia. Each page for a taxon lists the lineage to which it belongs (i.e., its parent, and its parent's parent, and so on), and also lists the children of that node. What this means is that if somebody edits the page for taxon A and adds taxon B as a child, they also need to edit the page for taxon B to make A its parent. If only one of these two edits is made the classification may end up internally inconsistent.

For example, the page for Amphibia lists the classification of Amphibia like this:
a1.png

It also lists the child taxa of Amphibia:
a2.png

So, the children of Amphibia are Temnospondyli, Lepospondyli, and Lissamphibia. Furthermore, Anura, Caudata, and Gymnophiona are children of Lissamphibia:

child.png


Given this, if I go to the pages for Anura, Caudata, and Gymnophiona I should see that each of these taxa lists Lissamphibia as its parent. However, only one of these (Caudata) does: Anura and Gymnophiona both have Amphibia as their parent, not Lissamphibia.
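This kind of check is easy to automate. The Python sketch below compares the "has child" statements with the "has parent" statements and reports every child whose own page names a different parent. The two dictionaries are typed in by hand from the pages described above (nothing is fetched from Wikipedia), and `find_mismatches` is a hypothetical helper name.

```python
# "Has child" statements, as listed on the parent pages.
has_child = {
    "Amphibia": ["Temnospondyli", "Lepospondyli", "Lissamphibia"],
    "Lissamphibia": ["Anura", "Caudata", "Gymnophiona"],
}

# "Has parent" statements, as listed on each child's own page.
has_parent = {
    "Anura": "Amphibia",
    "Caudata": "Lissamphibia",
    "Gymnophiona": "Amphibia",
}

def find_mismatches(has_child, has_parent):
    """Return (child, claimed_by, stated_parent) wherever the two disagree.
    Children whose own parent statement is unknown are skipped."""
    mismatches = []
    for parent, children in has_child.items():
        for child in children:
            stated = has_parent.get(child)
            if stated is not None and stated != parent:
                mismatches.append((child, parent, stated))
    return mismatches

mismatches = find_mismatches(has_child, has_parent)
for child, claimed_by, stated in mismatches:
    print(f"{claimed_by} claims {child}, but {child} says its parent is {stated}")
```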

The diagram below shows the taxa that have Amphibia as their parent:
parent.png


Note that Stegocephalia have now turned up as an additional amphibian order, and that only Caudata is included in Lissamphibia. But what is striking is that another 274 Wikipedia taxon pages have Amphibia as their parent. These pages are all for fossil amphibians that do not fit easily in the existing Wikipedia classification.

From the perspective of building a database, the "has parent" relationship is the one I'd prefer to use, because that statement is going to be made just once (on the page for the taxon of interest). This seems a lot safer than making the statement "has child" on another page (for one thing, more than one page could claim a taxon as its child, which again will break the tree). But if we use the "has parent" relationship, our tree will be very bushy, with lots of fossil amphibian genera attached to the Amphibia node. This is going to make the tree hard to interpret, because this basal bush isn't saying that all these genera radiated off at once, but rather that we don't really know where in the amphibian tree these things go, so we'll have to settle for saying merely "they are amphibians" (for the cladistic theorists among you, this is Nelson and Platnick's "interpretation 2" in their "Multiple Branching in Cladograms: Two Interpretations", doi:10.2307/2412630).
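To see how bushy a "has parent" tree is, one could group taxa by their stated parent and flag the nodes with many children. The Python sketch below does that for the handful of amphibian taxa mentioned above. The threshold `min_children` is an arbitrary choice for the example, and the mapping is a tiny hand-entered sample, not the full set of Wikipedia pages.

```python
from collections import defaultdict

# A small sample of "has parent" statements taken from the text above.
has_parent = {
    "Anura": "Amphibia",
    "Gymnophiona": "Amphibia",
    "Stegocephalia": "Amphibia",
    "Caudata": "Lissamphibia",
}

def children_by_parent(has_parent):
    """Invert the child -> parent map into parent -> sorted list of children."""
    children = defaultdict(list)
    for child, parent in has_parent.items():
        children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}

def bushes(has_parent, min_children=3):
    """Return the nodes with at least min_children direct children."""
    return {parent: kids
            for parent, kids in children_by_parent(has_parent).items()
            if len(kids) >= min_children}

bushy = bushes(has_parent)
print(bushy)
```

A large child count only flags a candidate bush. Whether it reflects a real radiation or simply uncertain placement still has to be judged by hand.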

So, the dilemma is whether to use "has child" relationships, and accept that these are likely to be inconsistent with the inverse "has parent" relationship, or use the "has parent" relationship, which will be internally consistent, but at the cost of potentially very large, unresolved bushes due to fossil taxa of uncertain affinities.

Biodiversity Heritage Library sparklines

Time for a quick and dirty Friday afternoon hack. Based on responses to the BHL timeline I released two days ago, I've created a version that can compare the history of two names using sparklines (created using Google's Chart API). I use sparklines to give a quick summary of hits over time (grouped by decade).
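The grouping step is simple enough to sketch in Python. The example below bins a list of publication years into decades and, in place of a chart image, draws a rough text sparkline with Unicode block characters. The years are made-up sample data rather than real BHL hits, and the function names are invented for illustration. This is not the code behind the demo.

```python
from collections import Counter

def hits_per_decade(years, start, end):
    """Count hits per decade from start to end (inclusive), filling gaps with 0."""
    counts = Counter((year // 10) * 10 for year in years)
    return [counts.get(decade, 0) for decade in range(start, end + 1, 10)]

def text_sparkline(values):
    """Map each value to one of eight block characters, scaled to the maximum."""
    bars = "▁▂▃▄▅▆▇█"
    top = max(values) if values and max(values) > 0 else 1
    return "".join(bars[(v * (len(bars) - 1)) // top] for v in values)

# Hypothetical publication years in which a name was found.
sample_years = [1758, 1821, 1825, 1829, 1903]

series = hits_per_decade(sample_years, 1750, 1900)
print(series)
print(text_sparkline(series))
```

Filling empty decades with zeros keeps the two series aligned when comparing names, since each position always stands for the same decade.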

The demo is here. It's crude (minimal error checking, no progress bars while it talks to BHL), but it's home time. As an example, here is a screen shot comparing the occurrences in BHL for two rival names for the sperm whale, Physeter catodon and Physeter macrocephalus:
physeter.png

There is a link to the full timeline for each of these names so you can investigate more. Note that the sparklines will be heavily biased by BHL coverage, but it may yield some interesting insights into the history of the usage of a name.