
Comparing Wikipedia and Mammal Species of the World classifications



Continuing the saga of making sense of the mammal classification in Wikipedia, I've done a quick comparison with the Mammal Species of the World (third edition) classification. MSW is the default taxonomic reference used by WikiProject Mammals. I downloaded the MSW taxonomy as a CSV file (warning, it's big), and wrote a script to pull out the classification as a GML file (my preferred graph format).
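A minimal sketch of that CSV-to-GML step might look like the following. It is not the script used here: the column names (`Order`, `Family`, `Genus`, `Species`) and the two-row inline sample are assumptions made for illustration, and the real MSW export has more ranks and fields.

```python
import csv
import io

# Hypothetical column names; the real MSW CSV has more ranks and fields.
RANKS = ["Order", "Family", "Genus", "Species"]

# Tiny inline sample standing in for the downloaded file.
SAMPLE_CSV = """Order,Family,Genus,Species
Chiroptera,Vespertilionidae,Hypsugo,anthonyi
Didelphimorphia,Didelphidae,Philander,opossum
"""

def classification_edges(rows, root="Mammalia"):
    """Return (nodes, edges); each edge is a (parent, child) pair of names."""
    nodes = [root]
    seen = {root}
    edges = []
    for row in rows:
        parent = root
        for rank in RANKS:
            value = (row.get(rank) or "").strip()
            if not value:
                continue  # rank not filled in for this row
            if rank == "Species":
                # Epithets are not unique, so label species with the binomial.
                genus = (row.get("Genus") or "").strip()
                name = f"{genus} {value}"
            else:
                name = value
            if name not in seen:
                seen.add(name)
                nodes.append(name)
                edges.append((parent, name))
            parent = name
    return nodes, edges

def to_gml(nodes, edges):
    """Serialise the classification as a directed GML graph."""
    ids = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines.append(f'  node [ id {i} label "{name}" ]')
    for parent, child in edges:
        lines.append(f"  edge [ source {ids[parent]} target {ids[child]} ]")
    lines.append("]")
    return "\n".join(lines)

nodes, edges = classification_edges(csv.DictReader(io.StringIO(SAMPLE_CSV)))
gml = to_gml(nodes, edges)
print(gml)
```

Labelling species by binomial rather than by epithet keeps node names unique, which matters when the same names are later matched against Wikipedia page titles.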

Based on some earlier work with Gabriel Valiente, I wrote a simple program that takes two trees and highlights the nodes in common to the two trees. I then input into this program the MSW tree, and the largest component of the graph of Wikipedia mammals. The MSW tree has 13582 nodes; the Wikipedia tree has 6287. Note that Wikipedia has more taxa than these 6287 nodes suggest, but they aren't connected to the largest tree (often due to intermediate nodes in the classification lacking a page in Wikipedia). The two trees have 4935 nodes in common (again, this number will be a little low, as there are some weird taxon names due to problems parsing Wikipedia).
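The simplest way to find nodes in common is to match them by name, which can be sketched as below. This is only an illustration under assumptions: the actual program may match nodes differently, the `normalise`, `node_names` and `compare` helpers are hypothetical, and the two toy fragments omit intermediate ranks.

```python
def normalise(name):
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(name.split()).lower()

def node_names(tree):
    """All node names in a tree stored as a child -> parent mapping."""
    names = set()
    for child, parent in tree.items():
        names.add(normalise(child))
        if parent is not None:
            names.add(normalise(parent))
    return names

def compare(tree_a, tree_b):
    """Split the node names into shared and tree-specific sets."""
    a, b = node_names(tree_a), node_names(tree_b)
    return {"common": a & b, "only_a": a - b, "only_b": b - a}

# Toy fragments (intermediate ranks omitted), not the real data.
msw = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Hypsugo": "Chiroptera",
    "Hypsugo anthonyi": "Hypsugo",
}
wikipedia = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Pipistrellus": "Chiroptera",
    "Pipistrellus anthonyi": "Pipistrellus",
}

result = compare(msw, wikipedia)
print(sorted(result["common"]))
print(sorted(result["only_a"]))
print(sorted(result["only_b"]))
```

Pure name matching treats a species placed in different genera by the two sources as two unrelated nodes, so synonyms show up as taxa missing from one side.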

MSW versus Wikipedia
Below is the Wikipedia classification, with taxa that are also in MSW shown in red.
w-msw.jpg


[Larger scale view here]

The impression given is that most Wikipedia mammal pages are in MSW, with some notable exceptions, including higher level taxa such as Afrotheria, and extinct taxa such as the Multituberculata. Some extant taxa are missing due to synonymy. For example, Wikipedia gives the scientific name of Anthony's pipistrelle as Pipistrellus anthonyi, whereas MSW has it as Hypsugo anthonyi.
As an aside, Wikipedia pages often get muddled about parentheses around taxonomic author names. The authority goes in parentheses only if the current genus is not the genus in which the species was originally placed. Hence, Pipistrellus anthonyi (Tate, 1942) should actually be Pipistrellus anthonyi Tate, 1942, as Tate originally described this taxon as a species of Pipistrellus (see hdl:2246/1783). However, the name Hypsugo anthonyi (Tate, 1942) does need parentheses.
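The parentheses rule is mechanical enough to write down as a small function. This is a sketch only; `format_name` and its parameters are hypothetical names, and it assumes the original genus is known.

```python
def format_name(current_genus, epithet, author, year, original_genus):
    """Format a species name, parenthesising the authority only when the
    species has been moved out of the genus it was described in."""
    authority = f"{author}, {year}"
    if current_genus != original_genus:
        authority = f"({authority})"
    return f"{current_genus} {epithet} {authority}"

# Still in its original genus: no parentheses.
print(format_name("Pipistrellus", "anthonyi", "Tate", 1942,
                  original_genus="Pipistrellus"))
# Pipistrellus anthonyi Tate, 1942

# Moved to another genus: parentheses required.
print(format_name("Hypsugo", "anthonyi", "Tate", 1942,
                  original_genus="Pipistrellus"))
# Hypsugo anthonyi (Tate, 1942)
```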


Some Wikipedia taxa also postdate the publication of MSW, such as Philander deltae (see doi:10.1644/05-MAMM-A-065R2.1).


Wikipedia versus MSW
When we do the reverse comparison we see something rather different.

msw-w.jpg


[Larger scale view here]

This is the MSW tree, coloured red where the MSW taxon has a page in Wikipedia. There are big gaps, some of which are due to those pages being in another component (in other words, many "missing" taxa do have pages in Wikipedia; they are just not properly linked to the bigger tree). MSW is also rich in subspecies, which tend to lack their own pages in Wikipedia (possibly a good thing in the case of taxa such as pocket gophers).

It would be nice to make these comparisons automatic, and develop tools so that managing taxonomy in Wikipedia could be made easier.

Mammal tree from Wikipedia

Following on from my previous post about visualising the mammalian classification in Wikipedia, I've extracted the largest component from the graph for all mammal taxa in Wikipedia, and it is a tree. This wasn't apparent in the previous diagram, where the component appeared as a big ball due to the layout algorithm used.
tree.jpg


What this suggests is that Wikipedia contributors are quite capable of generating trees; it's just that not all the bits of the tree are connected (hence all the components in the previous post).
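Pulling out the largest component and checking that it is a tree can be sketched as follows. This is a rough illustration, not the code used for the figure: the `components` and `is_tree` helpers and the toy edge list are assumptions, and the test relies on the standard fact that a connected graph with n nodes is a tree exactly when it has n - 1 edges.

```python
from collections import defaultdict, deque

def components(edges):
    """Connected components of an undirected graph given as (parent, child) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp = set()
        queue = deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            comp.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(comp)
    return result

def is_tree(nodes, edges):
    """True if the connected node set has exactly len(nodes) - 1 distinct edges."""
    inside = {frozenset(e) for e in edges if e[0] in nodes and e[1] in nodes}
    return len(inside) == len(nodes) - 1

# Toy edge list: one bat lineage plus a fragment not linked to Mammalia.
edges = [
    ("Mammalia", "Chiroptera"),
    ("Chiroptera", "Vespertilionidae"),
    ("Vespertilionidae", "Pipistrellus"),
    ("Geomyidae", "Thomomys"),
]

largest = max(components(edges), key=len)
print(sorted(largest))
print(is_tree(largest, edges))
```

Nodes with no edges at all never appear in this edge-based representation, so isolated pages would need to be counted separately.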

As Cyndy Parr suggested in her comments, it would be useful to compare the Wikipedia-derived tree with other trees, say from Mammal Species of the World or ITIS.

Visualising the Wikipedia classification of mammals

As part of my on-going experiments with Wikipedia as a repository of taxonomic information, I've extracted mammal pages from Wikipedia. There's a lot to be done with these, but the first thing I wanted to ask was whether the Wikipedia pages would form a tree (i.e., had the authors of these pages managed to ensure the pages formed a single, coherent taxonomic classification). The answer, as shown in the graph below, is no.
m.jpg


The graph contains 7750 nodes, each one representing a Wikipedia page with a Taxobox containing the class Mammalia. A node is connected to the node corresponding to its parent in the mammalian classification.

If it formed a single classification there would be just one component. Instead, it contains 841 distinct components, many of which you can see at the bottom. If you want to explore the graph, I've made an image map here using the wonderful graph editor yEd. You'll need to move the browser's scroll bars to see the graph. If you click on the node you'll be taken to the corresponding Wikipedia page.
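Counting components is straightforward with a union-find structure. The sketch below is an assumption about how one might do it, not the code behind the figure: `count_components` is a hypothetical helper, the input maps each page to the parent named in its Taxobox, and a page is only linked to its parent if that parent has a page of its own.

```python
def count_components(parent_of):
    """Count connected components in a page -> parent-name mapping.

    Only pages are nodes, so a parent without its own page creates no link.
    """
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving keeps lookups short
            x = root[x]
        return x

    for child, parent in parent_of.items():
        find(child)  # register the page even if it ends up isolated
        if parent in parent_of:
            root[find(child)] = find(parent)
    return len({find(x) for x in list(root)})

# Toy data: "Geomyidae" deliberately has no page here, so the pocket gopher
# pages form a second component.
pages = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Vespertilionidae": "Chiroptera",
    "Pipistrellus": "Vespertilionidae",
    "Thomomys": "Geomyidae",
    "Thomomys bottae": "Thomomys",
}

print(count_components(pages))
```

A single coherent classification would give a count of 1; every missing intermediate page adds at least one extra component.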

Note: The graph has been laid out using yEd's organic layout command, so it won't look tree-like. The diagram is intended for testing connectedness only.

Some of these components may be due to errors in my parser, but many are due to inconsistencies in Wikipedia. Typical problems are Taxoboxes containing taxa for which there is no page in Wikipedia (these are visible as redlinks), or monotypic taxa where the pages for the genus and species are the same.
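Both kinds of problem could be flagged automatically once the Taxobox parents have been extracted. The following is a sketch under assumptions: `find_problems` is a hypothetical helper, the input maps each page title to the parent named in its Taxobox after redirects have been resolved, and a monotypic genus sharing a page with its species is assumed to show up as a page that is its own parent.

```python
def find_problems(parent_of):
    """Return (missing_parents, self_parents) for a page -> parent-name mapping."""
    # Parents named in a Taxobox that have no page of their own (redlinks).
    missing = sorted(
        {p for p in parent_of.values() if p is not None and p not in parent_of}
    )
    # Pages whose parent resolves to the page itself (e.g. shared genus/species page).
    self_parents = sorted(c for c, p in parent_of.items() if c == p)
    return missing, self_parents

# Toy data for illustration only; it does not describe the real state of Wikipedia.
pages = {
    "Mammalia": None,
    "Chiroptera": "Mammalia",
    "Thomomys": "Geomyidae",  # parent has no page in this toy data
    "Platypus": "Platypus",   # genus name assumed to redirect to the species page
}

missing, self_parents = find_problems(pages)
print(missing)
print(self_parents)
```

A report like this would give editors a list of places where the classification breaks, rather than leaving them to spot the gaps by eye.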

Of course, the joy of Wikipedia is that these problems can be easily fixed, but the trick is discovering the problems in the first place. There is a distinct lack of tools to enable Wikipedia editors to view the entire classification of interest and identify areas that need fixing (something Roger Hyam alluded to in his comment on an earlier posting). It would, of course, be great to be able to edit the graph shown above and have those changes automatically transmitted to Wikipedia.