Showing posts with label Wikipedia. Show all posts

Google Knowledge Graph using data from BBC and Wikipedia

Google's Knowledge Graph can enhance search results by displaying some structured information about a hit in your list of results. It's available in the US (i.e., you need to use www.google.com), although I have seen it occasionally appear for google.co.uk.

Fruitbat
Here is what Google displays for Eidolon helvum (the straw-coloured fruit bat). You get a snippet of text from Wikipedia, and also a map from the BBC Nature Wildlife site. Wikipedia is a well-known source of structured data (in that you can mine the infoboxes for information). The BBC site has some embedded RDFa and structured HTML, and you can also get RDF (just append ".rdf" to the URL, i.e., http://www.bbc.co.uk/nature/life/Straw-coloured_Fruit_Bat.rdf). There doesn't seem to be anything in the RDF about the distribution map, so presumably Google are extracting that information from the HTML.

It would be interesting to think about what other biodiversity data providers, such as GBIF and EOL, could do to get their data incorporated into Google's Knowledge Graph, and eventually into these search result snippets.

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

Hyla

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or by going to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or, if you are a serious RDF geek, break out the SPARQL, and you can extract the GBIF taxon id for an NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
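As a rough sketch of the extraction step, here is one way to pull the GBIF id out of the RDF with Python's standard library and build the density service URL from it. The RDF string is a trimmed, hand-written stand-in for the real export (which is much longer), and the function names are made up for illustration; only the URL patterns come from the examples above.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

# Hand-written stand-in for the wiki's RDF export (assumption: the real
# export wraps the rdfs:seeAlso statement in a similar rdf:RDF document).
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://iphylo.org/linkout/Ncbi:109175">
    <rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>
  </rdf:Description>
</rdf:RDF>"""

GBIF_SPECIES = re.compile(r"^http://data\.gbif\.org/species/(\d+)/?$")


def export_rdf_url(ncbi_tax_id):
    """URL of the RDF export for an NCBI taxon page on the wiki."""
    return f"http://iphylo.org/linkout/Special:ExportRDF/Ncbi:{ncbi_tax_id}"


def gbif_ids_from_rdf(rdf_xml):
    """Return every GBIF species id mentioned in an rdfs:seeAlso statement."""
    root = ET.fromstring(rdf_xml)
    ids = []
    for node in root.iter(f"{{{RDFS_NS}}}seeAlso"):
        resource = node.get(f"{{{RDF_NS}}}resource", "")
        match = GBIF_SPECIES.match(resource)
        if match:
            ids.append(int(match.group(1)))
    return ids


def density_url(gbif_id):
    """URL of the GBIF occurrence density service for a taxon concept."""
    return f"http://data.gbif.org/ws/rest/density/list?taxonconceptkey={gbif_id}"


for gbif_id in gbif_ids_from_rdf(SAMPLE_RDF):
    print(density_url(gbif_id))
```

In practice you would fetch the document at export_rdf_url(109175) and pass its text to gbif_ids_from_rdf; the sketch keeps the network call out so that it runs offline.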

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Paper on NCBI and Wikipedia published in PLoS Currents: Tree of Life

__logo__1.jpg
My paper describing the mapping between NCBI and Wikipedia has been published in PLoS Currents: Tree of Life. You can see the paper here. It's only just gone live, so it's yet to get a PubMed Central number (one of the nice features of PLoS Currents is that the articles get archived in PMC).

Publishing in PLoS Currents: Tree of Life was a pleasant experience. The Google Knol editing environment was easy to use, and the reviewing process quick. It's obviously a new and rather experimental journal, and there are a few things that could be improved. Automatically looking up articles by PubMed identifier is nice, but it would also be great to do this for DOIs as well. Furthermore, the PubMed identifiers aren't displayed as clickable links, which rather defeats the point of having references on the web (I've added DOI links to the articles wherever possible). But, minor grumbles aside, as a way to get an Open Access article published for free, and have it archived in PubMed Central, PLoS Currents is hard to beat. What will be interesting is whether the article receives any comments. This seems to be one area online journals haven't really cracked — providing an environment where people want to engage in discussion.

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki, it can be edited, so the kinds of problems I encountered with TBMap (recounted here, here, and here) can be corrected. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.

Zooming a large tree, now with thumbnails

Continuing experiments with a zoom viewer for large trees (see previous post), I've now made a demo where the labels are clickable. If the NCBI taxon has an equivalent page in Wikipedia the demo displays a link to that page (and, if present, a thumbnail image). Give it a try at

http://iphylo.org/~rpage/deeptree/3.html

or watch the short video clip below:

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13 (2010). Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if, at the time the authors uploaded their data, they planned to submit their paper to one journal, but it ended up being accepted in another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this year's new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is I'd really like a phylogeny database that was visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:

myzo.png


troidini.png


protea.png


Snapshot 2010-12-15 10-38-02.png


Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there are things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).
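To make the "single document" idea concrete, here is a minimal sketch of what such a document might look like, built as a Python dictionary and serialised to JSON. Apart from "_id", which is CouchDB's document identifier field, every field name here is invented for illustration, as are the Newick encoding of the tree, the document id, and the example.org thumbnail URLs. This is not the actual schema used for the trees shown above.

```python
import json


def make_tree_document(doc_id, newick, taxa):
    """Bundle a tree and its taxon images into one JSON-serialisable document."""
    return {
        "_id": doc_id,     # CouchDB's document identifier field
        "newick": newick,  # tree stored as a Newick string (an assumption)
        "taxa": taxa,      # one entry per tip, with optional Wikipedia details
    }


def thumbnails(doc):
    """Return (label, image URL) pairs for the taxa that have an image."""
    return [(t["label"], t["thumbnail"]) for t in doc["taxa"] if t.get("thumbnail")]


doc = make_tree_document(
    "example-tree-1",  # hypothetical identifier
    "(Xenopus,Banksia,Unmapped_taxon);",
    [
        {"label": "Xenopus", "ncbi": 8353, "wikipedia": "Xenopus",
         "thumbnail": "http://example.org/thumbs/Xenopus.jpg"},
        {"label": "Banksia", "ncbi": 83698, "wikipedia": "Banksia",
         "thumbnail": "http://example.org/thumbs/Banksia.jpg"},
        {"label": "Unmapped_taxon"},  # no NCBI mapping, so no image
    ],
)

body = json.dumps(doc)  # this string is what would be sent to CouchDB
print(thumbnails(doc))
```

Because the tree and the images travel together, drawing the page needs only one read of one document, which is the appeal of this design over joining several tables.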

We could expand this further by extracting geographical information (say, from the sequences included in the study) and make a map, or eventually a phylogeny on Google Earth (see David Kidd's recent "Geophylogenies and the Map of Life" for a manifesto, doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:

treebasephylogeny.png

Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?

Wikipedia paper out

cover-medium.jpg
My short note on "Wikipedia as an Encyclopaedia of Life" has appeared in Organisms Diversity & Evolution (doi:10.1007/s13127-010-0028-9) (yes, I do occasionally write papers). A preprint of this paper is available on Nature Precedings (hdl:10101/npre.2010.4242.1).

My presentation at iEvoBio covers much the same ground, and is included below, although the paper was written before I made the mapping from NCBI taxa to Wikipedia pages.


Mashing up NCBI and Wikipedia using treemaps

Having made a first stab at mapping NCBI taxa to Wikipedia, I thought it might be fun to see what could be done with it. I've always wanted to get quantum treemaps working (quantum treemaps ensure that the cells in the treemap are all the same size, see my 2006[!] blog post for further description and links). After some fussing I have some code that seems to do the trick. As an example, here is a quantum treemap for Laurasiatheria.

qt.png
The diagram shows the NCBI taxonomy subtree rooted on Laurasiatheria, with images (where available) from Wikipedia for the children of the children of that node. In other words, the images correspond to the tips of the tree below:

laurasiatheria.png

There's a lot to be done to tidy this up, but there is potential to create a nice, visual way to navigate through the NCBI taxonomy (it might work well on the iPhone or iPad, for example).

NCBI to Wikipedia links are now live...

The 52,956 links from NCBI to Wikipedia that I've been busy creating are now "live." If you go to an NCBI taxon such as Sphaerius you'll see something like this:

linkout.png

Clicking the "Wikipedia" link takes you to the Wikipedia page for this taxon. You can see all the links to Wikipedia using the query loproviphylo[filter]. Here are some additional links to try:

NCBI     Wikipedia
8353     Xenopus
83698    Banksia
9766     Balaenoptera

Thanks to Scott Federhen and Kathy Kwan at NCBI for all their assistance in getting this into NCBI Linkout.

Fixing errors
There will be errors and omissions. The best way to fix these is by using the iPhylo Linkout wiki. The page for an NCBI taxon is always http://iphylo.org/linkout/Ncbi:xxxx where xxxx is the NCBI taxonomy id. You can edit/annotate the link there (click on the "edit with form" for a simple web form). I plan to regularly update the links based on the wiki.

Future
NCBI Linkout provides access statistics so it will be interesting to see how much traffic goes from NCBI to Wikipedia. It will also be interesting to see if this is correlated with increased editing of those Wikipedia pages.

Linking NCBI to Wikipedia

180px-Sphaerius.acaroides.Reitter.tafel64.jpg
In an earlier post I discussed linking NCBI taxonomy to Wikipedia. One way to tackle this is to add NCBI Taxonomy IDs to Wikipedia pages. I reopened the case for adding the Taxonomy IDs to the Taxobox on each taxon page, but this met with substantial resistance. A modified proposal to add them elsewhere to the Wikipedia page seems to be gaining more support (or, at least, less vigorous resistance).

Meanwhile, there are other things that need to be done to link NCBI and Wikipedia. One is to add Wikipedia page names to NCBI Linkout so that when viewing an NCBI taxon page you will see a link to Wikipedia if a page for the corresponding taxon exists. To create this linkout we need a mapping from NCBI to Wikipedia, and that's what I've been working on for the last few days.

The mapping is still in progress, but essentially I've taken a dump of the NCBI taxonomy for June 3, 2010, and matched the names with those in the June 18, 2009 dump of Wikipedia that I've analysed elsewhere on this blog. I'll detail the various steps in the mapping elsewhere (there are issues such as synonyms, homonyms, Wikipedia redirects, etc.), but for now things seem to be working reasonably well.
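The actual steps aren't spelled out here, so the following is only a guess at the general shape of such a name matching, not the procedure used to build the mapping. It tries each NCBI name (scientific name first, then synonyms), follows Wikipedia redirects, and skips any name shared by more than one NCBI taxon as a possible homonym. The function name, the "Examplea" homonym and the 900001/900002 ids are invented, and treating "Hyla japonica" as a redirect to "Japanese Tree Frog" is an illustrative assumption.

```python
def map_ncbi_to_wikipedia(ncbi_names, wiki_titles, wiki_redirects):
    """Map NCBI tax_ids to Wikipedia page titles by exact name matching.

    ncbi_names:     dict of tax_id -> list of names (scientific name, then synonyms)
    wiki_titles:    set of titles of real (non-redirect) Wikipedia pages
    wiki_redirects: dict of redirect title -> target page title
    """
    # Record which tax_ids use each name, so shared names can be skipped.
    owners = {}
    for tax_id, names in ncbi_names.items():
        for name in names:
            owners.setdefault(name, set()).add(tax_id)

    mapping = {}
    for tax_id, names in ncbi_names.items():
        for name in names:
            if len(owners[name]) > 1:
                continue  # possible homonym: leave for manual checking
            title = wiki_redirects.get(name, name)  # follow a redirect if there is one
            if title in wiki_titles:
                mapping[tax_id] = title
                break
    return mapping


ncbi_names = {
    8353: ["Xenopus"],
    83698: ["Banksia"],
    109175: ["Hyla japonica"],
    900001: ["Examplea"],  # made-up homonym pair
    900002: ["Examplea"],
}
wiki_titles = {"Xenopus", "Banksia", "Japanese Tree Frog", "Examplea"}
wiki_redirects = {"Hyla japonica": "Japanese Tree Frog"}

print(map_ncbi_to_wikipedia(ncbi_names, wiki_titles, wiki_redirects))
```

Skipping homonyms outright is the cautious choice: a missing link is easy to add by hand in the wiki later, whereas a wrong link has to be noticed first.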

The mapping is being created in a Semantic Mediawiki at http://iphylo.org/linkout/. When complete you will be able to look up an NCBI taxon by either its name (including synonyms and common names) or its NCBI Taxonomy ID. Where possible I'm mapping the NCBI taxon to Wikipedia, and providing a snippet of text and an image.

I've also extracted bibliographic information from the citations.dmp file that comes with the NCBI dump. This contains the comments that you sometimes see on a taxon page. In a few cases I've added some information manually. For example, the beetle genus Sphaerius has a rather complicated nomenclatural history, which the NCBI page summarises as:
Due to a recent ruling (ICZN 2000), the family and generic names Sphaeriusidae Erichson, 1845, and Sphaerius Waltl, 1838, are both available names and have priority over Microsporidae Crotch, 1873 for the family name and Microsporus Kolenati, 1846 for the single included genus, respectively.

By looking through BioStor I've found some of the papers relating to this ICZN ruling, and added them to the wiki page http://iphylo.org/linkout/Ncbi:174920 (aficionados of zoological nomenclature may enjoy the complexity of the case, due to homonymy between the corresponding family name, Sphaeriidae, and a mollusc family of the same name).

Once this mapping is complete, it will be time to think of how to get this into NCBI's Linkout, and also how to automatically update the mapping to reflect the growth of both the NCBI taxonomy and Wikipedia. If you visit http://iphylo.org/linkout/ please be aware that the mapping is still being written to the wiki (this is being done via API calls, and adding some 900,000 pages is going to take a while).

NCBI Taxonomy IDs and Wikipedia

Wikipedia-logo-v2-en.png
36388.gif

I've written a note on the Wikipedia Taxobox page making the case for adding NCBI taxonomy IDs to the standard Taxobox used to summarise information about a taxon. Here is what I wrote:

Wikipedia's taxon pages have a huge web presence (see my blog post Google and Wikipedia revisited and Page, R. D. M. (2010). "Wikipedia as an encyclopaedia of life". Nature Precedings hdl:10101/npre.2010.4242.1). If a taxon is in Wikipedia it is almost always the first search result in Google. Researchers in other areas of biology are making use of Wikipedia as a tool to annotate genes (Gene Wiki) and RNA families (Wikipedia:WikiProject_RNA). Pages for genes, such as Cytochrome_b, have numerous external identifiers in their equivalent of the Taxobox (the Pfam_box). I think we are missing a huge opportunity by not including NCBI taxonomy ids. The advantages would be:

  • It would provide a valuable service to Wikipedia readers by enabling them to go to NCBI to discover more about a taxon

  • It would help Wikipedia contributors by providing a standardised way to refer to NCBI (and enable bots to add missing NCBI taxonomy ids). Putting them in an External links section makes it harder to be consistent (there are various ways to write a URL linking to the NCBI taxonomy)

  • It would facilitate linking from NCBI to Wikipedia. A mapping of Wikipedia pages to NCBI taxonomy ids could be added to NCBI Linkout, generating more traffic to the Wikipedia pages

  • Projects that are trying to integrate information from different sources would be able to combine information of genomics from NCBI with other information much more readily

Note that I am not arguing that Wikipedia should "follow" NCBI taxonomy, merely that where the potential to link exists, the links would create value, both within and outside the Wikipedia community.

Some discussion has ensued on the Taxobox page, all positive. I'm blogging this here to encourage anyone who has any more thoughts on the matter to contribute to the discussion.

How Wikipedia can help scope a project

I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotide sequences and other genomics data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?

It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.

In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.

Wikipedia manuscript

npre20104242-1.thumb.png
I've written up some thoughts on Wikipedia for a short invited review to appear (pending review) in Organisms Diversity & Evolution (ISSN 1439-6092). The manuscript, entitled "Wikipedia as an encyclopaedia of life", is available as a preprint from Nature Precedings (hdl:10101/npre.2010.4242.1). The opening paragraph is:
In his 2003 essay E O Wilson outlined his vision for an "encyclopaedia of life" comprising "an electronic page for each species of organism on Earth", each page containing "the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits." Although the "quiet revolution" in biodiversity informatics has generated numerous online resources, including some directly inspired by Wilson's essay (e.g., http://ispecies.org, http://www.eol.org), we are still some way from the goal of having available online all relevant information about a species, such as its taxonomy, evolutionary history, genomics, morphology, ecology, and behaviour. While the biodiversity community has been developing a plethora of databases, some with overlapping goals and duplicated content, Wikipedia has been slowly growing to the point where it now has over 100,000 pages on biological taxa. My goal in this essay is to explore the idea that, largely independent of the efforts of biodiversity informatics and well-funded international efforts, Wikipedia (http://en.wikipedia.org/wiki/Main_Page) has emerged as potentially the best platform for fulfilling E O Wilson's vision.

The content will be familiar to readers of this blog, although the essay is perhaps a slightly more sober assessment of Wikipedia than some of my blog posts would suggest. It was also the first manuscript I'd written in MS Word for a while (not a fun experience), and the first ever for which I'd used Zotero to manage the bibliography (which worked surprisingly well).

EOL, the BBC, and Wikipedia

Last month EOL took the brave step of including Wikipedia content in its pages. I say "brave" because early on EOL was pretty reluctant to embrace Wikipedia on this scale (see the report of the Informatics Advisory Group that I chaired back in 2008), and also because not all of EOL's curators have been thrilled with this development. Partly to assuage their fears, EOL displays Wikipedia-derived content on a yellow background to flag its "unreviewed" status, such as this image of the python genus Leiopython:

Leiopython.png


It's interesting to compare EOL's approach to Wikipedia with that taken by the BBC, as documented in Case Study: Use of Semantic Web Technologies on the BBC Web Sites. The BBC makes extensive use of content from community-driven external sites such as MusicBrainz and Wikipedia. They embed the content in their own pages, stating where the content came from, but not flagging it as any less meaningful or reliable than the BBC's own content (i.e., no garish yellow background).

Furthermore, the BBC does two clever things. Firstly:
To facilitate integration with the resources external to bbc.co.uk the music site reuses MusicBrainz URL slugs and Wildlife Finder Wikipedia URL slugs. This means that it is relatively straight forward to find equivalent concepts on Wikipedia/DBpedia and Wildlife Finder and, MusicBrainz and /music.


This means that if the identifier for the artist Bat for Lashes in Musicbrainz is http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html, the BBC reuse the "slug" 10000730-525f-4ed5-aaa8-92888f060f5f and create a page at http://www.bbc.co.uk/music/artists/10000730-525f-4ed5-aaa8-92888f060f5f. Likewise, if the Wikipedia page for Varanus komodoensis is http://en.wikipedia.org/wiki/Komodo_dragon, then the BBC Wildlife Finder page becomes http://www.bbc.co.uk/nature/species/Komodo_dragon, reusing the slug Komodo_dragon.

komodo.png


Reusing identifiers like this can greatly facilitate linking between databases. I don't need to do a search, or approximate string matching, I just reuse the slug. Note that this is a two-way thing: it is trivial for Musicbrainz to create links to BBC information, and vice versa. Reusing identifiers isn't new; other examples include Amazon.com's ASIN (which for books are ISBNs), and BHL's reuse of uBio NameBankIDs. Want literature that mentions the Komodo dragon? Use the uBio NameBankID 2546401 in a BHL URL: http://www.biodiversitylibrary.org/name/2546401.
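To show how little work slug reuse leaves for the person doing the linking, here is a small sketch that rebuilds the BBC URLs from the Wikipedia and Musicbrainz ones. The function names are invented for illustration; the URL patterns are simply those in the examples above, and may not hold for every page on either site.

```python
from urllib.parse import urlparse


def slug(url):
    """Return the last path segment of a URL, without a trailing '.html'."""
    last = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    if last.endswith(".html"):
        last = last[: -len(".html")]
    return last


def bbc_wildlife_url(wikipedia_url):
    """BBC Wildlife Finder page that reuses a Wikipedia slug."""
    return "http://www.bbc.co.uk/nature/species/" + slug(wikipedia_url)


def bbc_music_url(musicbrainz_url):
    """BBC Music artist page that reuses a Musicbrainz identifier."""
    return "http://www.bbc.co.uk/music/artists/" + slug(musicbrainz_url)


print(bbc_wildlife_url("http://en.wikipedia.org/wiki/Komodo_dragon"))
print(bbc_music_url(
    "http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html"
))
```

No lookup table and no string matching are involved: the identifier is the link.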

The second clever thing the BBC does is treat the web as a content management system:

BBC Music is underpinned by the Musicbrainz music database and Wikipedia, thereby linking out into the Web as well as improving links within the BBC site. BBC Music takes the approach that the Web itself is its content management system. Our editors directly contribute to Musicbrainz and Wikipedia, and BBC Music will show an aggregated view of this information, put in a BBC context.


Instead of separating BBC and Wikipedia content (and putting the latter in quarantine as does EOL), the BBC embraces Wikipedia, editing Wikipedia content if they feel a page needs improving. One advantage of this approach is that it avoids the need for the BBC to replicate Wikipedia, either in terms of content (the BBC doesn't need to write its own descriptions of what an organism does) or services (the BBC doesn't need to develop tools for people to edit the BBC pages, people use Wikipedia's infrastructure for this). Wikipedia provides core text and identifiers, BBC provides its own unique content and branding.

EOL is trying something different, and perhaps more challenging (at least to do it properly). Given that both EOL and Wikipedia offer text about organisms, there is likely to be overlap (and possibly conflict) between what EOL and Wikipedia say about the same taxon. Furthermore, there will be duplication of information such as bibliographic references. For example, the Wikipedia content included in the EOL page for Leiopython contains a bibliography, which includes these references:

Hubrecht AAW. 1879. Notes III on a new genus and species of Pythonidae from Salawatti. Notes from the Leyden Museum 14-15.

Boulenger GA. 1898. An account of the reptiles and batrachians collected by Dr. L. Loria in British New Guinea. Annali del Museo Civico de Storia Naturale di Genova (2) 18:694-710

The genus name Leiopython was published by Hubrecht (1879), and Boulenger (1898) is cited in support of a claim that a distribution record is erroneous. Hence, these look like useful papers to read. Neither reference on the Wikipedia page is linked to an online version of the article, but both have been scanned by EOL's partner BHL (you can see the articles in BioStor here, and here, respectively)1.

Problem is, you'd be hard pressed to discover this from the EOL page. The BHL results do list the journal Notes from the Leyden Museum, but you'd have to visit the links manually to discover whether they include Hubrecht (1879) (they do, as well as various occurrences of Leiopython in the indices for the journal). In part this problem is a consequence of the crude way EOL handles bibliographies retrieved from BHL, but it's symptomatic of a broader problem. By simply mashing EOL and Wikipedia content together, EOL is missing an opportunity to make both itself and Wikipedia more useful. Surely it would be helpful to discover what publications cited on Wikipedia pages are in BHL (or in the list of references for hand-curated EOL pages)? This requires genuine integration (for example by reusing existing bibliographic identifiers such as DOIs, and tools such as OpenURL resolvers). If it fails to do this, EOL will resemble crude pre-Web 2.0 mashups where people created web pages that had content from external sites enclosed in <IFRAME> tags.

The contrast between the approaches adopted by EOL and the BBC is pretty stark. The BBC has devolved text content to external, community-driven sites that it thinks will do a better job than the BBC could alone. EOL is trying to integrate Wikipedia into its own text content, but without addressing the potentially massive duplication (and, indeed, possible contradictions) that are likely to arise. Perhaps it's time for EOL to be as brave as the BBC, and ask itself whether it is sensible for EOL to try and occupy the same space as Wikipedia.

1. Note that the bibliographic details of both papers are wanting: Hubrecht 1879 is in volume 1 of Notes from the Leyden Museum, and Annali del Museo Civico de Storia Naturale di Genova series 2, volume 18 is also treated as volume 38.

Wikipedia and Gregg's paradox

Continuing the theme of taxonomic classification in Wikipedia, I'm perversely delighted that Wikipedia demonstrates Gregg's paradox so nicely.

1s2mges1n3b5q1bnvf5a3i4u8y_2009-05-31.jpg
The late John R. Gregg wrote several papers and a book exploring the logical structure of taxonomy. His 1954 book The language of taxonomy stimulated a debate a decade later in Systematic Zoology concerning what Buck and Hull (1966) (doi:10.2307/2411628) termed "Gregg's Paradox".

Gregg showed that if we (a) treat taxa as sets defined by extension (i.e., by listing all members), and (b) accept that two sets with exactly the same content must be the same set, then many biological classifications violate these premises because the same taxon may be assigned to multiple levels in the Linnean hierarchy. For example, the aardvark, Orycteropus afer, is the only extant species of the genus Orycteropus, which is the only extant member of the family Orycteropodidae, which in turn is the sole extant representative of the order Tubulidentata. Under Gregg's model, Tubulidentata, Orycteropodidae, and Orycteropus are all the same thing as they have exactly the same content (i.e., Orycteropus afer). Put another way, monotypic taxa are redundant and violate basic set theory. Gregg would argue that they should be eliminated.
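The argument is easy to reproduce with actual sets. In this small sketch the classification is reduced to the extant aardvark lineage described above, and the extension of a taxon is computed as the set of species it ultimately contains. The CHILDREN table and the extension function are illustrative names, not part of any existing tool.

```python
# Simplified classification: each taxon maps to its extant child taxa.
CHILDREN = {
    "Tubulidentata": ["Orycteropodidae"],
    "Orycteropodidae": ["Orycteropus"],
    "Orycteropus": ["Orycteropus afer"],
}


def extension(taxon):
    """Return the set of species contained in a taxon (a species contains itself)."""
    kids = CHILDREN.get(taxon)
    if not kids:
        return frozenset({taxon})
    return frozenset().union(*(extension(kid) for kid in kids))


order = extension("Tubulidentata")
family = extension("Orycteropodidae")
genus = extension("Orycteropus")

# Premise (b): sets with the same members are the same set.
print(order == family == genus)
```

Under premises (a) and (b) the three names pick out one and the same set, which is exactly the redundancy Gregg objected to.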

aardvark.png
Wikipedia illustrates this nicely. Wikipedia conforms to Gregg's model in that taxa are defined by extension (each taxon comprises one or more wiki pages), and if taxa have the same content only one taxon (typically that with the lowest taxonomic rank) has a page in Wikipedia. Put another way, if the aardvark is the sole representative of the Tubulidentata, then there is nothing that could be put on the Tubulidentata page that shouldn't also belong on the page for the aardvark. As a result, the page for the aardvark gives a full classification of this animal, but most taxa in the hierarchy don't have their own pages.

Responses

There are several possible responses to Gregg's paradox. One is to argue that taxa should be defined intensionally (i.e., on the basis of their characters), which was Buck and Hull's approach. Essentially, they were arguing that we could (somewhat arbitrarily) specify properties of Orycteropodidae that weren't shared by all Tubulidentata, and hence we are justified in keeping these taxa separate. Gregg himself was less than impressed by this argument (doi:10.2307/2412017).

Another approach is to suggest that we may discover taxa in the future that will, say, be members of Orycteropus but which aren't O. afer, and that the taxa between the rank suborder and species are placeholders for these discoveries. Indeed, in the case of the Tubulidentata there are extinct aardvarks (doi:10.1163/002829675x00137, doi:10.1016/j.crpv.2005.12.016, and doi:10.1111/j.1096-3642.2008.00460.x) that could be added to Wikipedia, thus justifying the creation of pages for the taxa that Gregg would have us eliminate.

Of course, Gregg's paradox is a consequence of having ranks and requiring each rank (or at least a reasonable subset of them) to exist in a classification. If we ignore ranks, then there's no reason to put any taxa between Afrotheria and Orycteropus afer. So, we could drop this requirement for having taxa at each rank or, of course, drop ranks altogether, which is one of the motivations behind phylogenetic classifications (e.g., the phylocode).

Implications for parsing Wikipedia

From a practical point of view, Gregg's paradox means that one has to be careful parsing Wikipedia Taxoboxes. As I've argued earlier, the simplest way to ensure that a classification is a tree is for each taxon to include a unique parent taxon. The simplest way to extract this for a taxon in a Wikipedia page would be to retrieve the taxon immediately above it in the classification (i.e., for Orycteropus afer this would be Orycteropus). But Orycteropus doesn't have a page in Wikipedia (OK, it does, but it's a redirect to the page for the aardvark). So, we have to go up the classification until we hit Afrotheria before we get a taxon page.
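
As a rough sketch of that "walk up until you find a real page" step, the following Python assumes the lineage has already been extracted from the Taxobox (ordered from the taxon upwards) and that we know which names have their own, non-redirect page. The function name and the page flags are illustrative assumptions that simply mirror the aardvark example above; nothing here queries Wikipedia.

```python
def nearest_parent_with_page(lineage, has_own_page):
    """Return the first ancestor in the lineage that has its own page.

    lineage: names ordered from the taxon itself up to higher taxa.
    has_own_page: dict mapping a name to True if it has a non-redirect page.
    """
    for ancestor in lineage[1:]:  # skip the taxon itself
        if has_own_page.get(ancestor, False):
            return ancestor
    return None  # no ancestor in the lineage has a page of its own


# Lineage of the aardvark as discussed in the post (intermediate ranks only).
aardvark_lineage = [
    "Orycteropus afer",
    "Orycteropus",
    "Orycteropodidae",
    "Tubulidentata",
    "Afrotheria",
]

# Assumed flags: redirects to the aardvark page count as "no page of its own".
has_own_page = {
    "Orycteropus afer": True,
    "Orycteropus": False,
    "Orycteropodidae": False,
    "Tubulidentata": False,
    "Afrotheria": True,
}

parent = nearest_parent_with_page(aardvark_lineage, has_own_page)
print(parent)
```

The point of the design is that the "parent" stored for a page is the nearest ancestor that actually exists as a page, not necessarily the taxon one rank up.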

Personally I quite like the fact that a largely forgotten argument from the middle of the last century concerning logic and Linnean taxonomy seems relevant again.

Wikipedia's taxonomic classification is badly broken

Wikipedia is wonderful, but parts of it are horribly broken. Take, for example, taxonomic classifications. A classification is a rooted tree, which means that each node in the tree has a single parent. We can store trees in databases in a variety of ways. For example, for each node we could store a list of its children, or we could store the single unique parent of each node. Ideally we'd choose to store one or other, but not both. If we store both sets of statements (i.e., that node A has node B as one of its children, and that node B's parent is node A) then there is enormous potential for these two statements to get out of sync.
tree.png


This is what has happened in Wikipedia. Each page for a taxon lists the lineage to which it belongs (i.e., its parent, and its parent's parent, and so on), and also lists the children of that node. What this means is that if somebody edits the page for taxon A and adds taxon B as a child, they also need to edit the page for taxon B to make A its parent. If only one of these two edits is made the classification may end up internally inconsistent.

For example, the page for Amphibia lists the classification of Amphibia like this:
a1.png

It also lists the child taxa of Amphibia:
a2.png

So, the children of Amphibia are Temnospondyli, Lepospondyli, and Lissamphibia. Furthermore, Anura, Caudata, and Gymnophiona are children of Lissamphibia:

child.png


Given this, if I go to the pages for Anura, Caudata, and Gymnophiona I should see that each of these taxa lists Lissamphibia as its parent. However, only one of these (Caudata) does: the Anura and Gymnophiona both have Amphibia as their parents, not Lissamphibia.
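
The mismatch described above can be detected mechanically. This is a minimal sketch, assuming the "has child" and "has parent" statements have already been scraped into two dictionaries; the dictionary layout and the function name are assumptions, and the data is just the Amphibia example from this post.

```python
# "has child" statements, as listed on the parent pages.
has_child = {
    "Amphibia": ["Temnospondyli", "Lepospondyli", "Lissamphibia"],
    "Lissamphibia": ["Anura", "Caudata", "Gymnophiona"],
}

# "has parent" statements, as listed on the child pages.
has_parent = {
    "Anura": "Amphibia",
    "Caudata": "Lissamphibia",
    "Gymnophiona": "Amphibia",
}


def find_mismatches(has_child, has_parent):
    """Return (child, claimed_by, stated_parent) where the two views disagree."""
    mismatches = []
    for parent, children in has_child.items():
        for child in children:
            stated = has_parent.get(child)
            # Only compare when the child's page states a parent at all.
            if stated is not None and stated != parent:
                mismatches.append((child, parent, stated))
    return mismatches


for child, claimed_by, stated in find_mismatches(has_child, has_parent):
    print(f"{claimed_by} lists {child} as a child, but {child} says its parent is {stated}")
```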

The diagram below shows the taxa that have Amphibia as their parent:
parent.png


Note that Stegocephalia have now turned up as an additional amphibian order, and that only Caudata is included in Lissamphibia. But what is striking is that another 274 Wikipedia taxon pages have Amphibia as their parent. These pages are all for fossil amphibians that do not fit easily in the existing Wikipedia classification.

From the perspective of building a database, the "has parent" relationship is the one I'd prefer to use, because that statement is going to be made just once (on the page for the taxon of interest). This seems a lot safer than making the statement "has child" on another page (for one thing, more than one page could claim a taxon as their child, which again will break the tree). But if we use the "has parent" relationship, our tree will be very bushy, with lots of fossil amphibian genera attached to the Amphibia node. This is going to make the tree hard to interpret, because this basal bush isn't saying that all these genera radiated off at once, but rather that we don't really know where in the amphibian tree these things go, so we'll have to settle for saying merely "they are amphibians" (for the cladistic theorists among you, this is Nelson and Platnick's "interpretation 2" in their "Multiple Branching in Cladograms: Two Interpretations", doi:10.2307/2412630).
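
To make the "bushy" problem concrete, here is a small sketch that builds the tree from "has parent" statements only and then reports nodes with many direct children. The function names, the threshold, and the two placeholder fossil genera are assumptions for illustration; the real data would have hundreds of such genera attached to Amphibia.

```python
from collections import defaultdict


def children_from_parents(has_parent):
    """Invert child -> parent statements into parent -> list of children."""
    children = defaultdict(list)
    for child, parent in has_parent.items():
        children[parent].append(child)
    return dict(children)


def bushy_nodes(children, threshold):
    """Return nodes with at least `threshold` direct children, with counts."""
    return {node: len(kids) for node, kids in children.items() if len(kids) >= threshold}


# Each taxon states its parent exactly once, so this is a tree by construction.
has_parent = {
    "Anura": "Amphibia",
    "Gymnophiona": "Amphibia",
    "Stegocephalia": "Amphibia",
    "Caudata": "Lissamphibia",
    "Fossil genus A": "Amphibia",  # placeholder name, not a real taxon
    "Fossil genus B": "Amphibia",  # placeholder name, not a real taxon
}

children = children_from_parents(has_parent)
print(bushy_nodes(children, threshold=3))
```

Because a dictionary key can hold only one value, each taxon has exactly one parent here, which is the property that makes the "has parent" view safe to build a tree from.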

So, the dilemma is whether to use "has child" relationships, and accept that these are likely to be inconsistent with the inverse "has parent" relationship, or use the "has parent" relationship, which will be internally consistent, but at the cost of potentially very large, unresolved bushes due to fossil taxa of uncertain affinities.

When taxonomists wage war in Wikipedia

Stumbled across Alex Wild's post Pyramica vs Strumigenys: why does it matter?, which takes as its starting point a minor edit war on the Wikipedia page for Pyramica.

Alex gives the background to the argument about whether Pyramica is a synonym of Strumigenys, and investigates the issue using the surprisingly small amount of data available in GenBank. The tree he found (shown below) suggests this issue will require some work to resolve:

phylogeny1.jpg


For fun I constructed a history flow diagram for the edits to the Pyramica page in Wikipedia:

5.png


The diagram shows the two occasions when the page has been stripped of content (and subsequently restored) as contributors dispute whether Pyramica is a synonym of Strumigenys. It would be useful to have one or more metrics of how controversial a page (and/or a contributor) was, both to identify controversial pages, and to see how controversial taxonomic pages were compared to other Wikipedia topics. The paper On Ranking Controversies in Wikipedia: Models and Evaluation by Ba-Quy Vuong et al. (doi:10.1145/1341531.1341556) would be a good place to start (a video of the presentation of this paper is available here).
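
One very crude metric, offered only as a sketch and not as any of the models in the Vuong et al. paper, is to count how often a revision restores the exact text of an earlier revision. The function name and the toy revision texts below are assumptions; real input would be the page's revision history in chronological order.

```python
import hashlib


def count_reverts(revision_texts):
    """Count revisions whose text exactly matches an earlier, non-adjacent state."""
    seen = set()
    previous = None
    reverts = 0
    for text in revision_texts:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # A repeat of the immediately preceding text is a null edit, not a revert.
        if digest in seen and digest != previous:
            reverts += 1
        seen.add(digest)
        previous = digest
    return reverts


# Toy history: content is removed and restored twice.
revisions = [
    "stub",
    "full article",
    "",                    # page stripped of content
    "full article",        # restored
    "full article plus",
    "redirect",            # content replaced
    "full article plus",   # restored again
]

print(count_reverts(revisions))
```

A count like this ignores partial reverts and who made them, so it would at best be a starting point for comparing pages.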

Gene Wiki and Google

Andrew Su has posted an analysis of Gene Wiki, a project to provide Wikipedia pages on every human gene:
Here's the take home message: in terms of online gene annotation resources, Gene Cards is the most common top-ranked resource, followed closely by the Gene Wiki / Wikipedia, with NCBI in a very distant third (note the log scale).
top_sites.png

This result is interesting in that an existing resource (Gene Cards) beats Wikipedia, but only just. There are various ways we could interpret this, but from the point of view of biodiversity resources I suspect it emphasises that if there is a good, existing resource that has a lot of traction (i.e., Gene Cards) it will do well in Google Searches. If there is no single dominant resource (as is the case for biodiversity), then it leaves the field open to be dominated by Wikipedia.

Visualising edit history of a Wikipedia page

Quick post (really should be doing something else). Reading Jeff Atwood's post Mixing Oil and Water: Authorship in a Wiki World led me to IBM's wonderful history flow tool to visualise the edit history of a Wikipedia page.

Imagine a scenario where three people will make contributions to a Wiki page at different points in time. Each person edits the page and then saves their changes to what becomes the latest version of that page.

history-flow-animation.gif

History Flow connects text that has been kept the same between consecutive versions. Pieces of text that do not have correspondence in the next (or previous) version are not connected and the user sees a resulting "gap" in the visualization; this happens for deletions and insertions. (animated GIF from Jeff Atwood's post).


There's a nice paper describing history flow (doi:10.1145/985692.985765, free PDF here). Inspired by this I decided to try and implement history flow in PHP and SVG. Here's a preliminary result:

afrotheria.png

This is the edit history for the Afrotheria page. Click on the image above (or here) to see the SVG image. You need a decent web browser for this; IE users will need an SVG plugin.

The SVG image is clickable. The columns represent revisions, click on those to go to that revision. The columns are evenly spaced (i.e., the gaps don't correspond to time). The bands between revisions trace individual blocks of text (in this case lines in the Wikipedia page source). If you click on a band you get taken to that Wikipedia user's page.
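
The core of the band-tracing step is finding which lines survive unchanged from one revision to the next. The implementation described here was written in PHP and SVG; the following is a separate Python sketch of just the matching step, using the standard library's difflib rather than whatever the original code used. The function name and the sample lines are assumptions.

```python
from difflib import SequenceMatcher


def matching_line_blocks(old_lines, new_lines):
    """Return (old_index, new_index, length) for runs of lines kept unchanged."""
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, which we drop.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


# Two consecutive revisions, split into lines of page source (made-up content).
old_revision = ["{{Taxobox}}", "Afrotheria is a clade.", "See also"]
new_revision = ["{{Taxobox}}", "New intro line.", "Afrotheria is a clade.", "See also"]

for old_i, new_i, size in matching_line_blocks(old_revision, new_revision):
    print(f"lines {old_i}..{old_i + size - 1} -> {new_i}..{new_i + size - 1}")
```

Each matching block would become one band between two adjacent columns; lines in the new revision that fall outside every block are insertions, and lines in the old revision outside every block are deletions.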

This is all done in a rush, but it gives an idea of what can be done. The history flow carries all sorts of information about how an article has developed over time, major changes (such as the introduction of Taxoboxes), and makes the content of a page traceable, in the sense that you can see who contributed what to a page.

Google and Wikipedia revisited

One response to my post on Fungi in Wikipedia was to say that fungi are also charismatic, so maybe I should try [insert unsexy taxon name here]. So, I've now looked at all the species I extracted from Wikipedia (nearly 72,000), run the Google searches, and here are the results:

Site                              How many times is it the top hit?
en.wikipedia.org                  42515
www.birdlife.org                  2125
commons.wikimedia.org             1522
plants.usda.gov                   1496
species.wikimedia.org             1487
animaldiversity.ummz.umich.edu    1419
amphibiaweb.org                   851
www.calflora.org                  770
www.fishbase.org                  727
ibc.lynxeds.com                   699
davesgarden.com                   659
www.arkive.org                    510
ukmoths.org.uk                    414
zipcodezoo.com                    368
www.itis.gov                      304
calphotos.berkeley.edu            294
www.floridata.com                 234
www.planetcatfish.com             234
www.eol.org                       226
www.arthurgrosset.com             213


The table lists the top twenty sites, based on the number of times each site occupies the number one place in the Google search results. Surprise, surprise, Wikipedia wins hands down.
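
A tally like this is straightforward once the top hit for each species name has been collected. The sketch below assumes a list holding the URL of the first search result for each species; how those URLs are obtained is not shown, and the function name and example URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_top_hit_sites(top_hit_urls, n=20):
    """Count how often each host is the number one hit; return the n most common."""
    hosts = Counter(urlparse(url).netloc.lower() for url in top_hit_urls)
    return hosts.most_common(n)


# One URL per species searched (made-up examples).
top_hits = [
    "http://en.wikipedia.org/wiki/Aardvark",
    "http://www.fishbase.org/example-page",
    "http://en.wikipedia.org/wiki/Afrotheria",
]

for host, count in rank_top_hit_sites(top_hits):
    print(host, count)
```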

What is interesting is that the other top-ranking sites tend to be taxon-specific, such as FishBase, Amphibia Web, and USDA Plants. To me this suggests that the argument that Wikipedia's dominance of the search results is because it focusses on charismatic taxa doesn't hold. In fact, the truly charismatic taxa are likely to have their own, richly informative web sites that will often beat Wikipedia in the search rankings. If your taxon is not charismatic, then it's a different story. This suggests one of two strategies for making taxon web sites that people will find. Either go for the niche market, and make a rich site for a set of taxa that you (and ideally some others) like, or add content to Wikipedia. Sites that span across all taxa will always come up against Wikipedia's dominance in the search rankings. So, it's a choice of being a specialist, or trying to compete with an über-generalist.