Fungi in Wikipedia

One response to the analysis I did of the Google rank of mammal pages in Wikipedia is to suggest that Wikipedia does well for mammals because these are charismatic. It's been suggested that for other groups of taxa Wikipedia might not be so prominent in the search results.

As a quick test I extracted the 1552 fungal species I could find in Wikipedia and repeated the analysis. If anything, the results are more dramatic:
Untitled Image.png


Once again, Wikipedia dominates the search rankings. Over 75% of the pages are the top hit in Google. More specialist fungal sites, such as CAB Abstracts Plus and the American Phytopathological Society's online database, do pretty well. EOL and the nomenclatural database Index Fungorum barely make an appearance.

If fungi are less "charismatic" than mammals, the implication is that the less charismatic the taxon, the better Wikipedia does (perhaps there is less competition from other sites). Of course, Wikipedia is severely underpopulated with fungal pages, so one could argue that for fungi not in Wikipedia, sites like EOL may do better (relative to other sites), but that would need to be tested. I suspect that sites that provide more broadly useful information (such as APSnet) will continue to dominate the search rankings, followed by scientific articles (for the fungi in Wikipedia, the publishers Springer, Wiley, and Elsevier all rank among the most frequent sites in the Google results).

Wikipedia mammals and the power law

Playing a bit more with the Wikipedia mammal data, there are some interesting patterns to note. The first is that if we rank the mammal pages by size (here defined as the number of characters in the source for the page) and plot size against rank, then we get a graph that looks very much like a power law:
pow1.png

There are a few large pages on mammals (these are on the left), and lots of small pages (the long tail on the right). If we do a log-log plot we get this:
pow2.png

The straight line is characteristic of a power law. The dip at the far right reflects the fact that Wikipedia pages have a minimum size (for example, they must include a Taxobox). Now, this is a bit crude (I should probably look at "Power-law distributions in empirical data" arXiv:0706.1062v2 before getting too carried away), but power laws are characteristic of the link structure of the web (a few big sites with huge numbers of links, huge numbers of sites with few links), and indeed of at least parts of Wikipedia, such as the Gene Wiki project (see doi:10.1371/journal.pbio.0060175).
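As a minimal sketch of the rank-size check described above (not the code used to draw these plots), the following Python sorts page sizes from largest to smallest and fits a least-squares line to log(size) against log(rank). The function name `rank_size_slope` and the synthetic sizes are assumptions made for illustration. A roughly constant slope on the log-log scale is what a power law would produce, though this is exactly the kind of crude check that the arXiv paper cited above cautions about.

```python
import math

def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank).

    Sizes are ranked from largest (rank 1) to smallest.
    All sizes must be positive.
    """
    ordered = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data (an assumption, not the Wikipedia data): size = 100000 / rank,
# i.e. an exact power law with exponent -1.
example_sizes = [100000 / rank for rank in range(1, 201)]
slope = rank_size_slope(example_sizes)
print(round(slope, 3))
```

With real page sizes the minimum-size floor mentioned above would bend the right-hand end of the line, so the fitted slope would only describe the bulk of the distribution.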

In this context, the diagrams are showing that even if mammals are "charismatic megafauna", most of them aren't that charismatic. Wikipedia mammal pages are mostly small. This raises the question of whether the high frequency with which Wikipedia mammal pages appeared at the top of Google searches might be attributed to those large pages on (presumably) charismatic mammals. If this were the case, then we'd expect that small pages wouldn't rank highly in Google searches. So, I plotted page size against Google search rank for the Wikipedia mammal pages:
sizexrank.png

This is a box plot, where the grey boxes represent 50% of the distribution of page size (the horizontal black line is the median), and extreme values are shown as circles. Note that "0" is the highest rank (i.e., the first hit in Google), and 9 is the lowest.
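The numbers behind a box plot like this one can be sketched in a few lines of Python: group page sizes by search rank, then take the quartiles and median of each group. The function name `size_summary_by_rank` and the sample data are hypothetical, and the quartile method used here (`method="inclusive"` in the standard library) is an assumption that may differ slightly from whatever plotting package produced the figure.

```python
from collections import defaultdict
from statistics import median, quantiles

def size_summary_by_rank(pages):
    """Return {rank: (q1, median, q3)} for (size, rank) pairs."""
    sizes_by_rank = defaultdict(list)
    for size, rank in pages:
        sizes_by_rank[rank].append(size)

    summary = {}
    for rank, sizes in sorted(sizes_by_rank.items()):
        if len(sizes) >= 2:
            # quantiles() needs at least two data points.
            q1, _, q3 = quantiles(sizes, n=4, method="inclusive")
        else:
            q1 = q3 = sizes[0]
        summary[rank] = (q1, median(sizes), q3)
    return summary

# Made-up (page size in characters, Google rank) pairs; rank 0 is the top hit.
example_pages = [
    (1000, 0), (2000, 0), (3000, 0), (50000, 0),
    (800, 9), (1200, 9),
]
summary = size_summary_by_rank(example_pages)
for rank, (q1, med, q3) in summary.items():
    print(rank, q1, med, q3)
```

The grey box for each rank corresponds to the span from `q1` to `q3`, and the black line to the median.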

While, not surprisingly, most large Wikipedia pages do well in Google searches, and large pages are rarely low down the rankings, my sense is that small pages can have any rank, from top (0) to bottom (9). If page size (a crude measure of the effort put into editing a Wikipedia page) is a measure of "charisma" (contributors are more likely to edit pages on animals that lots of people know about), then charisma isn't a great predictor of where you come in Google's search results. It's not about size, it's about being in Wikipedia.

Google, Wikipedia, and EOL

One assumption I've been making so far is that when people search for information on an organism using its scientific name, Wikipedia will dominate the search results (see my earlier post for an example of this assumption). I've decided to quantify this by doing a little experiment. I grabbed the Mammal Species of the World taxonomy and extracted the 5416 species names. I then used Google's AJAX search API to look up each name in Google. For each search I took the top 10 hits and recorded for each hit the site URL and the rank in the search results (i.e., 1-10). Below is a table of how many mammal species had a hit in the top 10 Google results (showing just the top 20 most frequent sites).
Site                              Hits
en.wikipedia.org                  5266
species.wikimedia.org             2934
animaldiversity.ummz.umich.edu    2890
commons.wikimedia.org             1515
www.itis.gov                      1418
ctd.mdibl.org                     1288
www.bioone.org                    1101
www.uniprot.org                   1086
encyclopedia.farlex.com           1007
www.thewebsiteofeverything.com     955
www.answers.com                    864
vertebrates.si.edu                 854
www.interaktv.com                  842
www.arkive.org                     775
linkinghub.elsevier.com            727
www.springerlink.com               656
www.eol.org                        618
www.reference.com                  576
doi.wiley.com                      572
noctilio.com                       566


Wikipedia is the clear winner, with 5266 (97%) of mammals having a Wikipedia page in the top ten Google results. Next comes Wikispecies, then Animal Diversity Web, Wikimedia Commons, ITIS, the Comparative Toxicogenomics Database, BioOne, UniProt (derived from the NCBI taxonomy), and so on. Note that the Encyclopedia of Life comes in 17th.
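The tally behind a table like the one above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the experiment: it assumes the top hits for each species name have already been collected into a dictionary (here called `results_by_species`, a hypothetical name), and it makes no calls to any search API. The species names and URLs in the example are made up.

```python
from collections import Counter
from urllib.parse import urlparse

def count_site_hits(results_by_species):
    """Count, per host, how many species have at least one top-10 hit there."""
    site_hits = Counter()
    for urls in results_by_species.values():
        # A set, so a site appearing twice for one species is counted once.
        hosts = {urlparse(url).netloc for url in urls[:10]}
        site_hits.update(hosts)
    return site_hits

# Hypothetical search results: species name -> ordered list of result URLs.
example_results = {
    "Species one": [
        "http://en.wikipedia.org/wiki/Species_one",
        "http://www.itis.gov/example?id=1",
        "http://en.wikipedia.org/wiki/Genus_one",
    ],
    "Species two": [
        "http://www.eol.org/pages/2",
        "http://en.wikipedia.org/wiki/Species_two",
    ],
    "Species three": [
        "http://www.itis.gov/example?id=3",
    ],
}

site_hits = count_site_hits(example_results)
total = len(example_results)
for host, hits in site_hits.most_common():
    print(f"{host}\t{hits}\t{100 * hits / total:.0f}%")
```

Dividing each count by the number of species searched gives the percentages quoted in the text, for example 5266 / 5416 for Wikipedia.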

Things get more interesting if we look at the ranking of search results. The graph below plots the cumulative rank of search results for some of the web sites listed above.
ranks.png

Wikipedia dominates things. For 48% of all mammal species Wikipedia is the first result returned by Google. For just under three quarters of all mammal species, the Wikipedia page is either the first or second hit in Google. The next best sites are Animal Diversity Web and Wikispecies, which get a small share of first place for some species (19% and 7% respectively). Note that EOL pages manage to make it into the top 10 for only 11% of all mammal species.
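A cumulative rank curve of this kind can be sketched in Python as below. The assumption is that, for one site, we have the best (lowest-numbered) rank at which it appeared for each species where it appeared at all; the function name `cumulative_rank_share` and the sample numbers are hypothetical.

```python
def cumulative_rank_share(best_ranks, total_species, max_rank=10):
    """Fraction of all species for which a site appears at rank k or better.

    best_ranks holds one entry per species in which the site appeared,
    using ranks 1..max_rank. Species where the site never appeared are
    counted only through total_species.
    """
    counts = [0] * max_rank
    for rank in best_ranks:
        if 1 <= rank <= max_rank:
            counts[rank - 1] += 1

    shares = []
    running = 0
    for count in counts:
        running += count
        shares.append(running / total_species)
    return shares

# Made-up example: a site seen for 5 of 10 species, at these best ranks.
example_shares = cumulative_rank_share([1, 1, 2, 3, 10], total_species=10)
for k, share in enumerate(example_shares, start=1):
    print(k, f"{share:.0%}")
```

The first value of the returned list is the share of species for which the site is the top hit, and the last value is the share for which it appears anywhere in the top 10.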

What does this all mean? Well, it seems clear that if people are using Google to find information about an organism, then Wikipedia is more likely than anything else to be the first result they see. It is also interesting that for all the energy (and funds) being expended on biodiversity databases (doi:10.1126/science.324_1632), ITIS is the only classical biodiversity database that routinely gets found in these searches (albeit in only a quarter of the searches).

I know I tend to go on a bit about EOL, but if I was running (or funding) EOL, I'd be worried. EOL barely figures in these search results, and is being taken to the cleaners by a volunteer effort (Wikipedia). Furthermore, it seems difficult to envisage what EOL can do to improve things. Sure it can link to (and make use of) content in sites such as Animal Diversity Web, ITIS (and maybe even, gasp, Wikipedia), but that just adds "link love" to those sites. Ironically, perhaps the single thing that would improve EOL's ranking would be if Wikipedia spread some of its link love over EOL, by linking all its taxon pages to the corresponding EOL page.

But there are bigger issues at stake. Site popularity on the web tends to follow a power law, where a very few web sites grab the vast majority of eyeballs. In an old blog post Clay Shirky wrote:

Now, thanks to a series of breakthroughs in network theory by researchers ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options. We also know that as the number of options rise, the curve becomes more extreme. This is a counter-intuitive finding - most of us would expect a rising number of choices to flatten the curve, but in fact, increasing the size of the system increases the gap between the #1 spot and the median spot.


So, creating new and improved biodiversity web sites is likely to have the effect of only increasing the gap between Wikipedia and the rest.

Lastly, as I've mentioned before regarding Wikipedia and citations of taxonomic work, the graph above suggests to me that for anybody wanting to make basic biodiversity information available on the web, and attract readers to basic taxonomic literature, there really is only one game in town.