Search this keyword

Why metadata matters

Quick note to express the frustration I experience sometimes when dealing with taxonomic literature. As part of a frankly Quixotic desire to link every article cited in the Australian Faunal Directory (AFD) to the equivalent online resource (for example, in the Biodiversity Heritage Library using BioStor, or to a publisher web site using a DOI) I sometimes come across references that I should be able to find yet can't. Often it turns out that the metadata for the article is incorrect. For example, take this reference:
Report upon the Stomatopod crustaceans obtained by P.W. Basset-Smith Esq., surgeon R.N. during the cruise, in the Australia and China Sea, of H.M.S. "Penguin", commander W.V. Moore. Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479 pl. 20B
which is in the Australian Faunal Directory (urn:lsid:biodiversity.org.au:afd.publication:087892ae-2134-4bb4-83ae-8b8cbd15b299). Using my OpenURL resolver in BioStor I failed to locate this article. Sometimes this is because the code I used to parse references from AFD mangles the reference, but not in this case. So, I Google the title and find a page in the Zoological catalogue of Australia: Aplacophora, Polyplacophora, Scaphopoda:


Here's the relevant part of this page:
Zoocat
Same as AFD, Ann. Mag. Nat. Hist. volume 6, pages 473-479, 1893.

In despair I looked at the BHL page for The Annals and Magazine of Natural History and discover that there is no volume 6 published in 1893. There is, however, series 6. Oops! Browsing the BHL content I discover the start of the article I'm looking for on BHL page 27734740 , volume 11 of series 6 of The Annals and Magazine of Natural History. Gotcha! So, I can now link AFD to BHL like this.

I should stress that in general AFD is an great resource for someone like me trying to link names to literature and, to be fair, with its reuse of volume numbers across series The Annals and Magazine of Natural History can be a challenge to cite. Usually the bibliographic details in AFD are accurate enough to locate articles in BHL or CrossRef, but every so often references get mangled, misinterpreted, or someone couldn't resist adding a few "helpful" notes to a field in the database, resulting in my parser failing. What is slightly alarming is how often when I Google for the reference I find the same, erroneous metadata repeated across several articles. This, coupled with the inevitable citation mutations can make life a little tricky. The bulk of the links I'm making are constructed automatically, but there are a few cases where one is lead on a wild goose chase to find the actual reference.

Although this is an example of why it matters to have accurate metadata, it can also be seen as an argument for using identifiers rather than metadata. If these references had stable, persistent identifiers (such as DOIs) that taxonomic databases cited, then we wouldn't need detailed metadata, and we could avoid the pain of rummaging around in digital archives trying to make sense of what the author meant to cite. Until taxonomic databases routinely use identifiers for literature, names and literature will be as ships that pass in the night.

Why is the Atlas of Living Australia is invisible to Google?

Jeff Atwood, one of the co-founders of Stack Overflow recently wrote a blog post Trouble In the House of Google, where he noted that several sites that scrape Stack Overflow content (which Stack Overflow's CC-BY-SA license permits) appear higher in Google's search rankings than the original Stack Overflow pages. When Stack Overflow chose the CC-BY-SA license they made the assumption that:
...that we, as the canonical source for the original questions and answers, would always rank first...That's why Joel Spolsky and I were confident in sharing content back to the community with almost no reservations – because Google mercilessly penalizes sites that attempt to game the system by unfairly profiting on copied content.
Jeff Atwood's post goes on to argue that something is wrong with the way Google is ranking sites that derive content from other sites.

I was reminded of this post when I started to notice that searches for fairly obscure Australian animals would often return my own web site Australian Faunal Directory on CouchDB as the first hit. In one sense this is personally gratifying, but it can also be frustrating because when I Google these obscure taxa it's usually because I'm trying to find data that isn't already in one of my projects.

unotata.pic1.JPGBut what I've also noticed is that the site that I obtained the data from, Australian Faunal Directory (AFD), rarely appears in the Google search results. In fact, there are taxa for which Google doesn't find the corresponding page in AFD. For example, if you search for Uxantis notata (shown here in an image from the Key to the planthoppers of Australia and New Zealand) the first hit(s) are from my version of AFD:
Snapshot 2011-02-06 14-05-44.png


Neither the original AFD, nor the Atlas of Living Australia (ALA), which also builds on AFD, appear in the top 10 hits.

Initially I though this is probably an artefact. This is a pretty obscure taxon, maybe things like rounding error in computing PageRank are going to affect search rankings more than anything else. However, if I explicitly tell Google to search for Uxantis notata in the domain environment.gov.au I get no hits whatsoever:

Snapshot 2011-02-06 14-10-32.png

Likewise, the same search restricted to ala.org.au finds nothing, nothing at all. Both AFD and Atlas of Living Australia have pages for this taxon, here, and here, so clearly something is deeply wrong.

Why are the original providers of the data not appearing in Google search results at all? For someone like me who argues that sharing data is a good thing, and sites that aggregate and repurpose data will ultimately benefit the original data providers (for example by sending traffic and Google Juice) this is somewhat worrying. It seems to reinforce the fear that many data providers have: "if I share my data someone will make a better web site than mine and people will go to that web site, rather than the one I've created with my hard-won data." It may well be that data aggregators will score higher than data providers in Google searches, but I hadn't expected data providers to be virtually invisible.

atlasaustraliasm.gifGoogle isn't the problem
If a web site that I hacked together in a few days does better in Google searches than the rather richer pages published by sites such as ALA (with a budget of over $AU 30 million), something is wrong. Unlike the Stack Overflow example discussed above, I don't think the problem here is with Google.
If we search in Google for an "iconic" Australian taxon by name, say the Koala Phascolarctos cinereus, Wikipedia is the first hit (which should be no surprise). ALA doesn't appear in the top ten. If we tell Google to just search the domain ala.org.au we get lots of pages from ALA, but not the actual species page for Phascolarctos cinereus. This suggests that there is something about the way ALA's website works that prevents Google indexing it properly. I'm also a little worried that a major biodiversity project which has as its aim
...to improve access to essential information on Australia’s biodiversity
is effectively invisible to Google.



Web Hooks and OpenURL: the screencast

Yesterday I posted notes on Web Hooks and OpenURL. That post was written when I was already late (you know, when you say to yourself "yeah, I've got time, it'll just take 5 minutes to finish this..."). The Web Hooks + OpenURL project is still very much a work in progress, but I thought a screen cast would help explain why I think this is going to make my life a lot easier. It shows an example where I look at a bibliographic record in one database (AFD, the Australian Faunal Directory on CouchDB), click on a link that takes me to BioStor — where I can find the reference in BHL — then simply click on a button on the BioStor page to "automagically" update the AFD database. The "magic" is the Web Hook. The link I click on in the AFD database contains the identifier for that entry in the AFD, as well a a URL BioStor can call when it's found the reference (that URL is the "web hook").

Using Web Hooks and OpenURL from Roderic Page on Vimeo.