I think I now "get" the Encyclopedia of Life

The Encyclopedia of Life (EOL) has been relaunched, with a new look and much social media funkiness. I've been something of an EOL sceptic, but looking at the new site I think I can see what EOL is for. Ironically, it's not really about E. O. Wilson's original vision (doi:10.1016/S0169-5347(02)00040-X):
Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linking to other data bases, such as ARKive, Ecoport, GenBank and MORPHOBANK. It comprises a summary of everything known about the species’ genome, proteome, geographical distribution, phylogenetic position, habitat, ecological relationships and, not least, its practical importance for humanity.
We still lack a decent database that does this. EOL tries, but in my opinion still falls short, partly because it isn't nearly aggressive enough in harvesting and linking data (links to the primary literature, anyone?), and partly because it has absolutely no notion of phylogenetics.

In terms of doing science I don't see much that I'd want to do with EOL, as opposed, say, to Wikipedia or existing taxonomic databases. But thinking about other applications, EOL has a lot of potential. One nice feature is the ability to make "collections". For example, Cyndy Parr has created a collection called Fascinating textures, which is simply a collection of images in EOL (I've included some below):

Textures
What is nice about this is that it cuts across any existing classification and assembles a set of taxa that share nothing other than having "fascinating textures". This ability to tag taxa means we could create all sorts of interesting sets of taxa based on criteria that are meaningful in a particular context. For example, egotist that I am, I created a collection called Taxa described by Roderic Page, which includes the one crab and 6 bopyrid isopods that I described in the 1980s.

Putting on my teaching hat, I'm involved in teaching a course on animal diversity and could imagine assembling collections of taxa relevant to a particular lecture (either taxonomically, or based on some other criteria, such as all parasites of a particular taxon, or all organisms found associated with deep sea vents). Other collections could be built by people or organisations with content: for example, lists of top ten new species, lists of species for which the BBC has content, etc.

In this sense, EOL becomes a tagging service for life, a bit like Delicious. The social network side of things is still a little clunky (there doesn't seem to be a notion of "contacts" or "friends", and it needs integration with existing social networks), but I think I now "get" what EOL is for.

Phantom articles: why Mendeley needs to make duplication transparent

Browsing Mendeley I found the following record: http://www.mendeley.com/research/description-larva/. This URL is for a paper
Costa, J. M., & Santos, T. C. (2008). Description of the larva of. Zootaxa, 99(2), 129-131
which apparently has the DOI doi:10.1645/GE-2580.1. This is strange because Zootaxa doesn't have DOIs. The DOI given resolves to a paper in the Journal of Parasitology:
Harriman, V. B., Galloway, T. D., Alisauskas, R. T., & Wobeser, G. A. (2011). Description of the larva of Ceratophyllus vagabundus vagabundus (Siphonaptera: Ceratophyllidae) from nests of Rossʼs and lesser snow geese in Nunavut, Canada. The Journal of parasitology, 93(2), 197-200
Now, this paper has its own record in Mendeley.

OK, so this is weird..., but it gets weirder. If you look at the Mendeley page for this chimeric article there is a PDF preview of yet another article:
LOPES, Maria José Nascimento; FROEHLICH, Claudio Gilberto and DOMINGUEZ, Eduardo (2003). Description of the larva of Thraulodes schlingeri (Ephemeroptera, Leptophlebiidae). Iheringia, Sér. Zool. 92(2), 197-200 2003 doi:10.1590/S0073-47212003000200011
Mendeley duplicate

But it gets even more interesting. The abstract for the phantom Zootaxa article belongs to yet another paper:
Marques, K. I. D. S., & Xerez, R. D. Description of the larva of Popanomyia kerteszi James & Woodley (Diptera: Stratiomyidae) and identification key to immature stages of Pachygastrinae. Neotropical Entomology, 38(5), 643-648.
which also exists in Mendeley.

To investigate further I used Mendeley's API to retrieve this record. To do this I had to look at the source of the web page to find the internal identifier used by Mendeley, namely 010c48d0-edb5-11df-99a6-0024e8453de6 (why does Mendeley hide these?). Here's the abbreviated JSON for this record.

{
    ...
    "website": "http:\/\/www.ncbi.nlm.nih.gov\/pubmed\/21506868",
    "identifiers": {
        "pmid": "21506868",
        "issn": "19372345",
        "doi": "10.1645\/GE-2580.1"
    },
    ...
    "issue": "2",
    "pages": "129-131",
    "public_file_hash": "fe7eed3f6c43a3be1480a0937229b9ad33666df4",
    "publication_outlet": "Zootaxa",
    "type": "Journal Article",
    "mendeley_url": "http:\/\/www.mendeley.com\/research\/description-larva\/",
    "uuid": "010c48d0-edb5-11df-99a6-0024e8453de6",
    "authors": [
        {
            "forename": "J M",
            "surname": "Costa"
        },
        {
            "forename": "T C",
            "surname": "Santos"
        }
    ],
    "title": "Description of the larva of",
    "volume": "99",
    "year": 2008,
    "categories": [
        39,
        203,
        37,
        52,
        43,
        40,
        210
    ],
    "oa_journal": false
}

This doesn't add much to the story, but it does give us the SHA-1 hash of the PDF for the chimeric article (fe7eed3f6c43a3be1480a0937229b9ad33666df4). If I download the PDF for the article in Iheringia, Sér. Zool. it has the same SHA-1:


openssl sha1 a11v93n2.pdf
SHA1(a11v93n2.pdf)= fe7eed3f6c43a3be1480a0937229b9ad33666df4
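
The same check can be scripted. Here is a minimal Python sketch, assuming the PDF has been saved locally and the Mendeley record has already been parsed into a dictionary; the function names are mine, not part of any Mendeley tool.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, record):
    """True if the file's SHA-1 equals the record's "public_file_hash"."""
    expected = record.get("public_file_hash", "").lower()
    return sha1_of_file(path) == expected
```

Calling matches_record("a11v93n2.pdf", record) with the JSON above loaded into record should agree with the openssl output, provided the downloaded file is byte-for-byte the one Mendeley hashed.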

This article doesn't exist
So, to summarise, this paper doesn't exist. It is credited to a journal that doesn't have DOIs, the DOI resolves to an article in a different journal, the abstract comes from another article in another journal, and the PDF is from a third article. OMG!

This is just weird
So, something about the way Mendeley merges references is broken. Merging references is a tough problem, so there will always be cases where things go wrong. But it would be really, really helpful if Mendeley could display the set of articles that it has merged to create each canonical reference (say by listing the UUIDs for each article). Users could then see if badness had happened, and provide feedback, for example by highlighting references that are clearly the same, and those that are clearly different. Until this happens I'm a bit nervous about trusting Mendeley with my bibliographic data, because I don't want it mangled into chimeric papers that don't exist.
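
To make the idea concrete, here is a rough Python sketch of what a user could do if the merged source records were visible. Everything here is hypothetical: Mendeley doesn't expose these records, the UUIDs are placeholders, and the field names are simply the ones I'd want to compare.

```python
from collections import defaultdict


def conflicting_fields(sources, fields=("title", "journal", "doi")):
    """Return the fields on which the merged source records disagree.

    For each such field, map every distinct (lower-cased) value to the
    UUIDs of the source records that supplied it.
    """
    conflicts = {}
    for field in fields:
        seen = defaultdict(list)
        for record in sources:
            value = record.get(field)
            if value:
                seen[value.strip().lower()].append(record["uuid"])
        if len(seen) > 1:
            conflicts[field] = dict(seen)
    return conflicts


# Placeholder UUIDs; journal names and DOI are taken from the records above.
merged_from = [
    {"uuid": "source-1", "title": "Description of the larva of",
     "journal": "Zootaxa"},
    {"uuid": "source-2", "title": "Description of the larva of",
     "journal": "The Journal of parasitology", "doi": "10.1645/GE-2580.1"},
]

for field, values in conflicting_fields(merged_from).items():
    print(field, values)
```

A truncated title shared by two different journals is exactly the kind of thing that would show up here, and it would give users something specific to flag.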

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, and there might be errors in the text itself (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.
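
As an illustration of the regular expression approach, here is a minimal Python sketch for a single citation style, the one Mendeley uses in the records quoted earlier. The pattern is an assumption about that one style, not the expressions BioStor actually uses, and it will fail on anything that departs from it.

```python
import re

# Matches: Authors (Year). Title. Journal, Volume(Issue), Start-End
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>[^)]+)\))?,\s+"
    r"(?P<pages>\d+\s*[-\u2013]\s*\d+)\.?$"
)


def parse_citation(text):
    """Return a dict of citation parts, or None if the style doesn't match."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


parsed = parse_citation(
    "Costa, J. M., & Santos, T. C. (2008). Description of the larva of. "
    "Zootaxa, 99(2), 129-131"
)
print(parsed)
```

Every additional citation style needs its own pattern, which is where the approach starts to creak.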

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.
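
The three steps above can be sketched in a few lines of Python. This is a toy under stated assumptions: the three style templates are made up for illustration, the "database" is two references quoted earlier in this post (with the Lopes et al. author string abbreviated by me), the 0.9 threshold is arbitrary, and difflib's SequenceMatcher stands in for whatever approximate matching method you prefer.

```python
from difflib import SequenceMatcher

# Hypothetical citation styles; real ones would mirror the target literature.
STYLES = (
    "{authors} ({year}). {title}. {journal}, {volume}({issue}), {pages}",
    "{authors} {year}. {title}. {journal} {volume}: {pages}",
    "{authors}, {year}, {title}: {journal}, v. {volume}, p. {pages}",
)

LOPES = {
    "authors": "Lopes, M. J. N., Froehlich, C. G. & Dominguez, E.",
    "year": "2003",
    "title": "Description of the larva of Thraulodes schlingeri "
             "(Ephemeroptera, Leptophlebiidae)",
    "journal": "Iheringia, Sér. Zool.",
    "volume": "92", "issue": "2", "pages": "197-200",
}
HARRIMAN = {
    "authors": "Harriman, V. B., Galloway, T. D., Alisauskas, R. T., "
               "& Wobeser, G. A.",
    "year": "2011",
    "title": "Description of the larva of Ceratophyllus vagabundus "
             "vagabundus (Siphonaptera: Ceratophyllidae) from nests of "
             "Rossʼs and lesser snow geese in Nunavut, Canada",
    "journal": "The Journal of parasitology",
    "volume": "93", "issue": "2", "pages": "197-200",
}
REFERENCES = [LOPES, HARRIMAN]


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def candidate_strings(reference):
    """Step 2: render one reference in every known citation style."""
    return [normalise(style.format(**reference)) for style in STYLES]


def best_match(citation, references, threshold=0.9):
    """Step 3: return (reference, score), or (None, score) below threshold."""
    target = normalise(citation)
    best_score, best_ref = 0.0, None
    for reference in references:
        for candidate in candidate_strings(reference):
            matcher = SequenceMatcher(None, target, candidate, autojunk=False)
            score = matcher.ratio()
            if score > best_score:
                best_score, best_ref = score, reference
    if best_score >= threshold:
        return best_ref, best_score
    return None, best_score


# A citation string with an OCR error ("0" for "o") and a lost accent.
noisy = ("Lopes, M. J. N., Froehlich, C. G. & Dominguez, E. 2003. "
         "Descripti0n of the larva of Thraulodes schlingeri (Ephemeroptera, "
         "Leptophlebiidae). Iheringia, Ser. Zool. 92: 197-200")
print(best_match(noisy, REFERENCES))
```

Comparing against every candidate string like this won't scale to a large database; in practice you would index the generated strings first and only score the nearest few.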

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.

Might not be elegant, but I suspect it would be fast.