
Phantom articles: why Mendeley needs to make duplication transparent

Browsing Mendeley I found the following record: http://www.mendeley.com/research/description-larva/. This URL is for a paper
Costa, J. M., & Santos, T. C. (2008). Description of the larva of. Zootaxa, 99(2), 129-131
which apparently has the DOI doi:10.1645/GE-2580.1. This is strange because Zootaxa doesn't have DOIs. The DOI given resolves to a paper in the Journal of Parasitology:
Harriman, V. B., Galloway, T. D., Alisauskas, R. T., & Wobeser, G. A. (2011). Description of the larva of Ceratophyllus vagabundus vagabundus (Siphonaptera: Ceratophyllidae) from nests of Rossʼs and lesser snow geese in Nunavut, Canada. The Journal of parasitology, 93(2), 197-200
Now, this paper has its own record in Mendeley.

OK, so this is weird..., but it gets weirder. If you look at the Mendeley page for this chimeric article there is a PDF preview of yet another article:
LOPES, Maria José Nascimento; FROEHLICH, Claudio Gilberto and DOMINGUEZ, Eduardo (2003). Description of the larva of Thraulodes schlingeri (Ephemeroptera, Leptophlebiidae). Iheringia, Sér. Zool. 92(2), 197-200. doi:10.1590/S0073-47212003000200011

But it gets even more interesting. The abstract for the phantom Zootaxa article belongs to yet another paper:
Marques, K. I. D. S., & Xerez, R. D. Description of the larva of Popanomyia kerteszi James & Woodley (Diptera: Stratiomyidae) and identification key to immature stages of Pachygastrinae. Neotropical Entomology, 38(5), 643-648.
which also exists in Mendeley.

To investigate further I used Mendeley's API to retrieve this record. To do this I had to look at the source of the web page to find the internal identifier used by Mendeley, namely 010c48d0-edb5-11df-99a6-0024e8453de6 (why does Mendeley hide these?). Here's the abbreviated JSON for this record.

{
...
"website": "http:\/\/www.ncbi.nlm.nih.gov\/pubmed\/21506868",
"identifiers": {
"pmid": "21506868",
"issn": "19372345",
"doi": "10.1645\/GE-2580.1"
},
...
"issue": "2",
"pages": "129-131",
"public_file_hash": "fe7eed3f6c43a3be1480a0937229b9ad33666df4",
"publication_outlet": "Zootaxa",
"type": "Journal Article",
"mendeley_url": "http:\/\/www.mendeley.com\/research\/description-larva\/",
"uuid": "010c48d0-edb5-11df-99a6-0024e8453de6",
"authors": [
{
"forename": "J M",
"surname": "Costa"
},
{
"forename": "T C",
"surname": "Santos"
}
],
"title": "Description of the larva of",
"volume": "99",
"year": 2008,
"categories": [
39,
203,
37,
52,
43,
40,
210
],
"oa_journal": false
}

This doesn't add much to the story, but it does give us the SHA-1 hash of the PDF attached to the chimeric article (fe7eed3f6c43a3be1480a0937229b9ad33666df4). If I download the PDF for the article in Iheringia, Sér. Zool., it has the same SHA-1:


openssl sha1 a11v93n2.pdf
SHA1(a11v93n2.pdf)= fe7eed3f6c43a3be1480a0937229b9ad33666df4
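
The same check can be scripted, which is handy if you want to compare many PDFs against the public_file_hash values returned by the API. What follows is only a minimal sketch using Python's standard hashlib module; the function names are made up for illustration and are not part of any Mendeley tooling.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_public_file_hash(path, public_file_hash):
    """True if the local file hashes to the given public_file_hash value."""
    return sha1_of_file(path) == public_file_hash.lower()
```

Calling matches_public_file_hash("a11v93n2.pdf", "fe7eed3f6c43a3be1480a0937229b9ad33666df4") on the downloaded Iheringia PDF would then be the scripted equivalent of the openssl command above.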

This article doesn't exist
So, to summarise, this paper doesn't exist. It is credited to a journal that doesn't have DOIs, the DOI resolves to an article in a different journal, the abstract comes from another article in another journal, and the PDF is from a third article. OMG!

This is just weird
So, something about the way Mendeley merges references is broken. Merging references is a tough problem, so there will always be cases where things go wrong. But it would be really, really helpful if Mendeley could display the set of articles that it has merged to create each canonical reference (say by listing the UUIDs for each article). Users could then see if badness had happened and provide feedback, for example by highlighting references that are clearly the same, and those that are clearly different. Until this happens I'm a bit nervous about trusting Mendeley with my bibliographic data. I don't want it mangled into chimeric papers that don't exist.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, and there might be errors in the text itself (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.

Might not be elegant, but I suspect it would be fast.
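
To make the three steps above concrete, here is a rough Python sketch. It is only an illustration under several assumptions: the "database" is two references copied from the post above, the three citation styles are invented templates, the similarity measure is difflib's SequenceMatcher ratio (one of many possible approximate string matchers), and the 0.8 threshold is arbitrary.

```python
from difflib import SequenceMatcher

# Step 1: a stand-in for the reference database (two references as cited above).
REFERENCES = {
    "harriman2011": {
        "authors": "Harriman, V. B., Galloway, T. D., Alisauskas, R. T., & Wobeser, G. A.",
        "year": "2011",
        "title": "Description of the larva of Ceratophyllus vagabundus vagabundus "
                 "(Siphonaptera: Ceratophyllidae) from nests of Ross's and lesser "
                 "snow geese in Nunavut, Canada",
        "journal": "The Journal of parasitology",
        "volume": "93", "issue": "2", "pages": "197-200",
    },
    "lopes2003": {
        "authors": "Lopes, M. J. N., Froehlich, C. G., & Dominguez, E.",
        "year": "2003",
        "title": "Description of the larva of Thraulodes schlingeri "
                 "(Ephemeroptera, Leptophlebiidae)",
        "journal": "Iheringia, Sér. Zool.",
        "volume": "92", "issue": "2", "pages": "197-200",
    },
}

# Step 2: hypothetical citation styles; the last one is a "microcitation".
STYLES = [
    "{authors} ({year}). {title}. {journal}, {volume}({issue}), {pages}",
    "{authors} {year}. {title}. {journal} {volume}: {pages}",
    "{journal} {volume}: {pages}",
]


def normalise(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def candidate_citations(references, styles):
    """Yield (reference id, citation string) for every reference in every style."""
    for ref_id, ref in references.items():
        for style in styles:
            yield ref_id, normalise(style.format(**ref))


def best_match(query, references=REFERENCES, styles=STYLES, threshold=0.8):
    """Step 3: return the id of the closest generated citation, or None."""
    query = normalise(query)
    best_id, best_score = None, 0.0
    for ref_id, candidate in candidate_citations(references, styles):
        # autojunk=False keeps the ratio meaningful for long citation strings.
        score = SequenceMatcher(None, query, candidate, autojunk=False).ratio()
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id if best_score >= threshold else None
```

This sketch scans every candidate string for every query, so it would not scale to a large database as written. Some kind of index over the generated strings (for example on character n-grams) would presumably be needed to narrow down the candidates first.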

More BHL app ideas

Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that this is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

N2 w1150

which gives you no idea that it contains images like this:

n24_w1150

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter, @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
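
As a rough illustration of the second option, the sketch below builds an OpenURL that carries the originating page, then recovers that page from the link as a harvester might. The key names (url_ver, rft.jtitle, rft.volume, rft.spage, rft.date, rfr_id) come from the OpenURL 1.0 key/value format. Everything else is an assumption: the resolver address, the idea of putting a plain page URL in rfr_id, and that BHL would record it. The function names and example values are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed resolver endpoint; check BHL's own documentation before relying on it.
RESOLVER = "https://www.biodiversitylibrary.org/openurl"


def build_openurl(journal, volume, start_page, year, referring_page):
    """Build an OpenURL for an article, tagged with the page that links to it."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft.jtitle": journal,
        "rft.volume": volume,
        "rft.spage": start_page,
        "rft.date": year,
        # Assumption: the originating page URL is carried in the referrer field.
        "rfr_id": referring_page,
    }
    return RESOLVER + "?" + urlencode(params)


def referring_page(openurl):
    """Recover the originating page from an inbound OpenURL, or None if absent."""
    query = parse_qs(urlsplit(openurl).query)
    values = query.get("rfr_id")
    return values[0] if values else None


# Hypothetical example: a wiki page linking to an article by journal, volume and page.
link = build_openurl("Example Journal", "12", "34", "1900",
                     "https://species.wikimedia.org/wiki/Example_page")
```

In a MediaWiki-based project the equivalent of build_openurl would live in a template, so editors would only supply the citation details and the template would fill in the page URL.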