Showing posts with label citation matching. Show all posts

Resolving free-form citations

CrossRef have released CrossRef Metadata Search, a nice tool that can take a free-form citation and return possible matches from CrossRef's database. If you get a match, CrossRef can take the DOI and format it for you in a variety of styles using DOI content negotiation.
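As a rough illustration of the content negotiation step, here is a minimal Python sketch that builds (but does not send) a request asking the DOI resolver for a formatted citation. The function name, the `doi.org` resolver address, and the placeholder DOI are assumptions for illustration; the `text/x-bibliography` media type with a `style` parameter is the one documented for DOI content negotiation, and `apa` is just one example style.

```python
from urllib.parse import quote
from urllib.request import Request


def build_formatted_citation_request(doi, style="apa"):
    """Build (but do not send) a content-negotiation request for a DOI.

    The Accept header asks the resolver for a formatted bibliography
    entry in the given citation style instead of the usual redirect.
    """
    url = "https://doi.org/" + quote(doi, safe="/")
    accept = "text/x-bibliography; style=" + style
    return Request(url, headers={"Accept": accept})


# Placeholder DOI, not a real article.
req = build_formatted_citation_request("10.1000/example-doi")
print(req.full_url)
print(req.get_header("Accept"))

# To actually fetch the formatted citation (needs network access):
#   from urllib.request import urlopen
#   text = urlopen(req).read().decode("utf-8")
```

Swapping the `style` argument for another style name would request the same reference in a different format.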

If, like me, you spend a lot of time trying to find DOIs (and other identifiers) for articles by first parsing citations into their component parts, then this is good news. It's also good news for publishers that may balk at one of CrossRef's requirements for joining its club: if you want DOIs for your articles it's not enough to submit metadata for your article, you also need to submit the list of references that article cites, including their DOIs. This requirement enables CrossRef to offer their "cited by" service, but imposes a burden on smaller journals operating on a tight budget (e.g., Zootaxa). With CrossRef Metadata Search you can just send author-supplied citation strings from the manuscript and have a good chance of finding the corresponding DOI, if it exists.

Of course, the service only works if the article has a DOI, so it's not a complete solution to the problem of parsing bibliographic citations into their component parts. But it's a nice model, and I'm tempted to apply the same approach to my databases, such as BioStor or my ever growing Mendeley library (which is larger than the Mendeley desktop client can easily handle). A quick way to do this would be to use Cloudant, which has cloud-based CouchDB coupled with a Lucene-based fulltext search engine. If I've time I may try and put a demo together.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, or it might contain errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.
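To make the regular expression approach concrete, here is a minimal Python sketch for a single, assumed citation style ("Authors (Year) Title. Journal Volume: Start-End."). The pattern, the function name, and the example citations are all made up for illustration; a real parser would need many such patterns.

```python
import re

# Hypothetical pattern for one style:
#   Authors (Year) Title. Journal Volume: StartPage-EndPage.
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+):\s*"
    r"(?P<spage>\d+)\s*[-–]\s*(?P<epage>\d+)\.?$"
)


def parse_citation(text):
    """Return a dict of component parts, or None if the style doesn't fit."""
    m = CITATION_RE.match(text.strip())
    return m.groupdict() if m else None


# Made-up citation that follows the assumed style.
good = "Smith J, Jones A (1999) A new species of frog. Journal of Examples 12: 34-56."
# Same made-up reference written in a slightly different style.
other = "Smith J, Jones A. 1999. A new species of frog. Journal of Examples, 12, 34-56."

parsed = parse_citation(good)
print(parsed)
print(parse_citation(other))  # the pattern does not fit this style
```

The second citation describes the same reference but yields no match, which is the fragility described above: each new style (or OCR error) needs another rule.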

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.
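
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reference records and the two style templates are made up, the 0.85 threshold is arbitrary, and `difflib.SequenceMatcher` from the standard library stands in for whatever approximate string matching one would really use.

```python
from difflib import SequenceMatcher

# Step 1: made-up records standing in for the reference database.
REFERENCES = [
    {"id": "ref1", "authors": "Smith J, Jones A", "year": "1999",
     "title": "A new species of frog", "journal": "Journal of Examples",
     "volume": "12", "pages": "34-56"},
    {"id": "ref2", "authors": "Brown B", "year": "2005",
     "title": "Notes on island beetles", "journal": "Example Bulletin",
     "volume": "3", "pages": "1-10"},
]

# Step 2: two illustrative citation styles; a real set would be larger.
STYLES = [
    "{authors} ({year}) {title}. {journal} {volume}: {pages}.",
    "{authors}. {year}. {title}. {journal}, {volume}, {pages}.",
]


def normalise(s):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(s.lower().split())


def build_index(references, styles):
    """Generate one candidate citation string per (reference, style) pair."""
    return [(normalise(style.format(**ref)), ref["id"])
            for ref in references for style in styles]


def best_match(query, index, threshold=0.85):
    """Step 3: return (reference id or None, best similarity score)."""
    q = normalise(query)
    best_score, best_id = 0.0, None
    for candidate, ref_id in index:
        score = SequenceMatcher(None, q, candidate).ratio()
        if score > best_score:
            best_score, best_id = score, ref_id
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


index = build_index(REFERENCES, STYLES)
# Query with a typo ("speces") and a missing final period.
hit_id, hit_score = best_match(
    "Smith J, Jones A (1999) A new speces of frog. Journal of Examples 12: 34-56",
    index)
miss_id, miss_score = best_match("Completely unrelated text about something else", index)
print(hit_id, miss_id)
```

The linear scan over every candidate is the simplest possible choice; for a large database one would presumably put the generated strings into a search index first and only score the top few hits.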

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
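
For microcitations, a minimal Python sketch of the same idea might look like the following. Everything here is an assumption for illustration: the reference record, the journal abbreviation, the two microcitation layouts, and the choice of exact lookup on normalised strings rather than approximate matching.

```python
import re

# Made-up reference with a page range; "abbrev" is a hypothetical abbreviation.
REFERENCES = [
    {"id": "ref1", "journal": "Journal of Examples", "abbrev": "J. Ex.",
     "volume": "12", "year": "1999", "spage": 34, "epage": 56},
]


def normalise(s):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())


def microcitations(ref):
    """Yield the microcitations we might expect for any page of a reference."""
    for name in (ref["journal"], ref["abbrev"]):
        for page in range(ref["spage"], ref["epage"] + 1):
            yield f"{name} {ref['volume']}: {page}"
            yield f"{name} {ref['volume']}: {page} ({ref['year']})"


def build_lookup(references):
    """Map each normalised microcitation to a reference id (first one wins)."""
    lookup = {}
    for ref in references:
        for s in microcitations(ref):
            lookup.setdefault(normalise(s), ref["id"])
    return lookup


lookup = build_lookup(REFERENCES)
print(lookup.get(normalise("J. Ex. 12: 40")))         # page inside the range
print(lookup.get(normalise("J Ex 12, 40 (1999)")))    # different punctuation
print(lookup.get(normalise("J. Ex. 12: 60")))         # page outside the range
```

Because matching is a dictionary lookup, the cost sits in generating the strings up front rather than in each query.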

Might not be elegant, but I suspect it would be fast.