Showing posts with label citation.

Citations, Social Media & Science

Quick note that Morgan Jackson (@BioInFocus) has written a nice blog post, Citations, Social Media & Science, inspired by the fact that the following paper:

Kwong, S., Srivathsan, A., & Meier, R. (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, no–no. doi:10.1111/j.1096-0031.2012.00408.x

cites my "Dark taxa" in the body of the text but not in the list of literature cited. This prompted some discussion of DOIs and blog posts on Twitter:



Read Morgan's post for more on this topic. While I personally would prefer to see my blog posts properly cited in papers like doi:10.1111/j.1096-0031.2012.00408.x, I suspect the authors did what they could given current conventions (blogs lack DOIs, are treated differently from papers, and many publishers cite URLs in the text, not the list of references cited). If we can provide DOIs (ideally from CrossRef so we become part of the regular citation network), suitable archiving, and — most importantly — content that people consider worthy of citation then perhaps this practice will change.

Yet another reason why we need specimen identifiers, now!

This message appeared on the TAXACOM mailing list:

It is getting more and more necessary for taxonomists to demonstrate
that they are useful and used. This does not only apply to the
individual scientists, but also to institutions with taxonomic
collections, such as museums and herbaria.

In an attempt to live up to that increasing demand for documentation,
the leadership of the Natural History Museum of Denmark has issued an
order to its curatorial staff - The staff members are requested to
document which publications from 2011, written entirely by external
scientists, that in one way or another are based on material in the
collections of the Museum.


Given that most specimens lack resolvable digital identifiers (a theme I've harped on about before, most recently in the context of DNA barcoding), answering this kind of query ends up being a case of searching publications for text strings that contain the acronym of the collection. The sender of the message, Ib Friis, is alarmed at this prospect:

In publications, material from our herbarium at "C" is normally referred
to in text strings of one of the following forms: "(C)", "(C, ", ", C,"
or " C)". But a search in for example Google Scholar or other search
engines result in overflow of thousands and thousands of hits, even
when these text strings are combined with other relevant words such as
"botany", "plants", etc.
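To see why such searches overflow, here's a toy sketch (the patterns and example sentences are my own invention) of just how ambiguous these text strings are:

```python
import re

# The citation-style patterns Friis describes for the Copenhagen
# herbarium ("C"). They are so short that they match far more than
# specimen citations. Patterns and examples invented for illustration.
PATTERNS = [r"\(C\)", r"\(C, ", r", C,", r" C\)"]

def mentions_herbarium(text):
    """True if any of the citation-style patterns occurs in the text."""
    return any(re.search(p, text) for p in PATTERNS)

# A genuine specimen citation matches...
assert mentions_herbarium("Friis 1234 (C); Bidgood 567 (K, C, EA)")
# ...but so does completely unrelated text:
assert mentions_herbarium("at a temperature T (C) the rate doubles")
```

Adding keywords like "botany" narrows things a little, but the fundamental problem remains: the strings themselves carry almost no information, which is exactly why resolvable specimen identifiers would help.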


In an earlier paper "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (http://dx.doi.org/10.1093/bib/bbn022) (free preprint available here: hdl:10101/npre.2008.1760.1) I argued that having resolvable identifiers for specimens could enable measures of "citation" to be computed for specimens (and data derived from those specimens). Just as we have citation counts for articles and impact factors for journals, we could have equivalent measures for specimens and collections. These measures may keep administrators happy; for scientists, I think the real benefits will be the ability to trace the provenance of some data, and the fate of data they themselves have collected or published.

For things such as publications it is trivial to track their usage. For example, to find the number of times the article "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" has been cited, I simply enter the DOI into Google Scholar, e.g. http://scholar.google.co.uk/scholar?q=10.1093/bib/bbn022. Imagine being able to do the same for specimens.

For this to happen, museum specimens need digital identifiers. If museums are serious about quantifying the impact of their collections, they should make assigning digital identifiers a priority.

BHL needs to engage with publishers (and EOL needs to link to primary literature)

Browsing EOL I stumbled upon the recently described fish Protoanguilla palau, shown below in an image by rairaiken2011:
Palauan Primitive Cave Eel

Two things struck me. The first is that the EOL page for this fish gives absolutely no clue as to where you would go to find out more about it (apart from an unclickable link to the Wikipedia page http://en.wikipedia.org/wiki/Protoanguilla - seriously, a link that isn't clickable?), despite the fact that this fish was recently described in an Open Access publication ("A 'living fossil' eel (Anguilliformes: Protanguillidae, fam. nov.) from an undersea cave in Palau", http://dx.doi.org/10.1098/rspb.2011.1289).

Now that I've got my customary grumble about EOL out of the way, let's look at the article itself. On the first page of the PDF it states:
This article cites 29 articles, 7 of which can be accessed free
http://rspb.royalsocietypublishing.org/content/early/2011/09/16/rspb.2011.1289.full.html#ref-list-1

So 22 of the articles or books cited in this paper are, apparently, not freely available. However, looking at the list of literature cited it becomes obvious that rather more of these citations are available online than we might think. For example, there are articles that are in the Biodiversity Heritage Library (BHL), e.g.


Then there are articles that are available in other digitising projects:

  • Hay O. P. 1903 On a collection of Upper Cretaceous fishes from Mount Lebanon, Syria, with descriptions of four new genera and nineteen new species. Bull. Am. Mus. Nat. Hist. N. Y. 19, 395–452. http://hdl.handle.net/2246/1500
  • Nelson G. J. 1966 Gill arches of fishes of the order Anguilliformes. Pac. Sci. 20, 391–408. http://hdl.handle.net/10125/7805

Furthermore, there are articles that aren't necessarily free, but which have been digitised and have DOIs that have been missed by the publisher, such as the Regan paper above, and


So, the Proceedings of the Royal Society has underestimated just how many citations the reader can view online. The problem, of course, is how does a publisher discover these additional citations? Some have been missed because of sloppy bibliographic data. The missing DOIs are probably because the Regan citation lacks a volume number, and the Trewavas paper uses a different volume number to that used by Wiley (who digitised Proc. Zool. Soc. Lond.). But the content in BHL and other digital archives will be missed because finding these is not part of a publisher's normal workflow. Typically citations are matched by using services ultimately provided by CrossRef, and the bulk of BHL content is not in CrossRef.

So it seems there's an opportunity here for someone to provide a service for publishers that adds value to their content in at least three ways:
  1. Add missing DOIs due to problematic citations for older literature
  2. Add links to BHL content
  3. Add links to content in additional digitisation projects, such as journal archives in DSpace repositories
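As a sketch of what step 1 might look like, such a service could query CrossRef's free bibliographic search for each citation string that lacks a DOI. This only builds the request URL; a real service would fetch it and vet the top hit's relevance score before accepting a match, and note that using the CrossRef REST API's query.bibliographic parameter here is my own suggestion, not something any publisher necessarily does:

```python
from urllib.parse import urlencode

# Sketch of step 1: look up a free-text citation against CrossRef's
# bibliographic search to recover a DOI missed by the publisher. Only
# the request URL is built here; fetching and scoring the result is
# left to the caller.
def crossref_query_url(citation, rows=1):
    """Build a CrossRef works query for a free-text citation string."""
    params = urlencode({"query.bibliographic": citation, "rows": rows})
    return "https://api.crossref.org/works?" + params

url = crossref_query_url(
    "Hay 1903 On a collection of Upper Cretaceous fishes from Mount Lebanon")
assert url.startswith("https://api.crossref.org/works?query.bibliographic=")
```

Steps 2 and 3 are harder precisely because BHL and most repository content isn't in CrossRef, so there is no equivalent query service to lean on.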

For readers this would enhance their experience (more of the literature becomes accessible to them), and for BHL and the repositories it will drive more readers to those repositories (how many people reading the paper on Protoanguilla palau have even heard of BHL?). I've said most of this before, but I really think there's an opportunity here to provide services to the publishing industry, and we don't seem to be grasping it yet.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).




To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone — the citation might not exactly follow the rules, and there might be errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
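This generate-and-match idea can be sketched in a few lines of Python (the reference database, the two citation "styles", and the similarity threshold here are all invented for illustration; a real system would use CSL styles and a far larger database):

```python
import difflib

# A tiny "database" of references, keyed by identifier.
REFERENCES = {
    "doi:10.1111/j.1096-0031.2012.00408.x": {
        "authors": "Kwong, S., Srivathsan, A., & Meier, R.",
        "year": "2012",
        "title": "An update on DNA barcoding",
        "journal": "Cladistics",
    },
}

def generate_strings(ref):
    """Render a reference in a couple of common citation styles."""
    return [
        "{authors} ({year}) {title}. {journal}.".format(**ref),
        "{authors} {year}. {title}. {journal}.".format(**ref),
    ]

def match_citation(citation, threshold=0.8):
    """Return the identifier of the closest generated string, if any."""
    best_id, best_score = None, 0.0
    for ident, ref in REFERENCES.items():
        for s in generate_strings(ref):
            score = difflib.SequenceMatcher(
                None, citation.lower(), s.lower()).ratio()
            if score > best_score:
                best_id, best_score = ident, score
    return best_id if best_score >= threshold else None

assert match_citation("Kwong S, Srivathsan A, & Meier R (2012) "
                      "An update on DNA barcoding. Cladistics.") \
    == "doi:10.1111/j.1096-0031.2012.00408.x"
```

Note that the input string above uses different punctuation from either generated style, yet approximate matching still recovers the identifier.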

Might not be elegant, but I suspect it would be fast.

What is the best way to measure academic outputs that aren't publications?

My institute is going through various reviews of staff performance and, frankly, I'm feeling somewhat vulnerable given my somewhat unorthodox (at least amongst my colleagues) approach to doing science. I spend way more time writing code, building databases and web sites, and blogging than writing papers and getting grants (although I have been known to do both).

So the issue becomes, how to demonstrate that coding, building websites, and ranting on my blog is a worthwhile thing to do? Now, I'm happy that what I do has value, but my happiness isn't the issue. It's convincing people who want to see papers in high impact journals and bums on seats in labs that there are other ways to generate scientific output, and that this output can have value. I'm also concerned that a simplistic view of what constitutes valid outputs will stifle innovation, just at the time when traditional science publishing is undergoing a revolution.

So, I posted a question on Quora: What is the best way to measure academic outputs that aren't publications?, where I wrote:
Usually we assess the quality of academic output using measures based on citations, either directly (how many papers have cited the paper?) or indirectly (is the paper published in a journal like Nature or Science that contains papers that on average get lots of citations, i.e. "impact factor"). But what of other outputs, such as web sites, databases, and software? These outputs often require considerable work, and can be widely used. What is the best way to measure those outputs?


There have been various approaches to measuring the impact of an article other than using citations, such as the number of article downloads, or the number of times an article has been bookmarked on a site such as Mendeley or CiteULike. But what of the coding, the database development, the web sites, and the blog posts? How can I show that these have value?

I guess there are two things here. One is the need to be able to compare across outputs, which is tricky (comparing citations across different disciplines is already hard), the other is the need to be able to compare within broadly similar outputs. Here are some quick thoughts:

Web sites
An obvious approach is to use Google Analytics to harvest information about page views and visitor numbers. The geographic origin of those visitors could be used to make a case for whether the research/data on that site is internationally relevant, although I suspect "internationally relevant" is a somewhat suspect notion. Most academic specialities are narrow, such that the person most interested in your research is likely living in a different country, hence by definition most research will be internationally "relevant".

The advantage of Google Analytics is that it is widely used, hence you could get comparative data and be able to show that your web site is more (or less) used than another site.

Code
The value of code is tricky, but tools like ohloh provide estimates of the effort and expense required to generate code for a project. For example, for my bioGUID code repository (which includes code for bioGUID and BioStor, as well as some third party code) ohloh's estimated cost is 87 person-years and $US 4,784,203. OK, silly numbers, but at least I can compare these with other projects (Drupal, for example, represents 153 years and $US 8,438,417 of investment).


Comparing across output categories will be challenging, especially as there is no obvious equivalent of citation (one reason why, if you develop software or a web site, it makes good sense to write a paper describing it; this worked for me). But perhaps download or article access statistics could provide a way to say "my web site is worth x publications". Note also that I'm not arguing that any of these measures is actually a good thing, just that if I'm going to be measured, and I have some say in how I'm measured, I'd like to suggest something sensible that others might actually buy.

So, please feel free to comment either here or on Quora. I need to put together some notes to make the case that people like me aren't just sitting drinking coffee, playing loud music, and tweeting without, you know, actually making stuff.

Data matters but do data sets?

Interest in archiving data and data publication is growing, as evidenced by projects such as Dryad, and earlier tools such as TreeBASE. But I can't help wondering whether this is a little misguided. I think the issues are granularity and reuse.

Taking the second issue first, how much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses.

Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much").

But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would need to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

To me, citing data sets makes almost as much sense as citing journal volumes - the level of granularity is wrong. Journal volumes are largely arbitrary collections of articles, it's the articles that are the typical unit of citation. Likewise I think sequences will be cited more often than alignments.

It might be argued that there are disciplines where the dataset is the sensible unit, such as an ecological study of a particular species. Such a data set may lack obvious subsets, and hence it makes sense to cite it as a unit. But my expectation here is that such datasets will see limited re-use, for the very reason that they can't be easily partitioned and mashed up. Data sets, such as alignments, that are built from smaller, reusable units of data (i.e., sequences) can be recombined, trimmed, or merged, and hence can be readily re-used. Monolithic datasets with largely unique content can't be easily mashed up with other data.

Hence, my suspicion is that many data sets in digital archives will gather digital dust, and anyone submitting a data set in the expectation that it will be cited may turn out to be disappointed.

Touching citations on the iPad

Quick demo of the mockup I alluded to in the previous post. Here's a screen shot of the article "PhyloExplorer: a web server to validate, explore and query phylogenetic trees" (doi:10.1186/1471-2148-9-108) as displayed as a web-app on the iPad. You can view this at http://iphylo.org/~rpage/ipad/touch/ (you don't need an iPad, although it does work rather better on one).

[Screenshot of the web-app]
I've taken the XML for the article, and redisplayed it as HTML, with (most) of the citations highlighted in blue. If you touch one (or click on it if you're using a desktop browser) then you'll see a popover with some basic bibliographic details. For some papers which are Open Access I've extracted thumbnails of the figures, such as for "PhyloFinder: an intelligent search engine for phylogenetic tree databases" (doi:10.1186/1471-2148-8-90), shown above (and in more detail below):

[Screenshot of the citation popover]
The idea is to give the reader a sense of what the paper is about, beyond what can be gleaned from just the title and authors. The idea was inspired by the Biotext search engine from Marti Hearst's group, as well as Elsevier's "graphical abstract" noted by Alex Wild (@Myrmecos).

Here's a quick screencast showing it "live":



The next step is to enable the reader to then go and read this paper within the iPad web-app (doh!), which is fairly trivial to do, but it's Friday and I'm already late...

Viewing scientific articles on the iPad: iBooks

Apple's iBooks app is an ePub and PDF reader, and one could write a lengthy article about its interface. However, in the context of these posts on visualising the scientific article there's one feature that has particularly struck me. When reading a book that cites other literature, the citations are hyper-links: click on one and iBooks forwards you (via the page turning effect) to the reference in the book's bibliography. This can be a little jarring (one minute you're reading the page, next you're in the bibliography), but to help maintain context the reference is preceded by the snippet of text in which it is cited:

[Screenshot of a bibliography entry preceded by its citation context]

To make this concrete, here's an example from Clay Shirky's "Cognitive Surplus."

[Two-page screenshot: body text (left) and bibliography (right)]

In the body of the text (left) the text "notes in his book The Success of Open Source" (which I've highlighted in blue) is a hyper-link. Click on it, and we see the source of the citation (right), together with the text that formed the hyper-link. This context helps remind you why you wanted to follow up the citation, and also provides the way back to the text: click on the context snippet and you're taken back to the original page.

Providing context for a citation is a nice feature, and there are various ways to do this. For example, the Elsevier Life Sciences Challenge entry by Wan et al. ("Supporting browsing-specific information needs: Introducing the Citation-Sensitive In-Browser Summariser", doi:10.1016/j.websem.2010.03.002, see also an earlier version on CiteSeer) takes a different approach. Rather than provide local context for a citation in an article (a la iBooks), Wan et al. provide context-sensitive summaries of the reference cited to help the reader judge whether it's worth her time to fetch the reference and read it.

Both of these approaches suggest that we could be a lot more creative about how we display and interact with citations when viewing an article.

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of these strings was associated with an identifier (such as a DOI), or if one of the strings in the cluster had been successfully parsed into its component metadata so we could find it using an OpenURL resolver, then we've identified the reference the strings correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).
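Here's a toy sketch of that clustering step, assuming a simple greedy grouping by string similarity, with a cluster inheriting the identifier of any "seed" string it contains (the threshold and the example data are invented for illustration):

```python
import difflib

# Toy sketch: harvested citation strings are greedily grouped by string
# similarity, and a cluster inherits an identifier if any of its
# members is a seed string with a known identifier.
def cluster_strings(strings, threshold=0.85):
    clusters = []  # each cluster: {"members": [...], "id": str or None}
    for text, ident in strings:
        for c in clusters:
            rep = c["members"][0]
            ratio = difflib.SequenceMatcher(
                None, text.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                c["members"].append(text)
                c["id"] = c["id"] or ident
                break
        else:
            clusters.append({"members": [text], "id": ident})
    return clusters

# A "known" citation string seeded with its identifier...
seeds = [("Nelson GJ 1966 Gill arches of fishes of the order "
          "Anguilliformes. Pac Sci 20:391-408", "hdl:10125/7805")]
# ...and a variant harvested from a bibliography, with no identifier.
harvested = [("Nelson, G.J. 1966. Gill arches of fishes of the order "
              "Anguilliformes. Pac. Sci. 20, 391-408.", None)]

clusters = cluster_strings(seeds + harvested)
assert len(clusters) == 1 and clusters[0]["id"] == "hdl:10125/7805"
```

The lookup tool then reduces to: find the closest cluster for the user's string, and return that cluster's identifier.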

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
[Diagram of canopy clustering]
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...
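A rough sketch of the canopy idea: use a cheap token-overlap (Jaccard) measure to throw strings into loose canopies, then run the expensive string comparison only within each canopy (both thresholds and all example strings are invented for illustration):

```python
import difflib
import re

def tokens(s):
    """Cheap representation of a citation string: its set of word tokens."""
    return set(re.findall(r"\w+", s.lower()))

def canopies(strings, loose=0.3):
    """Group strings into loose canopies by Jaccard similarity of tokens."""
    groups = []
    for s in strings:
        for g in groups:
            shared = tokens(s) & tokens(g[0])
            union = tokens(s) | tokens(g[0])
            if len(shared) / len(union) >= loose:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

def match_within(canopy, tight=0.85):
    """Run the expensive pairwise comparison only inside one canopy."""
    pairs = []
    for i, a in enumerate(canopy):
        for b in canopy[i + 1:]:
            if difflib.SequenceMatcher(None, a, b).ratio() >= tight:
                pairs.append((a, b))
    return pairs

strings = [
    "Hay OP 1903 Upper Cretaceous fishes from Mount Lebanon Syria",
    "Hay, O.P. 1903. Upper Cretaceous fishes from Mount Lebanon, Syria.",
    "Nelson GJ 1966 Gill arches of fishes of the order Anguilliformes",
]
groups = canopies(strings)
assert len(groups) == 2          # the two Hay variants share a canopy
assert len(match_within(groups[0])) == 1
```

With thousands of strings the saving comes from never running the quadratic, character-level comparison across canopies.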

Google Scholar metadata quality and Mendeley hype

Hot on the heels of Geoffrey Nunberg's essay about the train wreck that is Google books metadata (see my earlier post) comes Google Scholar’s Ghost Authors, Lost Authors, and Other Problems by Péter Jacsó. It's a fairly scathing look at some of the problems with the quality of Google Scholar's metadata.

Now, Google Scholar isn't perfect, but it's come to play a key role in a variety of bibliographic tools, such as Mendeley, and Papers. These tools do a delicate dance with Google Scholar who, strictly speaking, don't want anybody scraping their content. There's no API, so Mendeley, Papers (and my own iSpecies) have to keep up with the HTML tweaks that Google introduces, pretend to be web browsers, fuss with cookies, and try to keep the rate of queries below the level at which the Google monster stirs and slaps them down.

Jacsó's critique also misses the main point. Why do we have free (albeit closed) tools like Google Scholar in the first place? It's largely because scientists have ceded the field of citation analysis to commercial companies, such as Elsevier and Thompson Reuters. To echo Martin Kalfatovic's comment:
Over the years, we've (librarians and the user community) have allowed an important class of metadata - specifically the article level metadata - migrate to for profit entities.
Some visionaries, such as Robert Cameron in his A Universal Citation Database as a Catalyst for Reform in Scholarly Communication, argued for free, open citation databases, but this came to nought.

For me, this is the one thing the ridiculously over-hyped Mendeley could do that would merit the degree of media attention it is getting -- be the basis of an open citation database. It would need massive improvement to its metadata extraction algorithms, which currently suck (Google Scholar's, for all Jacsó's complaints, are much better), but it would generate something of lasting value.




Scientific citations in Wikipedia

While thinking about measuring the quality of Wikipedia articles by counting the number of times they cite external literature, and conversely measuring the impact of papers by how many times they're cited in Wikipedia, I discovered, as usual, that somebody has already done it. I came across this nice paper by Finn Årup Nielsen (arXiv:0705.2106v1) (originally published in First Monday as a HTML document, I've embedded the PDF from arXiv below).

Nielsen retrieved 30,368 citations from Wikipedia, and summarised how many times each journal is cited within Wikipedia. He then compared this with a measure of citations within the scientific literature by multiplying the journal's impact factor by the total number of citations. In general there's a pretty good correlation.
[Figure: Wikipedia citation counts plotted against citations within the scientific literature]


What is striking to me is that
When individual journals are examined Wikipedia citations to astronomy journals stand out compared to the overall trend (Figure 2). Also Australian botany journals received a considerable number of citations, e.g., Nuytsia (101 [citations]), in part due to concerted effort for the genus Banksia, where several Wikipedia articles for Banksia species have reached "featured article" status.


In the diagram, note also that Australian Systematic Botany (ISSN 1030-1887), which has an impact factor of 1.351, is punching well above its weight in Wikipedia. What I want to find out is whether this is true for other taxonomic journals. Nielsen's study was based on a Wikipedia dump from 2 April 2007, and a lot has been added since then (and the journal Zootaxa has become a major publisher of new taxonomic names).

But what I'm also wondering is whether this is not a great opportunity for the taxonomic community. By responding to {{citation needed}}, we can improve the quality of Wikipedia, and increase the visibility of our own work. Given that many Wikipedia taxon pages are in the top 10 Google hits {{citation needed}}, our work is but one click away from the Google results page. Instead of endlessly moaning about the low impact factor of taxonomic journals, we can actively do something that increases the quality and visibility of taxonomic information, and by extension, taxonomy itself.
