
Fuzzy matching taxonomic names using ngrams

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching, so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users. There are various ways to tackle this problem, and Tony Rees' TAXAMATCH is perhaps the best-known solution.

Cloudant supports Lucene for full text searching, but while this allows some approximate matching (by appending "~" to the search string), initial experiments suggested it wasn't going to be terribly useful. What does seem to work is to use ngrams. As a crude example, here is a CouchDB view that converts a string (in this case a taxon name) into a series of trigrams (three-letter strings), then indexes their concatenation.


{
  "_id": "_design/taxonname",
  "language": "javascript",
  "indexes": {
    "all": {
      "index": "function(doc) {
        if (doc.docType == 'taxonName') {
          var n = doc.nameComplete.length;
          var ngrams = [];
          for (var i = 0; i < n - 2; i++) {
            var ngram = doc.nameComplete.charAt(i) + doc.nameComplete.charAt(i+1) + doc.nameComplete.charAt(i+2);
            ngrams.push(ngram);
          }
          if (n > 2) {
            ngrams.push('$' + doc.nameComplete.charAt(0) + doc.nameComplete.charAt(1));
            ngrams.push(doc.nameComplete.charAt(n-2) + doc.nameComplete.charAt(n-1) + '$');
          }
          ngrams.sort();
          index(\"default\", ngrams.join(' '), {\"store\": \"yes\"});
        }
      }"
    }
  }
}

To search this view for a name I then generate trigrams for the query string (e.g., "Pomatomix" becomes "$Po Pom oma mat ato tom omi mix ix$" where "$" signals the start or end of the string) and search on that. For example, append this string to the URL of the CouchDB database to search for "Pomatomix":


_design/taxonname/_search/all?q=$Po%20Pom%20oma%20mat%20ato%20tom%20omi%20mix%20ix$&include_docs=true&limit=10
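
For completeness, here is a minimal sketch (not part of the original workflow) of how a client might build that query string in JavaScript; the function name makeTrigramQuery is mine:

function makeTrigramQuery(name) {
  var ngrams = [];
  if (name.length > 2) {
    // '$' marks the start of the string
    ngrams.push('$' + name.substr(0, 2));
  }
  // interior trigrams
  for (var i = 0; i < name.length - 2; i++) {
    ngrams.push(name.substr(i, 3));
  }
  if (name.length > 2) {
    // '$' marks the end of the string
    ngrams.push(name.substr(name.length - 2) + '$');
  }
  return ngrams.join(' ');
}

// makeTrigramQuery('Pomatomix') gives "$Po Pom oma mat ato tom omi mix ix$"
var q = encodeURIComponent(makeTrigramQuery('Pomatomix'));
var url = '_design/taxonname/_search/all?q=' + q + '&include_docs=true&limit=10';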


Initial results are promising (searching on bigrams generated an alarming number of matches that seemed rather dubious). I need to do some more work on this, but it might be a simple and quick way to support "did you mean?" for taxonomic names.

BioStor in the cloud

Quick note on an experimental version of BioStor that is (mostly) hosted in the cloud. BioStor currently runs on a Mac Mini and uses MySQL as the database. For a number of reasons (it's running on a Mac Mini, and my knowledge of optimising MySQL is limited) BioStor is struggling a bit. It's also gathered a lot of cruft as I've worked on ways to map article citations to the rather messy metadata in BHL.

So, I've started to play with a version that runs in the cloud using my favourite database, CouchDB. The data is hosted by Cloudant, which now provides full text search powered by Lucene. Essentially, I simply take article-level metadata from BioStor in BibJSON format and push it to Cloudant. I then wrote a simple wrapper around querying CouchDB, coupled that with the DocumentCloud viewer to display articles and citeproc-js to format the citations (not exactly fun, but someone is bound to ask for them), and we have a simple, searchable database of literature.
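
To give a rough idea of what gets pushed, an article record might look something like the sketch below. It follows BibJSON conventions, but the field names and values here are illustrative rather than copied from BioStor.

{
  "title": "An example article title",
  "author": [
    { "name": "A. N. Author" }
  ],
  "journal": {
    "name": "An example journal",
    "volume": "1",
    "pages": "1--10"
  },
  "year": "2009",
  "identifier": [
    { "type": "biostor", "id": "12345" }
  ]
}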

If you want to try the cloud-based version go to http://biostor-cloud.pagodabox.com/ (code on Github).


I've been wanting to do this for a while, partly because this is how I will implement my entry in EOL's computational data challenge, but also because CrossRef's Metadata search shows the power of finding references simply by using full text search (I've shamelessly borrowed some of the interface styling from Karl Ward's code). David Shorthouse demonstrates what you can do using CrossRef's tool in his post Conference Tweets in the Age of Information Overconsumption. Given how much time I spend trying to parse taxonomic citations and match them to articles in CrossRef's database, or BioStor, I'm looking forward to making this easier.

There are two major limitations of this cloud version of BioStor (apart from the fact that it has only a subset of the articles in BioStor). The first is that the page images are still being served from my Mac Mini, so they can be a bit slow to load. I've put the metadata and the search engine in the cloud, but not the images (we're talking a terabyte or two of bitmaps).

The other limitation is that there's no API. I hope to address this shortly, perhaps by mimicking the CrossRef API, so that code that talks to CrossRef could just as easily talk to BioStor.

Resolving free-form citations

CrossRef have released CrossRef Metadata Search, a nice tool that can take a free-form citation and return possible matches from CrossRef's database. If you get a match, CrossRef can take the DOI and format it for you in a variety of styles using DOI content negotiation.

If, like me, you spend a lot of time trying to find DOIs (and other identifiers) for articles by first parsing citations into their component parts, then this is good news. It's also good news for publishers that may balk at one of CrossRef's requirements for joining its club: if you want DOIs for your articles, it's not enough to submit metadata for the articles themselves, you also need to submit the list of references each article cites, including their DOIs. This requirement enables CrossRef to offer its "cited by" service, but imposes a burden on smaller journals operating on a tight budget (e.g., Zootaxa). With CrossRef Metadata Search you can just send author-supplied citation strings from the manuscript and have a good chance of finding the corresponding DOI, if it exists.

Of course, the service only works if the article has a DOI, so it's not a complete solution to being able to parse bibliographic citations into their component parts. But it's a nice model, and I'm tempted to apply the same approach to my databases, such as BioStor or my ever growing Mendeley library (which is larger than the Mendeley desktop client can easily handle). A quick way to do this would be to use Cloudant which has cloud-based CouchDB coupled with a Lucene-based fulltext search engine. If I've time I may try and put a demo together.

Exporting data from Australian Faunal Directory on CouchDB

Quick note to self about exporting data from my Australian Faunal Directory on CouchDB project. To export data from a CouchDB view you can use a list function (see Formatting with Show and List). Following the example on the Kanapes IDE blog, I created the following list function:

{
  "_id": "_design/publication",
  "_rev": "14-467dee8248e97d874f1141411f536848",
  "language": "javascript",
  "lists": {
    "tsv": "function(head, req) {
      var row;
      start({
        'headers': {
          'Content-Type': 'text/tsv'
        }
      });
      while (row = getRow()) {
        send(row.value + '\\t' + row.key + '\\n');
      }
    }"
  },
  "views": {
    ...
  }
}


I can use this function with a view that lists Australian Faunal Directory publications by UUID ("value"), indexed by DOI ("key").
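
The original post showed that view as a screenshot; a minimal sketch of the sort of map function involved (assuming each publication document stores its DOI in a doi field) is:

"doi": {
  "map": "function(doc) {
    if (doc.doi) {
      emit(doc.doi, doc._id);
    }
  }"
}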


I can get the tab-delimited dump from http://localhost:5984/afd/_design/publication/_list/tsv/doi. Note that instead of, say, /afd/_design/publication/_view/doi to get the view, we use /afd/_design/publication/_list/tsv/doi to get the tab-delimited dump.

I've created files listing DOIs and BioStor ids for publications in the Australian Faunal Directory. I'll play with lists a bit more, especially as I would like to extract the mapping from the Australian Faunal Directory on CouchDB project and add it to the iTaxon project.

ZooBank on CouchDB: UUIDs, replication, and embedding the literature in taxonomic databases

Last December I released a web site called Australian Faunal Directory on CouchDB, which was part of my ongoing exploration of how to build a simple yet useful database of taxonomic names. In particular, I want to link names directly to the primary taxonomic literature. No longer is it adequate to simply list names, or to list names with mangled bibliographic details (I'm looking at you, Catalogue of Life). This is the 21st century, so I expect one click from name to literature, or at most two (via, say, a DOI). Nothing else will cut it.

The Australian Faunal Directory (AFD) was an eye-opener, as it was the first serious use I'd made of CouchDB (now CouchBase). I'd played with replicating and forking data before (see Replicating and forking data in 2010: Catalogue of Life and CouchDB), but the AFD project was bigger, and it also inspired me to use web hooks to make the database editable. Suddenly this stuff started to look easy: no schema, simple web services, and tiny amounts of code.

ZooBank
So then my attention turned to ZooBank, which is "the official registry of Zoological Nomenclature, according to the International Commission on Zoological Nomenclature (ICZN)." ZooBank was proposed by Polaszek et al. (2005) in a short piece in Nature ("A universal register for animal names", doi:10.1038/437477a). By providing a registry of names for animals, it ultimately aims to help avoid embarrassing situations such as the example I recount in my paper on BioStor (doi:10.1186/1471-2105-12-187): a recent paper in Nature published the name Leviathan for an extinct sperm whale with a giant bite (doi:10.1038/nature09067), only for the authors to have to publish an erratum with a new name (doi:10.1038/nature09381) when it was discovered that Leviathan had already been used for an extinct mammoth.

ZooBank is developed and run by Rich Pyle, and has some nice features, such as RDF export (via LSIDs), but like most taxonomic databases it doesn't link directly to the literature. Where are the DOIs? Where are links to BHL? Where is the ability to add these links? And why is it almost entirely about fish? (OK, I know the answer to that one).

CouchDB
But the thing which really got me thinking about using CouchDB to create a version of ZooBank was Rich Pyle's vision of having a distributed ZooBank, and his insistence on using ugly UUIDs in ZooBank identifiers (e.g., urn:lsid:zoobank.org:act:6BBEF50E-76B4-42EF-97B1-7029DBCD8257). Ugly as they are, Rich has always argued that UUIDs make distributed systems easy, because you don't need a centralised system to assign unique identifiers.

Anybody who has played with CouchDB will know that CouchDB uses UUIDs by default to create identifiers for database documents. It also excels at data synchronisation, and can run on platforms large and small (including mobile platforms such as Android and iOS). This means a database could be updated on an iPhone or iPad without an Internet connection, then synchronised with other databases. Indeed, I developed this CouchDB clone of ZooBank on my MacBook, then pointed it at CouchDB running on my server and within minutes had an exact copy of the database running on the server. This ease of replication, together with the joy of schema-less design, makes CouchDB seem an obvious fit for ZooBank.
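
To give a sense of how little is involved, replication is a single HTTP call: POST a JSON body like the one below to CouchDB's _replicate endpoint (the database names and server URL here are placeholders) and CouchDB copies the documents across, creating the target database if asked.

{
  "source": "zoobank",
  "target": "http://example.org:5984/zoobank",
  "create_target": true
}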

Demo
You can see the ZooBank on CouchDB demo here. It's not a complete copy of ZooBank, but has most of it. I reuse the UUIDs issued by ZooBank, so that

http://zoobank.org:80/?uuid=6bbef50e-76b4-42ef-97b1-7029dbcd8257

becomes

http://iphylo.org/~rpage/zoobank/6bbef50e-76b4-42ef-97b1-7029dbcd8257

As usual it's all a bit crude, but it has some nice features, such as links to BHL content with a built-in article viewer I wrote for the AFD project.

What's next?
At present only a fraction of the ZooBank references have external links; I hope to add more in the next few days, using both automatic scripts and the web hook interface. The search interface needs work, and given that ZooBank is about nomenclature rather than taxonomy, it might be useful to add a classification (say, from the Catalogue of Life) so that users can navigate around the names (and get a sense of how many are *cough* fish).

At present to display a reference I do one of four things:
  1. If the reference is in BHL I use my article viewer
  2. If there is a freely available PDF online I display that using the Google Docs PDF viewer
  3. If 1 and 2 don't apply, but there is a DOI, then I resolve the DOI and display the result in an IFRAME (yuck)
  4. If none of 1-3 apply I display a blank rectangle

There are a couple of ways we could improve this. The first is to enhance the display of BHL content by making use of the structure of the source DjVu files. Another is to make use of the XML now being made available by the journal ZooKeys (see my blog post, and Pensoft's announcement that ZooKeys is now being archived by PubMed Central, complete with taxonomic markup). There are a lot of ZooKeys articles in ZooBank, so there's a lot of potential for embedding an article viewer that takes ZooKeys XML and redisplays it with taxonomic names and references as clickable links that lead to other ZooBank content. That way we approach the point where the taxonomic literature becomes a first-class citizen of a taxonomic database.

Linking taxonomic databases to the primary literature: BHL and the Australian Faunal Directory

Continuing my hobby horse of linking taxonomic databases to digitised literature, I've been working for the last couple of weeks on linking names in the Australian Faunal Directory (AFD) to articles in the Biodiversity Heritage Library (BHL). AFD is a list of all animals known to occur in Australia, and it provides much of the data for the recently released Atlas of Living Australia. The data is available as a series of CSV files, and these contain quite detailed bibliographic references. My initial interest was in using these to populate BioStor with articles, but it seemed worthwhile to try and link the names and articles together. The Atlas of Living Australia links to BHL, but only via a name search showing BHL items that contain a name string. This wastes valuable information: AFD has citations to the individual books and articles that relate to the taxonomy of Australian animals, and we should treat that as first-class data.

So, I cobbled together the CSV files, wrote some scripts to extract the references, ran them through the BioStor and bioGUID OpenURL resolvers, and dumped the whole thing in a CouchDB database. You can see the results at Australian Faunal Directory on CouchDB.



The site is modelled on my earlier experiment with putting the Catalogue of Life on CouchDB. It's still rather crude, and there's a lot of stuff I need to work on, but it should illustrate the basic idea. You can browse the taxonomic hierarchy, view alternative names for each taxon, and see a list of publications related to those names. If a publication has been found in BioStor then the site displays a thumbnail of the first page, and if you click on the reference you see a simple article viewer I wrote in Javascript.



For PDFs I'm experimenting with using Google's PDF viewer (the inspiration for the viewer above).




How it was made
Although in principle linking AFD to BHL via BioStor was fairly straightforward, there are lots of little wrinkles, such as errors in bibliographic metadata, and failure to parse some reference strings. To help address this I created a public group on Mendeley where all the references I've extracted are stored. This makes it easy to correct errors, add identifiers such as DOIs and ISSNs, and upload PDFs. For each article, a reference to the original record in AFD is maintained by storing the AFD identifier (a UUID) as a keyword.

The taxonomy and the mapping to literature is stored in a CouchDB database, which makes a lot of things (such as uploading new versions of documents) a breeze.

It's about the links
The underlying motivation is that we are awash in biodiversity data and digitisation projects, but these are rarely linked together. And it's more than just linking; it's bringing the data together so that we can compute over it. That's when things will start to get interesting.

CouchDB and Lucene

Quick notes to self on fulltext search and CouchDB (note that the links to CouchDB below are local to my machine, and won't work unless you are me, or have a copy of the same database running on your machine). couchdb-lucene adds fulltext indexing to CouchDB. After a few false starts I now have this working. The documentation is a little misleading: you don't need to clone the github repository, nor use Maven to build couchdb-lucene (at least, I didn't). Instead I grabbed couchdb-lucene-0.5.6, unpacked it, and used that as is.

To configure CouchDB I ended up editing the configuration using Futon (there's a link "Add a new section" down the bottom of the Configuration page), then I restarted CouchDB. The things to add are:


[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.
[external]
fti=/path/to/python /path/to/couchdb-lucene/tools/couchdb-external-hook.py
[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}


To start couchdb-lucene, just cd couchdb-lucene-0.5.6 and bin/run.

Then it's a case of adding a fulltext index. In Futon I start adding a regular design document, then edit the Javascript. For example, here is a simple index on document titles:


{
  "_id": "_design/lucene",
  "_rev": "2-96b333dfc77866a13c0de7f856d27b6c",
  "language": "javascript",
  "fulltext": {
    "by_title": {
      "index": "function(doc) {
        var ret = new Document();
        ret.add(doc.title);
        return ret;
      }"
    }
  }
}


Once the indexing has been completed, you can search the CouchDB database using a URL like this: http://localhost:5984/col2010ref/_fti/_design/lucene/by_title?q=frog+new+species.
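
If you want to index more than one field, couchdb-lucene lets you name a field when adding it. Here is a hedged sketch of a second index, by_field (the title and author fields are assumptions about what the documents contain, not something couchdb-lucene requires); you could then query it with something like q=title:frog.

"by_field": {
  "index": "function(doc) {
    var ret = new Document();
    if (doc.title) {
      ret.add(doc.title, {'field': 'title'});
    }
    if (doc.author) {
      ret.add(doc.author, {'field': 'author'});
    }
    return ret;
  }"
}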

Lots more to do here, but with spatial queries and now fulltext search, it's time to start building something...

Replicating and forking data in 2010: Catalogue of Life and CouchDB

Time (just) for a Friday folly. A couple of days ago the latest edition of the Catalogue of Life (CoL) arrived in my mailbox in the form of a DVD and booklet.

While in some ways it's wonderful that the Catalogue of Life provides a complete data dump of its contents, this strikes me as a rather old-fashioned way to distribute it. So I began to wonder how this could be done differently, and started to think of CouchDB. In particular, I began to think of being able to upload the data to a service (such as Cloudant) where it could be stored and replicated at scale. Then I began to think about forking the data. The Catalogue of Life has some good things going for it (some 1.25 million species, and around 2 million names), and is widely used as the backbone of sites such as EOL, GBIF, and iNaturalist.org, but parts of it are broken. Literature citations are often incomplete or mangled, and in places it is horribly out of date.

Rather than wait for the Catalogue of Life to fix this, what if we could share the data, annotate it, correct mistakes, and add links? In particular, what if we linked the literature to records in the Biodiversity Heritage Library, so that we can finally start to connect names to the primary literature (imagine clicking on a name and being able to see the original species description)? We could have something akin to GitHub, but instead of downloading and forking code, we download and fork data. CouchDB makes replicating data pretty straightforward.

So, I've started to upload some Catalogue of Life records to a CouchDB instance at Cloudant, and to write a simple web site to display these records. For example, you can see the record at http://iphylo.org/~rpage/col/?id=e9fda47629c1102b9a4a00304854f820.

The e9fda47629c1102b9a4a00304854f820 in this URL is the UUID of the record in CouchDB, which is also the UUID embedded in the (non-functional) CoL LSIDs. This ensures the records have a unique identifier, but also one that is related to the original record. You can search for names, or browse the immediate hierarchy around a name. I hope to add more records over time; at the moment I've added a few lizards, wasps, and conifers while I work out how to convert the CoL records into a sensible JSON object to upload to CouchDB.
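
For what it's worth, the kind of JSON object I have in mind looks roughly like the sketch below. The field names are mine, chosen for illustration, and are not prescribed by the CoL data dump.

{
  "_id": "<UUID taken from the CoL LSID>",
  "nameComplete": "<scientific name, with author and year>",
  "rank": "species",
  "status": "accepted name",
  "parent": "<UUID of the parent taxon>",
  "references": [
    { "author": "<author>", "year": "<year>", "title": "<title>" }
  ]
}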

The next step is to think about this as a way to distribute data (want a copy of CoL? just point your CouchDB at the Cloudant URL and replicate it), and to think about how to build upon the basic records, editing and improving them, and then how to get that information into a future version of the Catalogue.

GeoCouch

@mikeal a little tedious. you can take OSM and then convert it to SHP and then http://github.com/maxogden/shp2geocouch



The tweet above inspired me to take a quick look at GeoCouch, a version of CouchDB that supports spatial queries. This is something I need if I'm going to start playing seriously with CouchDB. So, it was off to Installing and working with GeoCouch, grabbing a copy of Homebrew (yet another package manager for Mac OS X), in the hope of installing GeoCouch. Things went fairly smoothly, although it took what seemed like an age to build everything, but I now have GeoCouch running. Previously I'd been running CouchDB using CouchDBX (http://janl.github.com/couchdbx/), which launches vanilla CouchDB; however, if you launch CouchDBX after starting GeoCouch from the command line, CouchDBX talks to GeoCouch.

I then grabbed shp2geocouch to try some shapefiles (I got a few from the IUCN to play with). If you're on a Mac, grab GISLook to get Quick Look previews of these files. Since I'm new to Ruby there were a couple of gotchas, such as lacking some prerequisites (httparty and couchrest, both installed by typing gem install <name of package>), and the small matter of needing to add ~/.gem/ruby/1.8/bin to my path so I could find shp2geocouch (spot the Ruby neophyte). The shapefile didn't get processed completely, but at least I managed to get some data into GeoCouch.

So far I've been playing with the examples at http://github.com/vmx/couchdb, and things seem to work. At least, the basic bounding box queries work. I'm tempted to play with this some more (and get my head around GeoJSON), perhaps trying to recreate the functionality of my Elsevier Challenge entry, for which I wrote a custom key-value database that was awfully clunky.
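
For the record, a GeoCouch spatial view looks much like an ordinary view, except that it emits a GeoJSON geometry as the key. Here is a minimal sketch (it assumes documents with a loc field holding [longitude, latitude]; the design document and view names are mine):

{
  "_id": "_design/main",
  "spatial": {
    "points": "function(doc) {
      if (doc.loc) {
        emit({ type: 'Point', coordinates: doc.loc }, doc._id);
      }
    }"
  }
}

A bounding box query then looks something like http://localhost:5984/mydb/_design/main/_spatial/points?bbox=112,-44,154,-10, where the database name is made up and the bbox is west,south,east,north in decimal degrees.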

CouchDB, Mendeley, and what I really want in an iPad article viewer

Playing with @couchdb, starting to think of the Mendeley API as a read/write JSON store, and having a reader app built on that...



It's slowly dawning on me that many of the ingredients for a different way to browse scientific articles may already be in place. After my first crude efforts at what an iPad reader might look like, I've started afresh with a new attempt based on the Sencha Touch framework. The goal here isn't to make a polished app, but rather to get a sense of what could be done.

The first goal is to be able to browse the literature as if it were a connected series of documents (which, of course, is what it is). This requires taking the full text of an article, extracting the citations, and making them links to further documents (also with their citations extracted, and so on). Leaving aside the obvious problem that this approach is limited to open access articles, an app that does this is going to have to store a lot of bibliographic data as the reader browses the literature (otherwise we're going to have to do all the processing on the fly, and that won't be fast enough). So, we need some storage.

MySQL
One option is to build a MySQL database to hold articles, books, etc. Doable (I've done more of these than I care to remember), but things get messy pretty quickly, especially as you add functionality (tags, fulltext, figures, etc.).

RDF
Another option is to use RDF and a triple store. I've played with linked data quite a bit lately (see previous "Friday follies" here and here), and I thought that a triple store would be a great way to support an article browser (especially as we add additional kinds of data, such as sequences, specimens, phylogenies, etc.). But linked data is a mess. For the things I care about there are either no canonical identifiers, or too many, and rarely does the primary data provider serve linked data compliant URLs (e.g., NCBI), hence we end up with a plethora of wrappers around these sources. Then there's the issue of what vocabularies to use (once again, there are either none, or too many). As a query language SPARQL isn't great, and don't even get me started on the issue of editing data. OK, so I get the whole idea of linked data, it's just that the overhead of getting anything done seems too high. You've got to get a lot of ducks to line up.

So, I started playing with CouchDB, in a fairly idle way. I'd had a look before, but didn't really get my head around the very different way of querying a database that CouchDB requires. Despite this learning curve, CouchDB has some great features. It stores documents as JSON, which makes it trivial to add data as objects (instead of mucking around with breaking them up into tables for SQL, or atomising them into triples for RDF), it supports versioning right out of the box (vital, because metadata is often wrong and needs to be tidied up), and you talk to it using HTTP, which means there's no middleware to get in the way. You just point your browser (or curl, or whatever HTTP tool you have) at it and send GET, POST, PUT, or DELETE requests. And now it's in the cloud.
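
To make the "versioning out of the box" point concrete, here is a small, hedged sketch of that conversation (the database name, document id, field names, and revision strings are all made up): creating a document is one PUT, and updating it is another PUT that quotes the revision returned by the first.

PUT http://localhost:5984/articles/some-uuid
{"title": "An example article", "year": 2010}
returns {"ok": true, "id": "some-uuid", "rev": "1-..."}

PUT http://localhost:5984/articles/some-uuid
{"_rev": "1-...", "title": "An example article", "year": 2010, "doi": "10.1234/example"}
returns {"ok": true, "id": "some-uuid", "rev": "2-..."}

If the _rev you supply isn't the current revision, CouchDB rejects the update with a conflict, which is what makes casual tidying up of metadata safe.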

In some ways ending up with CouchDB (or something similar) seems inevitable. The one "semantic web" tool that I've made most use of is Semantic MediaWiki, which powers the NCBI to Wikipedia mapping I created in June. Semantic MediaWiki has its uses, but occasionally it has driven me to distraction. But, when you get down to it, Semantic MediaWiki is really just a versioned document store (where the documents are typically key-value pairs), over which has been laid a pretty limited query language and some RDF export features. Put like this, most of the huge MediaWiki engine underlying Semantic MediaWiki isn't needed, so why not cut to the chase and use a purpose-built versioned document store? Enter CouchDB.

Browsing and Mendeley
So, what I have in mind is a browser that crawls a document, extracting citations, and enabling the reader to explore those. Eventually it will also extract all the other chocolatey goodness in an article (sequences, specimens, taxonomic names, etc.), but for now I'm focussing on articles and citations. A browser would need to store article metadata (say, each time it encounters an article for the first time), as well as update existing metadata (by adding missing DOIs, PubMed ids, citations, etc.), so what easier way than as JSON in a document store such as CouchDB? This is what I'm exploring at the moment, but let's take a step back for a second.

The Mendeley API, as poorly developed as it is, could be treated as essentially a wrapper around a JSON document store (the API stores and returns JSON), and it speaks HTTP. So, we could imagine a browser that crawls the Mendeley database, adding papers that aren't in Mendeley as it goes. The act of browsing and reading would actively contribute to the database. Of course, we could spin this around, and argue that a crawler + CouchDB could pretty effectively create a clone of Mendeley's database (albeit without the social networking features that come with having a large user community).

This is another reason why the current crop of iPad article viewers, Mendeley's included, are so disappointing. There's the potential to completely change the way we interact with the scientific literature (instead of passively consuming PDFs), and Mendeley is ideally positioned to support this. Yes, I realise that for the vast majority of people being able to manage their PDFs and format bibliographies in MS Word are the killer features but, seriously, is that all we aspire to?