Accounting Careers

BioNames update - matching taxon names to classifications

2013-03-20T06:57:00.000-07:00

One of the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and taxon authority. In many cases this is straightforward, especially if there is only one taxon with a name. But often we have cases where the same name has been used more than once for different taxa. For example, here is what ION has for the name "Nystactes":

Nystactes Bohlke	2735131
Nystactes	2787598
Nystactes Gloger 1827	4888093
Nystactes Kaup 1829	4888094

If I want to map these names to GBIF then these are the corresponding taxa with the name "Nystactes":

Nystactes Böhlke, 1957	2403398
Nystactes Gloger, 1827	2475109
Nystactes Kaup, 1829	3239722

Clearly the names are almost identical, but there are enough little differences (presence or absence of comma, "o" versus "ö") to make things interesting. To make the mapping I construct a bipartite graph where the nodes are taxon names, divided into two sets based on which database they came from. I then connect the nodes of the graph by edges, weighted by how similar the names are. For example, here is the graph for "Nystactes" (displayed using Google images):

I then compute the maximum weighted bipartite matching using a C++ program I wrote. This matching corresponds to the solid lines in the graph above.

In this way we can make a sensible guess as to how names in the two databases relate to one another.
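As a rough illustration of the idea (a Python sketch, not the C++ program mentioned above), the matching for the "Nystactes" names could look something like this. The similarity measure (`difflib.SequenceMatcher` on normalised strings), the 0.8 threshold, and the exhaustive search over permutations are all assumptions made for the sake of a small example; a real implementation would use a proper maximum weighted matching algorithm.

```python
import itertools
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, strip accents and punctuation so 'Böhlke, 1957' becomes 'bohlke 1957'."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def similarity(a, b):
    """Edge weight between two name strings, in the range 0..1."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_matching(left, right, threshold=0.8):
    """Exhaustive maximum weighted bipartite matching (only sensible for a handful of names)."""
    # Pad the shorter side with None so that a name can also stay unmatched.
    size = max(len(left), len(right))
    padded_left = list(left) + [None] * (size - len(left))
    padded_right = list(right) + [None] * (size - len(right))
    best_score, best_pairs = -1.0, []
    for perm in itertools.permutations(padded_right):
        pairs = [(l, r, similarity(l, r))
                 for l, r in zip(padded_left, perm)
                 if l is not None and r is not None]
        # Edges below the (assumed) threshold are treated as absent.
        pairs = [p for p in pairs if p[2] >= threshold]
        score = sum(weight for _, _, weight in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


ion = ["Nystactes Bohlke", "Nystactes", "Nystactes Gloger 1827", "Nystactes Kaup 1829"]
gbif = ["Nystactes Böhlke, 1957", "Nystactes Gloger, 1827", "Nystactes Kaup, 1829"]

matching = best_matching(ion, gbif)
for ion_name, gbif_name, weight in matching:
    print(f"{ion_name} <-> {gbif_name} ({weight:.2f})")
```

With these settings the bare name "Nystactes" (no authority) is left unmatched, which seems the safest outcome when there is nothing to say which of the three taxa it refers to.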

BioNames update - API documentation

2013-03-19T01:39:00.000-07:00

One of the fun things about developing web sites is learning new tricks, tools, and techniques. Typically I hack away on my MacBook, and when something seems vaguely usable I stick it on a web server. For BioNames things need to be a little more formalised, especially as I'm collaborating with another developer (Ryan Schenk). Ryan is focussing on the front end, I'm working on the data (harvesting, cleaning, storing).

In most projects I've worked on, the code to talk to the database and the code to display results have been the same; it was ugly but it got things done. For this project these two aspects have to be much more cleanly separated so that Ryan and I can work independently. One way to do this is to have a well-defined API that Ryan can develop against. This means I can hide the sometimes messy details of how to communicate with the data, and Ryan doesn't need to worry about how to get access to the data.

Nice idea, but to be workable it requires that the API is documented (if it's just me then the documentation is in my head). Documentation is a pain, and it is easy for it to get out of sync with the code such that what the docs say an API does and what it actually does are two separate things (sound familiar?). What would be great is a tool that enables you to write the API documentation and make it "live", so that the API's actual output can be tested against what the documentation says. In other words, a tool like apiary.io.

Apiary.io is free, very slick, and comes with GitHub integration. I've started to document the BioNames API at http://docs.bionames.apiary.io/. These documents are "live" in that you can try out the API and get live results from the BioNames database.

I'm sure this is all old news to real software developers (as opposed to people like me who know just enough to get themselves into trouble), but it's quite liberating to start with the API first before worrying about what the web site will look like.

New look Biodiversity Heritage Library launched

2013-03-18T02:50:00.000-07:00

Tomorrow the new & improved #bhlib launches!! ow.ly/iVeZb Explore the changes in our Guide! ow.ly/iVf1W
— BHL (@BioDivLibrary) March 17, 2013

The new look Biodiversity Heritage Library has just launched. It's a complete refresh of the old site, based on the Biodiversity Heritage Library–Australia site. If you want an overview of what's new, BHL have published a guide to the new look site. Congrats to all involved in the relaunch.

One of the new features draws on the work I've been doing on BioStor. The new BHL interface adds the notion of "parts" of an item, which you can see under the "Table of Contents" tab. For example, the scanned volume 109 of the Proceedings of the Entomological Society of Washington now displays a list of articles within that volume:

This means you can now jump to individual articles. Before you had to scroll through the scan, or click through page numbers until you found what you were after. The screenshot above shows the article "Three new species of chewing lice (Phthiraptera: Ischnocera: Philopteridae) from australian parrots (Psittaciformes: Psittacidae)". The details of this article have been extracted from BioStor, where this article appears as http://biostor.org/reference/55323. You can go directly to this article in BHL using the link http://www.biodiversitylibrary.org/part/69723. As an aside, I've chosen this article because it helps demonstrate that BHL has modern content as well as pre-1923 literature, and this article names a louse, Neopsittaconirmus vincesmithi after a former student of mine, Vince Smith. You're nobody in this field unless you've had a louse named after you ;)

BioStor has over 90,000 articles, but this is a tiny fraction of the articles contained in BHL content, so there's a long way to go until the entire archive is indexed to article level. There will also be errors in the article metadata derived from BioStor. If we invoke Linus's Law ("given enough eyeballs, all bugs are shallow") then having this content in BHL should help expose those errors more rapidly.

As always, I have a few niggles about the site, but I'll save those for another time. For now, I'm happy to celebrate an extraordinary, open access archive of over 40 million pages. BHL represents one of the few truly indispensable biodiversity resources online.

BioNames ideas - automatically finding synonyms from the literature

2013-03-15T11:21:00.000-07:00

One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence of this is any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.

Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms.

Reading literature extracted from the Biodiversity Heritage Library I'm struck by how often I come across papers such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms?

One quick and dirty way to do this is look for sets of names that have the same species name but different generic names, e.g.

Atlantoxerus getulus
Sciurus getulus
Xerus getulus

If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.
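A minimal sketch of this kind of index, assuming we already have the list of names found on a page (how those names are found is outside the scope of the sketch). The function name and the extra name "Sciurus vulgaris" are mine, added only to show a name that does not end up in a group.

```python
from collections import defaultdict


def candidate_synonyms(names_on_page):
    """Group binomials on one page by species name; keep groups spanning more than one genus."""
    by_species_name = defaultdict(set)
    for name in names_on_page:
        parts = name.split()
        if len(parts) != 2:
            continue  # only plain "Genus species" binomials are handled here
        genus, species = parts
        by_species_name[species.lower()].add(f"{genus} {species}")
    return {
        species: sorted(names)
        for species, names in by_species_name.items()
        if len({n.split()[0] for n in names}) > 1
    }


page = ["Atlantoxerus getulus", "Sciurus getulus", "Xerus getulus", "Sciurus vulgaris"]
groups = candidate_synonyms(page)
print(groups)
```

Each group is only a set of possible synonyms: sharing a species name on the same page is suggestive, not proof.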

There's a lot more that could be done with this sort of approach. For example, this approach only works if the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species).
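One crude way to cope with gender endings is to compare species names on a "stem" with common Latin adjectival endings removed. The list of endings below is an assumption and far from complete, and stripping them will sometimes merge names that are genuinely different, so this is a sketch of the idea rather than a recipe. The example pair (Nyctea scandiaca, now usually Bubo scandiacus) is not from the names discussed above; it is simply a familiar case where the ending changed with the genus.

```python
# Endings that commonly change with the gender of the genus (assumed, incomplete list).
GENDER_ENDINGS = ("us", "um", "a", "is", "e")


def species_name_key(species_name):
    """Return a crude stem so that e.g. 'scandiaca' and 'scandiacus' compare equal."""
    species_name = species_name.lower()
    for ending in GENDER_ENDINGS:
        # Require a few characters of stem so very short names are left alone.
        if species_name.endswith(ending) and len(species_name) > len(ending) + 2:
            return species_name[: -len(ending)]
    return species_name


print(species_name_key("scandiaca"), species_name_key("scandiacus"))
```

Using this key instead of the raw species name when grouping names on a page would catch this kind of change, at the cost of more false positives.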

If we were even cleverer we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump; PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that list taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature. Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.
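Counting co-occurrences is simple enough to sketch. The version below assumes each page has already been reduced to a list of names, and it uses a blunt page-size cut-off (an arbitrary 20 names) as a stand-in for filtering out indexes and tables of contents. The names in the example are invented placeholders, not real taxa.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(pages, max_names_per_page=20):
    """Count how many pages each pair of names shares."""
    counts = Counter()
    for names in pages:
        unique = sorted(set(names))
        if len(unique) > max_names_per_page:
            continue  # crude guard: very name-dense pages are probably indexes
        counts.update(combinations(unique, 2))
    return counts


# Placeholder names, made up for illustration only.
pages = [
    ["Hostus exemplum", "Parasitus exemplum"],
    ["Parasitus exemplum", "Hostus exemplum", "Aliud nomen"],
    ["Aliud nomen"],
]
counts = cooccurrence_counts(pages)
print(counts.most_common(1))
```

Pairs that turn up together on many pages, but belong to unrelated parts of the classification, would be the interesting candidates for host-parasite or similar associations.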

Figuring Out an Accounting Career

2013-03-15T08:21:00.001-07:00

What You Can Count On: Job Security

For the 2007 fiscal year, Microsoft reported an annual revenue of $51.2 billion. Behind any company's revenue numbers--big or small--are accountants and financial managers who balance the books. In 2002, the Sarbanes-Oxley Act added further scrutiny to corporate procedures. Between government regulations and the thousands of companies that need to manage finances, the immediate benefit of a career in accounting is a reasonable amount of job security. Additionally, the Bureau of Labor Statistics (BLS) predicts strong growth for accountants and auditors through 2016.

What You Can Take to the Bank: Strong Earnings

Another benefit for an accountant is that the median annual salary for accounting, tax preparation, bookkeeping, and payroll services is $57,020. Going further into the financial services industry, you could become a financial manager for a major corporation and earn in the neighborhood of $105,410 a year according to the BLS. You can also work your way up the corporate ladder to financial director, corporate controller, or even chief financial officer (CFO).

What Education You Need: Accounting Degree and Certification

A college degree and certification are almost essential for advancement and a long-term career in accounting. A bachelor's degree in accounting or a finance-related topic is a solid start, and earning a Certified Public Accountant (CPA) credential furthers your employability prospects. You can even take it a step further by earning a specialized certification such as a Certified Management Accountant (CMA), Certified Internal Auditor (CIA), Accredited Tax Advisor (ATA), or other credential. The American Institute of Certified Public Accountants reports on a survey finding that candidates with a professional certification can earn 10% more than other accountants. A graduate degree can also help you stand out from the crowd.

Questions and Answers About Starting an Accounting Career

2013-03-15T08:17:00.002-07:00

An accountant plays a very important role in the functioning and efficiency of a corporation. They provide a number of vital business services to clients including the management of financial matters, auditing, and handling tax issues. However, the specific duties performed in an accounting career will differ depending on what field the practitioner works in, be it public accounting, management accounting, government accounting, or internal auditing.

Accountants will generally use computers and special accounting programs to assist them in their duties. With these programs, accountants can summarize and organize data in particular formats to make them more suitable for storage or analysis. The programs also take a lot of the tedious manual work out of the job. For this reason, accountants will generally have a very high level of competence with computers, and many employers will require them to be proficient in these programs to help keep their work accurate.

The environment in which an accountant works will generally vary depending on what field of accounting he/she is in as well as what type of company or organization he/she works for. The vast majority of accountants work in an office setting, often with many other coworkers and colleagues, although some accountants are self-employed and may be able to do part of their job at home as well. Most accountants work a standard 40-hour week, though there are exceptions, especially in the case of tax specialists and self-employed accountants, who may work longer hours during certain times of the year.

Public accounting firms often send their accountants to their clients' place of work or residence to perform audits. In this scenario, there can also be a lot of traveling involved. Accountants who travel often will most likely use a laptop to allow for the increased mobility of their accounting programs, data, and other information needed on the job.

Accountants, regardless of their chosen field, require a proficiency in mathematics as well as business. Many accountants are unlicensed, especially in the fields of government accounting, management accounting, and internal auditing. A bachelor's degree in accounting or a related field is required to become licensed as a Certified Public Accountant (CPA), Public Accountant (PA), Registered Public Accountant (RPA), or Accounting Practitioner (AP). Some companies will require their accountants to hold master's degrees as well.

There is a large demand for accountants, and as more businesses are created in the coming years, the demand is expected to increase. The rapid expansion of business is also expected to have a large effect on the types of responsibilities accountants will have. Nevertheless, these jobs can be very competitive, and many businesses are increasing their standards by which they hire and the qualifications they demand.

Accountants who have a strong knowledge of computers and of many different accounting packages will have a better chance of employment. Those who have more education, training, and experience will also have an edge in the job market. It is also important for accountants to demonstrate interpersonal skills, as these will help them perform their job more effectively and get along better with clients.

How to Start Your Accounting Career

2013-03-15T08:14:00.001-07:00

You want to be an accountant. You love numbers, maths and money. So, how do you get started? Where do you go to get certified so that your services will be in demand? If you do not have any recognised qualifications, your clients will not be able to know if your standards meet their requirements. People hire Chartered Certified Accountants with a full practising certificate because they know that they can trust in their expertise.

Any old accountancy certification will not do. You need an internationally recognised global qualification to compete in today's industry, and the ACCA qualification fits this demand perfectly. The Association of Chartered Certified Accountants (ACCA) is the world's largest international accountancy body, with over 300,000 members and students in more than 160 countries. Founded in 1904, ACCA has over 100 years of history as a leader in the development of the global accountancy profession. The United Nations has chosen the ACCA syllabus as the basis of its global accountancy curriculum and the ACCA qualification is well recognised in an ever-growing list of countries including the USA, Canada, the United Kingdom, the European Union, Australia, New Zealand, South Africa, China, Singapore, Malaysia and Pakistan. Many ACCA graduates work in premier companies such as British Airways and Price Waterhouse Coopers.

The syllabus spans 16 topics each with its own examination to test your competency in that subject. It usually takes 2 years for a student to obtain the ACCA qualification. This is broken up into the "Fundamentals" stage that consists of 9 papers and the "Professional" stage that consists of 3 papers and a choice of 4 options. With so many examinations to pass, self-study can be difficult. The good news is that there are many professional accounting and finance schools such as FTMS Global that offer ACCA courses. The better ones have a cast of highly qualified and experienced lecturers who are ACCA-certified. These teachers know what the ACCA syllabus requires and can dramatically increase your chances of passing the examinations. Definitely, it is highly recommended to enlist the aid of a mentor who can show you the ropes.

After qualifying as a Chartered Certified Accountant by passing the examinations, to obtain the practising certificate you must have had sufficient experience in a practising accountant's office. On top of all that, you must continue to keep yourself updated by attending courses on a regular basis. The ACCA is the only accountancy body that provides a disciplinary system which offers remedy if any ACCA member breaches its high standards.

Once you obtain the ACCA certification, your clients know with certainty that they can depend on:

- Your integrity

- Your absolute respect for the confidentiality of your client's affairs

- Your knowledge and expertise

- The fact that there is a regulatory body who will ensure that standards are maintained

- The fact that you must operate within a strict framework of rules and ethics

The entry requirements of the ACCA qualification are 2 A-Level passes or a bachelor's degree from a recognised university. If you do not meet this requirement, you may opt to go for the open-entry route by taking the Certified Accounting Technician (CAT) qualification first. Upon completion of the CAT course, you may progress to take the ACCA qualification.

BioNames: yet another taxonomic database

2013-03-15T03:27:00.000-07:00

Yet another taxonomic database, this time I can't blame anyone else because I'm the one building it (with some help, as I'll explain below).

BioNames was my entry in EOL's Computable Data Challenge (you can see the proposal here: http://dx.doi.org/10.6084/m9.figshare.92091). In that proposal I outlined my goal:

BioNames aims to create a biodiversity “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon, and that information is seamlessly linked together in one place. It combines classifications from EOL with animal taxonomic names from ION, and bibliographic data from multiple sources including BHL, CrossRef, and Mendeley. The goal is to create a database where the user can drill down from a taxonomic name to see the original description, track the fate of that name through successive revisions, and see other related literature. Publications that are freely available will be displayed in situ. If the taxon has been sequenced, the user can see one or more phylogenetic trees for those sequences, where each sequence is in turn linked to the publication that made those sequences available. For a biologist the site provides a quick answer to the basic question “what is this taxon?”, coupled with graphical displays of the relevant bibliographic and genomic information.

The bulk of the funding from EOL is going into interface work by Ryan Schenk (@ryanschenk), author of synynyms among other cool things. EOL's Chief Scientist Cyndy Parr (@cydparr) is providing adult supervision ("Chief Scientist", why can't I have a title like that?).

Development of BioNames is taking place in the open as much as we can, so there are some places you can see things unfold:

Key features and milestones are on Trello
Design details are on GitHub
Database is hosted by Cloudant
There is a (currently private) design document in Google Docs. I've posted a snapshot on FigShare (http://dx.doi.org/10.6084/m9.figshare.652203).

I've lots of terrible code scattered around which I am in the process of organising into something usable, which I'll then post on GitHub. Working with Ryan is forcing me to be a lot more thoughtful about coding this project, which is a good thing. Currently I'm focussing on building an API that will support the kinds of things we want to do. I'm hoping to make this public shortly.

The original proposal was a tad ambitious (no, really). Most of what I hope to do exists in one form or another, but making it robust and usable is a whole other matter.

As the project takes shape I hope to post updates here. If you have any suggestions feel free to make them. The current target is to have this "out the door" by the end of May.

In defence of OpenURL: making bibliographic metadata hackable

2013-03-13T03:39:00.000-07:00

This is not a post I thought I'd write, because OpenURL is an awful spec. But last week I ended up in vigorous debate on Twitter after I posted what I thought was a casual remark:

If you publish bibliographic data and don't use COinS ocoins.info you are doing it wrong (I'm looking at you @europepmc_news)
— Roderic Page (@rdmpage) March 8, 2013

This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS).

This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users.

Ed wrote:

I prefer to encourage publishers to use HTML's metadata facilities using the <meta> tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done.

That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines:

Tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy).

But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helped me find those links.

If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do. Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself.
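To make this concrete, here is a small sketch of pulling COinS out of a page and turning them into OpenURL links. A COinS is a `<span>` with class `Z3988` whose `title` attribute holds the OpenURL key/value pairs. The resolver address is a placeholder (any OpenURL resolver could be substituted), and the sample HTML is hand-written for illustration using the details of the Huber and Klump paper.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the title attribute of every <span class="Z3988"> on a page."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attributes.get("title"):
            # HTMLParser has already turned &amp; back into & for us.
            self.context_objects.append(attributes["title"])


def openurl_links(html, resolver):
    """Build one OpenURL per COinS by appending its key/value pairs to the resolver."""
    parser = CoinsParser()
    parser.feed(html)
    return [f"{resolver}?{ctx}" for ctx in parser.context_objects]


# Hand-written sample page; the resolver below is a placeholder, not a real service.
html = (
    '<li>Huber, R., &amp; Klump, J. (2009). Charting taxonomic knowledge...'
    '<span class="Z3988" title="ctx_ver=Z39.88-2004'
    '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal'
    '&amp;rft.jtitle=Computers+%26+Geosciences'
    '&amp;rft.volume=35&amp;rft.issue=4&amp;rft.spage=862&amp;rft.date=2009"></span></li>'
)
links = openurl_links(html, "https://resolver.example.org/openurl")
fields = parse_qs(links[0].split("?", 1)[1])
print(links[0])
print(fields["rft.jtitle"], fields["rft.volume"], fields["rft.spage"])
```

The point is that the page itself need not know about the resolver: the same embedded metadata can be sent to whichever service the user cares about.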

This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI.

So, what I found frustrating about the conversation with Ed was that it seemed to me that his insistence on following certain standards was at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user and accept that users may want to do stuff that you don't (indeed can't) anticipate then COinS are useful. This is the "genius of and", why not support both approaches?

Now, COinS are not the only way to implement what I want to do, we could imagine other ways to do this. But to support the functionality that they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now and work. I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet:

If you publish bibliographic data and don't use COinS you are doing it wrong

Bibliographic metadata pollution

2013-03-13T03:03:00.000-07:00

I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:

If I search for Die cestoden der Vogel in Google I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, other booksellers, etc.). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself.

One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications).

As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But meantime if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff".

On Names Attribution, Rights, and Licensing of taxonomic names

2013-03-08T04:17:00.000-08:00

Few things have annoyed me as much as the following post on TAXACOM:

The Global Names project will host a workshop to explore options and to make recommendations as to issues that relate to Attribution, Rights and Licensing of names and compilations of names. The aim of the workshop is a report that clarifies if and how we share names.

We seek submissions from all interested parties - nomenclaturalists, taxonomists, aggregators, and users of names. Let us know what (you think) intellectual property rights apply or what rights should be associated with names and compilations of names. How can those who compile names get useful attribution for names, and what responsibilities do they have to ensure that information is authoritative. If there are rights, what kind of licensing is appropriate.

Contributions can be submitted http://names-attribution-rights-and-licensing.wikia.com/wiki/Main_Page, where you will find more information about this event.

I'm trying to work out why this seemingly innocuous post made me so mad. I think it is because it fundamentally frames the question the wrong way. Surely the goal is to have a list of names that is global in scope, well documented, and freely usable by all without restriction? Surely we want open and free access to fundamental biodiversity data? In which case, can we please stop having meetings and get on with making this so?

If you frame the discussion as one of "Attribution, Rights and Licensing of names and compilations of names" then you've already lost sight of the prize. You've focussed on the presumed "rights" of name compilers instead.

I would argue that names compilations are somewhat overvalued. They are basically lists of names, sometimes (all too rarely) with some degree of provenance (e.g., a citation to the original use of the name). As I've documented before (e.g., More fictional taxa and the myth of the expert taxonomic database and Fictional taxa) entirely fictional taxa can end up in taxonomic databases with alarming ease. So any claims that these are expert-curated lists should be taken with a pinch of salt.

Furthermore, it is increasingly easy to automate building these lists, given that we have tools for finding names in text, and an ever expanding volume of digitised text becoming available. Indeed, in an ideal world where all taxonomic literature was digitised much of the rationale for taxonomic name databases would disappear (in the same way that library card catalogues are irrelevant in the age of Google). We are fast approaching the point where we can do better than experts. To give just one example, in a recent BHL interview with Gary Poore it was stated that:

For example, the name widely used name Pentastomida itself was widely attributed to Diesing, 1836, but the word did not appear in the literature until 1905.

A quick check of Google Ngrams shows this to be simply false:

I don't need taxonomic expertise to see this, I simply need decent text indexing. So, if you have a list of names, you have something that it will soon be largely possible to recreate using automated methods (i.e., text mining). With a little sophistication we could mine the literature for further details, such as synonymy, etc. Annotation and clarification of a few "edge cases" where things get tricky will always be needed, but if you want to argue that your list deserves "Attribution, Rights and Licensing" then you fail to realise that your list is going to be increasingly easy to recreate simply by crawling the web.

It seems to me that most taxonomic databases are little more than digitised 5x3 index cards, and lack any details on the provenance of the names they contain. They often don't have links to the primary literature, and if they do cite that literature they typically do so in a way that makes it hard to find the actual publication. I once gave a talk which included the slide below showing taxonomic databases as being "in the way" between taxonomists and users of taxonomic information:

In the old days building taxonomic databases required expertise and access to obscure, hard to find, physical literature. A catalogue of names was a way to summarise that information (since we couldn't share access). Now we are in an age where more and more primary taxonomic information is available to all, which removes most of the rationale for taxonomic databases. Users can go directly to taxonomic information themselves, which means they can get the "good stuff", and maybe even cite it (giving us provenance and credit, which I regard as basically the same thing). In many ways taxonomic databases are transitional phenomena (like phone directories, remember those?), and one could argue are now in the way of the taxonomists' Holy Grail, getting their work cited.

Lastly, any discussion of "Attribution, Rights and Licensing of names and compilations of names" reflects one of the great self-inflicted wounds of biodiversity informatics, namely the reluctance to freely share data. As we speak terabytes of genomics data are whizzing around the planet, people are downloading entire copies of GenBank and creating new databases. All of this without people fussing over "Attribution, Rights and Licensing." It's time for taxonomic databases to get over themselves and focus on making biodiversity data as accessible and available as genomics data.

Why the ICZN is in trouble

2013-03-01T08:00:00.000-08:00

There are many reasons why the International Commission on Zoological Nomenclature (ICZN) is in trouble, but fundamentally I think it's because of the situation illustrated by the following diagram.

Based on an analysis of the Index of Organism Names (ION) database that I'm currently working on, there are around 3.8 million animal names (I define "animal" loosely, the ICZN covers a number of eukaryote groups), of which around 1.5 million are "original combinations", that is, the name as originally published. The other 2 million plus names are synonyms, spelling variations, etc.

Of these 3.8 million names the ICZN itself can say very little. It has placed some 12,600 names (around 0.3% of the total) on its Official Lists and Indexes (which is where it records decisions on nomenclature), and its new register of names, ZooBank, has less than 100,000 names (i.e., less than 3% of all animal names).

The ICZN doesn't have a comprehensive database of animal names, so it can't answer the most basic questions one might have about names (e.g., "is this a name?", "can I use this name, or has somebody already used it?", "what other names have people used for this taxon?", "where was this name originally published?", "can I see the original description?", "who first said these two names are synonyms?", and so on). The ICZN has no answer to these questions. In the absence of these services, it is reduced to making decisions about a tiny fraction of the names that are in use (and there is no database of these decisions). It is no wonder that it is in such trouble.

The end of names? ICZN in financial crisis

2013-02-22T04:20:00.000-08:00

Image by Mr.checker from Wikimedia Commons

Science carries a news piece on the perilous state of the International Commission on Zoological Nomenclature (on Twitter as @ZooNom):

Pennisi, E. (2013). International Arbiter of Animal Names Faces Financial Woes. Science, 339(6122), 897–897. doi:10.1126/science.339.6122.897 (paywall)

Elizabeth Pennisi's article states:

A rose by any other name might still smell as sweet, but an animal with two scientific monikers can wreak havoc for researchers trying to study it. Since 1895, the International Commission on Zoological Nomenclature (ICZN) has helped ensure animal names are unique and long-lasting, with a panel of volunteer commissioners who maintain naming rules and resolve conflicts when they arise. But the U.K.-based charitable trust that supports all this is slated to run out of money before the year's end—and that could spell trouble. "If the trust ceases to exist it will be very difficult for the commissioners to do their work," says Michael Dixon, chair of the trust's board and director of the Natural History Museum in London. If ICZN disappeared "it would be something akin to anarchy in animal naming."

The sums of money are not huge:

The nonprofit organization that formed in 1947 to raise funds and administer the ICZN code and the journal—the International Trust for Zoological Nomenclature—has weathered other crises. But net income from its journal is only about $47,000 a year, and the trust's annual expenses now top $155,000. So reserves are about to be exhausted, Dixon says.

A few weeks ago, he sent an e-mail plea to directors of natural history museums around the world for emergency relief. In it, he proposed establishing a committee that would come up with a new financial model for the troubled organization. "This is not unlike GenBank," the database of genome sequences that receives government support, Coddington says. "It's the same distributed goods [situation], that everyone needs and nobody wants to pay for."

...

Dixon estimates the trust needs $78,000 or more to make it through the year. No single organization may be able to fund it long-term, but a network of 10 or 20 institutions might be able to kick in enough to sustain it, he says.

Maybe it's time for the ICZN to start a Jimmy Wales-style appeal, or take taxonomy to KickStarter.

Why are botanists locking away their data in JSTOR Plant Science?

2013-02-21T03:54:00.000-08:00

Somehow I get the feeling that botanists haven't got the "open data" religion. Not only is the list of plant names behind a really bad license, but the Global Plants Initiative (GPI) hides its type images behind a JSTOR Plant Sciences paywall. Why is botany determined to keep its data under wraps?

For example, the first specimen on the JSTOR site is GOET008353, the isotype of Aa achalensis Schltr. You can see a thumbnail of the specimen (shown on the right), but if you want the full image you need to have a subscription, otherwise you see this message:

The resource you are attempting to access is part of JSTOR Plant Science. JSTOR Plant Science is currently being offered free of charge for all JSTOR participants and not for profit institutions. To learn more about JSTOR Plant Science, please contact plants@jstor.org.

So, without a subscription you don't get to see this in high resolution (the JSTOR site features a higher resolution image and associated viewer):

Why would herbaria hand over this imagery? I complained about this on Facebook and Chuck Miller responded that the original herbaria retain control over the images, so they aren't locked away. However, when I went to the herbarium that has this specimen (the Type Database of Herbarium Göttingen (GOET)) and searched for it, I eventually found it listed as 4966. There is no image!

So, the only place I can see this image is on JSTOR, for which I need a subscription. I'm also puzzled by the fact that JSTOR refers to this as "GOET008353", whereas the original herbarium refers to it as "4966". GBIF also has this specimen, which it refers to as GOET GOET-Typen 4966. The GOET008353 is a barcode given to types as part of the GPI digitisation programme. Unfortunately, neither the originating herbarium nor GBIF seems to know about this.

In summary, we have three databases with data on this specimen, each with a different specimen identifier, none of which link to each other, and the available imagery is behind a paywall.

Clearly botany hasn't gotten the memo about open data...

Rate of description of new animal species and that Taxatoy graph

2013-02-14T10:32:00.000-08:00

As part of the discussion on whether legacy biodiversity literature matters, a graph from the following paper came up:

Sarkar, I., Schenk, R., & Norton, C. N. (2008). Exploring historical trends using taxonomic name metadata. BMC Evolutionary Biology, 8(1), 144. doi:10.1186/1471-2148-8-144

.@rmounce @caseybergman Sarkar et al. graph bogus dx.doi.org/10.1186/1471-2… see organismnames.com/metrics.htm?pa… Also we need to define "legacy"
— Roderic Page (@rdmpage) February 14, 2013

So, why is the Sarkar et al. graph bogus? Here is their graph (Fig. 3) for animals:

This is the number of new animal species described each year, estimated by parsing taxonomic names and extracting the date in the taxonomic authority. There are two prominent "spikes" which are worrying. Sarkar et al. discuss the peak in 1994:

For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles.

So, 1994 was a bumper year for describing new species of Cerambycidae? Not quite. Taxatoy is based on names in uBio, and I have a local copy of most of these names. The Cerambycidae names contain lots of duplicate names that differ only in taxon authority. For example, searching the name Ancylocera macrotela on uBio finds:


Ancylocera macrotela	
Ancylocera macrotela Aurivillius, 1912	
Ancylocera macrotela BATES Henry Walter, 1880	
Ancylocera macrotela Bates, 1880	
Ancylocera macrotela Bates, 1885	
Ancylocera macrotela Blackwelder, 1946	
Ancylocera macrotela Chemsak & Linsley, 1970	
Ancylocera macrotela Chemsak, 1963	
Ancylocera macrotela Chemsak, 1964	
Ancylocera macrotela Chemsak, Linsley & Mankins, 1980
Ancylocera macrotela Chemsak, Linsley & Noguera, 1992
Ancylocera macrotela Lameere, 1883	
Ancylocera macrotela Maes & al., 1994	
Ancylocera macrotela Monné & Giesbert, 1994	
Ancylocera macrotela Monné, 1994	
Ancylocera macrotela Noguera & Chemsak, 1996	
Ancylocera macrotela Viana, 1971

These names are chresonyms. The original name is Ancylocera macrotela Bates, 1880 (you can see the first publication of this name in BHL); the rest are subsequent citations of that name (gotta love taxonomy...).

Why the spike in 1994? I suspect that this is due to the publication in 1994 of "Checklist of the Cerambycidae and Disteniidae (Coleoptera) of the Western Hemisphere" by Miguel A Monné and Edmund F Giesbert. At least 8552 names from that checklist seem to have ended up in uBio, all with the date "1994". So the spike is an artefact. Similarly, the other peak (1912) corresponds to the publication of a checklist by Per Olof Christopher Aurivillius, which contributes over 3000 names.
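To make the artefact concrete, here is a minimal Python sketch (not the code Taxatoy or I actually used) that compares two ways of counting. It assumes the first two tokens of a name string are the genus and species and that the authority ends in a four-digit year; the helper names `split_name`, `naive_counts`, and `earliest_counts` are made up for illustration.

```python
import re
from collections import Counter

# A subset of the uBio name strings listed above.
names = [
    "Ancylocera macrotela Aurivillius, 1912",
    "Ancylocera macrotela Bates, 1880",
    "Ancylocera macrotela Bates, 1885",
    "Ancylocera macrotela Monné & Giesbert, 1994",
    "Ancylocera macrotela Monné, 1994",
    "Ancylocera macrotela Viana, 1971",
]

# Assumption: the authority ends with a four-digit year.
YEAR = re.compile(r"\b(1[7-9]\d\d|20\d\d)\s*$")


def split_name(name_string):
    """Return (binomial, year); year is None if no trailing year is found."""
    tokens = name_string.split()
    binomial = " ".join(tokens[:2])
    match = YEAR.search(name_string)
    return binomial, int(match.group(1)) if match else None


def naive_counts(name_strings):
    """Count every name string under the year in its authority."""
    pairs = (split_name(s) for s in name_strings)
    return Counter(year for _, year in pairs if year is not None)


def earliest_counts(name_strings):
    """Count each binomial once, under the earliest year seen for it."""
    earliest = {}
    for binomial, year in (split_name(s) for s in name_strings):
        if year is not None:
            earliest[binomial] = min(year, earliest.get(binomial, year))
    return Counter(earliest.values())


print(naive_counts(names))
print(earliest_counts(names))
```

The naive count credits 1994 with two "new species" for a name first published in 1880, whereas the second count puts the single binomial in 1880. Taking the earliest year is itself a crude heuristic: it would wrongly merge genuine homonyms, and it trusts the dates in the name strings.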

One reason I was suspicious of the Taxatoy graph is that it doesn't look anything like the equivalent graph from the Index of Organism Names. After a bit of fussing I've grabbed data from the ION site, and from Taxatoy's Google Code repository and created the following chart:

The data for this chart is on figshare http://dx.doi.org/10.6084/m9.figshare.156862. ION is an index of all new animal names, based on Zoological Record. I place more confidence in its data than data derived from uBio, but clearly ION has its own issues (such as the gap after 1850, and the uneven sampling of the early years of taxonomy). The key point is that arguments on the temporal distribution of taxonomic descriptions (and the value of legacy literature) need to be aware that the data used is in pretty poor shape.

Update 2013-02-23
Jose Antonio Gonzalez Oreja pointed out in an email that the values for ION that I used were a little higher than those that appear on the ION web site. My script for retrieving those values hadn't quite worked. I've uploaded the corrected data to Figshare http://dx.doi.org/10.6084/m9.figshare.156862, updated the diagram above, and put the web calls I used to fetch the data on GitHub https://gist.github.com/rdmpage/5019153. The story doesn't change, but it helps to have the correct data.

Does the legacy biodiversity literature matter?

2013-02-14T02:10:00.000-08:00

I've just come back from a pro-iBiosphere Workshop at Leiden where the role of "legacy literature" became the subject of some discussion. This continued on Twitter as Ross Mounce (@rmounce) and I went back and forth:

@rdmpage but ~700,000 papers were published in 2009. Were there even 70,000 published in 1920? 2000-2012 contains *a lot*
— Ross Mounce (@rmounce) February 13, 2013

Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature was of most interest to taxonomists, whereas other biologists will be more likely to find what they want from ever growing recent literature. I was arguing that because many taxa are poorly studied, the chances that you will find data on your organism in the recent literature are likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fit the first two categories). My view is based on papers such as Bob May's 1988 paper:

MAY, R. M. (1988). How Many Species Are There on Earth? Science, 241(4872), 1441-1449. doi:10.1126/science.241.4872.1441

In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false).

At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature, but whether the literature was "born digital", that is, whether from its authoring to publication the document has been in digital form, so the output is in a format (e.g., HTML, XML, or PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable because the text has to be extracted by OCR, an error-prone process.

Given these errors, is the effort worth it? At this point I should say that BHL is not using the best OCR technology available (my own experience suggests that ABBYY Online is much better), and our community is not making use of research on automating OCR correction. But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:

Jones, K. E., Bielby, J., Cardillo, M., Fritz, S. A., O'Dell, J., Orme, C. D. L., & Purvis, A. (2009). PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. (W. K. Michener, Ed.) Ecology, 90(9), 2648-2648. doi:10.1890/08-1494.1

PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the year in which these publications appeared:

Decade starting	Publications
1840	1
1860	1
1890	1
1900	10
1910	4
1920	14
1930	48
1940	61
1950	114
1960	295
1970	527
1980	865
1990	1019
2000	183

The bulk of the papers came from the second half of the 20th century, and many of these are "legacy" in the sense that they are in archives like JSTOR, and hence the PDFs are based on scanned images and OCR. The oldest papers are from the 19th century, which is legacy by anyone's definition. My interpretation of this data is that even for a well-studied group such as mammals, the basic organismal-level data sought by comparative biologists is in the "legacy" literature. My suspicion is that if we attempt to build PanTHERIA-style databases for other, less well-studied taxa, the data (if it exists at all) will be found not in the modern literature (where the focus has long since moved on from the organism to genomics and systems biology) but in the corpus of taxonomic and ecological literature that is being scanned and stored in digital archives.
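For anyone who wants to play with the numbers, the table can be tallied in a few lines of Python. This is only a sketch: the decade counts are copied from the table above, and the cut-off decades passed to `share_before` are my own arbitrary choices, since "born digital" doesn't map cleanly onto a year.

```python
# Publications per decade, copied from the table above.
publications = {
    1840: 1, 1860: 1, 1890: 1, 1900: 10, 1910: 4, 1920: 14, 1930: 48,
    1940: 61, 1950: 114, 1960: 295, 1970: 527, 1980: 865, 1990: 1019,
    2000: 183,
}

total = sum(publications.values())


def share_before(decade):
    """Fraction of publications from decades starting before `decade`."""
    older = sum(n for d, n in publications.items() if d < decade)
    return older / total


print(total)  # should match the 3143 publications in the metadata
for cutoff in (1990, 2000):  # arbitrary cut-offs, chosen for illustration
    print(f"before {cutoff}: {share_before(cutoff):.1%}")
```

Whichever cut-off you prefer, the point is the same: most of the sources predate the decade in which the database was published.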

Update
I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.

More GBIF specimen identifier strangeness

2013-01-18T08:33:00.000-08:00

Continuing the theme of trying to map specimens cited in the literature to the equivalent GBIF records, consider the GBIF record http://data.gbif.org/occurrences/685591320, which according to GBIF is specimen "ZFMK 188762" (a [sic] holotype of Praomys hartwigi).

This is odd, because the original publication of this name (Eisentraut, M. 1968. Beitrag zur Saugetierfauna von Kamerun. Bonner Zoologische Beitraege, 19:1-14, see PDF below) gives the type (p. 11) as "Museum A. Koenig, Kat. Nr. 68. 7".

The GBIF record includes links to images of ZFMK 188762, such as http://www.biologie.uni-ulm.de/cgi-bin/imgobj.pl?sid=T&lang=e&id=102323.

If we open this link we see that the specimen is listed as "ZFMK-68.7", which matches the original description. "ZFMK-68.7" is a link to http://www.biologie.uni-ulm.de/cgi-bin/herbar.pl?herbid=188762&sid=T&lang=e, which is the record for this specimen in the SysTax database.

Note that this URL includes the number 188762, which is treated as the catalogue number by GBIF (i.e., "ZFMK 188762"). So, it seems that in the data provided by SysTax the primary key in that database (188762) has become the catalogue number in GBIF (I tried to verify this by clicking on the original provider message on the GBIF page but it failed to produce anything). This means any naive attempt to locate the specimen "ZFMK-68.7" in GBIF is going to fail because the harvesting and indexing has conflated a local primary key with the catalogue number that appears in publications that refer to this specimen.
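The mismatch is easy to demonstrate. The Python sketch below pulls the `herbid` parameter out of the SysTax URL and compares it with the catalogue numbers found in the three sources. The `catalogue_number` helper and its regular expression are hypothetical, a crude normalisation that happens to work for these three strings and nothing more.

```python
import re
from urllib.parse import parse_qs, urlparse


def catalogue_number(label):
    """Return the last run of digits and dots in a label, with spaces removed."""
    runs = re.findall(r"\d[\d.\s]*\d|\d", label)
    return re.sub(r"\s+", "", runs[-1]) if runs else None


systax_url = ("http://www.biologie.uni-ulm.de/cgi-bin/herbar.pl"
              "?herbid=188762&sid=T&lang=e")
# The SysTax primary key is the value of the herbid parameter.
systax_key = parse_qs(urlparse(systax_url).query)["herbid"][0]

publication = catalogue_number("Museum A. Koenig, Kat. Nr. 68. 7")
systax_label = catalogue_number("ZFMK-68.7")
gbif = catalogue_number("ZFMK 188762")

print(publication == systax_label)  # the paper and SysTax agree
print(gbif == systax_key)           # GBIF holds the primary key instead
print(gbif == publication)          # so a search by catalogue number fails
```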

Sometimes I think we are doing our level best to make retrieving data as hard as possible...

Thoughts on Mendeley and Elsevier

2013-01-18T03:31:00.000-08:00

Elsevier In Advanced Talks To Buy Mendeley For Around $100M To Beef Up In Social, Open Source Educ... tcrn.ch/W8L5io by @ingridlunden
— TechCrunch (@TechCrunch) January 17, 2013

The rumour that Elsevier is buying Mendeley has been greeted with a mixture of horror and anger, peppered with a few congratulations, I told you so's, and touting for new customers:

Oh FFS, nooooooo! MT @phylogenomics: Elsevier In Advanced Talks To Buy Mendeley For Around $100M techcrunch.com/2013/01/17/els… via @techcrunch
— Siouxsie Wiles (@SiouxsieW) January 17, 2013

.@srp @dancohen I imagine a coin flip deep within the Elsevier Headquarters Cave, using a gold, custom-minted BUY OR SUE coin.
— Jason Priem (@jasonpriem) January 17, 2013

By the way, over here @ Zotero headquarters (@chnm) we welcome all Mendeley users to a truly open platform for research zotero.org
— Dan Cohen (@dancohen) January 17, 2013

Here's some probably worthless speculation to add to the mix. Disclosure: I use Mendeley to manage 100,000's of references, and use the API for various projects. I'm not a paying customer (but I do pay for some Internet services such as DropBox, BackBlaze, and Spotify, so it's not that I won't pay, it's just that the service Mendeley charge for doesn't interest me). I've published in Elsevier journals (most recently a couple of papers that, thanks to the efforts of Paul Craze, editor of TREE, are "free" in the sense you can download the PDF for free), and I took part in the Elsevier Grand Challenge.

So, given that I'm suitably compromised, here are some thoughts.

Elsevier suck

Elsevier are big, ugly, and at the corporate level are doing things that actively make researchers angry (see The Cost of Knowledge).

Elsevier rocks

Elsevier are one of the most innovative science publishers around. They fund challenges, are investing heavily in interactive and semantic markup of papers (for example, interactive phylogenies), and have built an app ecosystem on their publishing platform.

Mendeley sucks

Mendeley is suffering from some serious failings, most of which could be addressed with sufficient resources. The API sucks, mostly because Mendeley themselves don't actually use it. The Desktop client communicates with Mendeley's database using a different protocol, hence the API lacks the functionality needed to make truly great apps on the platform. The algorithms Mendeley use to de-duplicate their catalogue are flawed, occasionally creating entirely fictional entries.

Mendeley rocks

The way Mendeley engineered the creation of a bibliographic database in the cloud is genius, as is their recognition that the object around which scientists will cluster is the article, not the author. They helped foster the altmetrics movement, and have a great presence on Twitter and at conferences (i.e., you can talk to actual people who write code).

What happens next?

Let's assume that Elsevier does, indeed, buy Mendeley, and wants to do interesting things with Mendeley, and that Mendeley doesn't become one of the many startups that have a successful "exit" for the founders but ends up dying in the bosom of a larger company. Here are some possibilities.

Mendeley becomes iTunes for papers

Forget the "Last.fm" of papers, what about the "iTunes of papers"? Big publishers are facing a revolt over the cost of institutional subscriptions, and journals are increasingly irrelevant as aggregations. The literature that people read is widely scattered across different outlets. Journals are archaic in the same way that music albums are mostly a thing of the past: people mix and match singles.

In the recent fight between UC Davis and Nature, Nature estimated that "CDL will be paying roughly $0.56 per download". So, why not charge a buck a paper? Mendeley's web interface is practically crying out for a "BUY THIS PAPER" button. Under this model, Elsevier has an outlet for its content that doesn't force people to subscribe to large amounts of stuff they don't want. Mendeley could be used to establish a relationship directly with paying customers, rather than institutions.

Mendeley becomes the de facto measure of research impact

By combining Mendeley's readership data with citations, Elsevier could construct powerful measures of research impact, bringing altmetrics into the mainstream. Couple this with links to institutions, and Elsevier could provide universities with all the data they need to evaluate academic performance (gulp).

Mendeley becomes an authoring tool

Managing references and inserting citations into manuscripts is one of the basic tasks facing an academic author. Authoring tools are evolving in the direction of being online, and embedding more semantic markup (e.g., these are taxon names, this is a chemical compound, this is a statement of causality). In a sense reference lists are the one form of structured markup we are already familiar with. Why not build on that and create an authoring platform?

Mendeley becomes the focus of post-publication review

Publishers have failed to crack the problem of post-publication review. Several provide the ability for readers to comment on an article online, but this has failed to take off. I think this is because the sociology is wrong: if you want a conversation you need to go where the people are, not expect them to come to you. Given that people are bookmarking papers in Mendeley, the next step is to get them to comment, or aggregate their annotations (in the same way that Amazon's Kindle can show you passages that others have highlighted).

Interesting times...

Tight versus loose coupling

2013-01-17T05:34:00.000-08:00

Following on from my previous post bemoaning the lack of links between biodiversity data sets, it's worth looking at different ways we can build these links. Specifically, data can be tightly or loosely coupled.

Tight coupling

Tight coupling uses identifiers. A good example is bibliographic citation, where we state that one reference cites another by linking DOIs. This makes it easy to store these links in a database, such as the Open Citations project which is exploring citation networks based on data from PubMed Central. Tight coupling also makes it easy to aggregate information from multiple sources. For example, one database may record citations of a paper, another may record citations of GenBank sequences, a third may record publication of taxonomic names. If all three databases use the same identifiers for the same publications (e.g., DOIs) we can combine them and potentially discover new things (for example, we could answer the question "how many descriptions of new species include sequence data?").
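As a toy illustration of why shared identifiers matter, here is a Python sketch with three tiny, entirely made-up databases keyed by DOI (the DOIs, names, and sequence labels are all invented). Because the keys agree, answering the question above is just a set intersection.

```python
# Three hypothetical databases keyed by DOI. All values are invented.
paper_citations = {"10.1000/a": 12, "10.1000/b": 3, "10.1000/c": 0}
sequence_citations = {
    "10.1000/a": ["SEQ0001"],
    "10.1000/c": ["SEQ0002", "SEQ0003"],
}
new_names = {
    "10.1000/a": ["Aus bus"],
    "10.1000/b": ["Cus dus"],
    "10.1000/d": ["Eus fus"],
}


def descriptions_with_sequences(names_db, sequences_db):
    """DOIs of papers that publish new names and also cite sequences."""
    return sorted(set(names_db) & set(sequences_db))


with_sequences = descriptions_with_sequences(new_names, sequence_citations)
print(f"{len(with_sequences)} of {len(new_names)} descriptions cite sequences")

# The same key also lets us pull in data from the third database.
for doi in with_sequences:
    print(doi, new_names[doi], "cited", paper_citations.get(doi, 0), "times")
```

If each database identified publications by a free-text citation string instead, every one of those lookups would become a fuzzy matching problem.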

Loose coupling

In part this post has been prompted by a discussion I've been having with Paul Murray (@PaulMurrayCbr) on his blog. Paul has added COinS to pages in the Australian Faunal Directory (AFD). These are snippets of HTML that encode a bibliographic reference as an OpenURL, and which browser extensions such as OpenURL Referrer for Firefox and COinS 2 OpenURL for Chrome can convert into links.
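For readers who haven't met COinS, the Python sketch below builds one for a journal article. The function name `coins_span` is made up, and I'm assuming the common journal-article keys of OpenURL 1.0 (Z39.88-2004); this is not the code AFD uses. The reference encoded is the Sarkar et al. paper cited in an earlier post.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(atitle, jtitle, volume, spage, date):
    """Encode a journal article as a COinS span (OpenURL key-value pairs)."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.date", date),
    ]
    # URL-encode the pairs, then HTML-escape the ampersands for the attribute.
    return f'<span class="Z3988" title="{escape(urlencode(pairs))}"></span>'


span = coins_span(
    atitle="Exploring historical trends using taxonomic name metadata",
    jtitle="BMC Evolutionary Biology",
    volume="8",
    spage="144",
    date="2008",
)
print(span)
```

The span is empty and so invisible on the page. A browser extension looks for elements with the class "Z3988", reads the title attribute, and appends it as a query to the address of whichever OpenURL resolver the user has configured.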

I've mapped many of the references in AFD to standard identifiers such as DOIs, or to digital libraries such as BioStor, and this tightly-coupled mapping is available in AFD on CouchDB. To date these mappings haven't been imported into AFD itself, which means that users of the original site don't have easy access to the literature that appears on that site (basically they'll have to Google each reference). However, if they have a browser extension (or the JavaScript bookmarklet available from http://iphylo.org/~rpage/afd/openurl) that supports COinS, they will now see a clickable link that, in many cases, will take them to the online version of the corresponding reference.

This is an example of loose linking. The AFD site provides OpenURL links which can be resolved "just in time". Users of the AFD site can get some of the benefits of the tight linking stored in my CouchDB version of AFD, but the maintainers of AFD itself don't need to add code to handle these identifiers.

A lot of linking of biodiversity data shares this pattern. Instead of linking identifiers, one site links to another through a query. For example, NCBI taxonomy links to GBIF using URLs of the form "http://data.gbif.org/search/<taxon name>". Linking by query is potentially more robust than simply linking by URLs, especially if the target of the link doesn't ensure its identifiers are stable (GBIF, I'm looking at you). But there may be multiple ways to construct the same search query, which makes them poor candidates for use as identifiers. COinS are perhaps an extreme example, where there are at least two versions of the OpenURL standard in the wild, and the key-value pairs that make up the query can be in any order.
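The ordering problem is easy to show in Python. In this sketch the two query strings are invented examples describing the same reference, and `canonical_query` is a hypothetical helper that sorts the key-value pairs before comparing. It does nothing about the harder problem of different versions of the standard using different keys.

```python
from urllib.parse import parse_qsl, urlencode


def canonical_query(query):
    """Sort the key-value pairs so equivalent queries compare equal."""
    return urlencode(sorted(parse_qsl(query, keep_blank_values=True)))


# The same three pairs in two different orders.
a = "rft.jtitle=MycoKeys&rft.volume=5&rft.spage=45"
b = "rft.spage=45&rft.jtitle=MycoKeys&rft.volume=5"

print(a == b)                                    # False: the strings differ
print(canonical_query(a) == canonical_query(b))  # True once pairs are sorted
```

An identifier shouldn't need this kind of normalisation before you can test two of them for equality, which is the sense in which queries make poor identifiers.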

If the goal is to integrate data then having the same identifiers for the same thing makes life a lot simpler, and means that we can switch from endless data cleaning and matching ("is this citation the same as that one?") to building systems that can tackle some of the scientific questions we are interested in. But in their absence we are left with a kind of defensive programming where we expect the links to fail. Loose linking creates "soft links" that may work for humans (we get to click on a link and, with luck, see a web page) but they are less useful for mechanised tools trying to aggregate data.

When tight=loose

Although I've distinguished between tight and loose coupling, the distinction is not absolute. Indeed, one could argue that the best "tight" coupling is a form of "loose" coupling. For example, the most obvious form of tight linking is to use URLs for the things of interest. This is simple and direct, but has drawbacks for both publisher and consumer. For the consumer, we are now at the mercy of the publisher's ability to keep the URLs stable. If they change (for example, a publishing firm is bought by another firm, or adopts a new publishing platform which generates different URLs) then the links break (not to mention that URLs for some resources, such as articles, are often conditional on how you are accessing the article, and may contain extraneous cruft such as session ids, etc.).

Likewise, the publisher is now constrained by a decision it made at the time of publication. If it decides to adopt better technology, or if circumstances otherwise change, it may find itself having to break existing identifiers. Some of this can be avoided if we design clean URLs, such as this example http://data.rbge.org.uk/herb/E00001195 given by Roger Hyam. However, I wonder how persistent the ".uk" part of this URL will be if the Royal Botanic Garden Edinburgh finds itself in a Scotland that is no longer part of the United Kingdom.

One solution is our old friend indirection, where we put an identifier in between the consumer and the actual URL of the resource, and the consumer uses that identifier. This is the rationale for DOIs. The user gets an identifier that is unlikely to change, and hence can build systems upon that identifier. The publisher knows that they can change how they serve the corresponding data without disrupting their users, so long as they update the URL that the DOI points to. Indirection gives users the appearance of tight coupling without imposing the constraints of tight coupling on publishers.
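Indirection is simple enough to sketch in a few lines of Python. The `Resolver` class below is a toy stand-in for a DOI resolver, not how the DOI system is actually implemented, and both target URLs are invented. The DOI itself is the one for the MycoKeys paper discussed in the next post.

```python
class Resolver:
    """Toy resolver: maps stable identifiers to their current URLs."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        """Create or update the URL an identifier points to."""
        self._targets[identifier] = url

    def resolve(self, identifier):
        """Return the current URL for an identifier."""
        return self._targets[identifier]


resolver = Resolver()
doi = "10.3897/mycokeys.5.4302"

# The publisher registers the article (hypothetical URL).
resolver.register(doi, "http://old-platform.example/article?id=4302")
before = resolver.resolve(doi)

# The publisher changes platform and updates the target (hypothetical URL).
resolver.register(doi, "http://new-platform.example/articles/4302")
after = resolver.resolve(doi)

# Consumers stored only the DOI, so nothing on their side has changed.
print(before)
print(after)
```

The consumer's database holds the DOI and never the URL, so the publisher's move costs one update to the resolver rather than a broken link in every database that cites the article.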

Megascience platforms for biodiversity information: what's wrong with this picture?

2013-01-16T03:31:00.000-08:00

The journal MycoKeys has published the following paper:

Triebel, D., Hagedorn, G., & Rambold, G. (2012). An appraisal of megascience platforms for biodiversity information. MycoKeys, 5(0), 45–63. doi:10.3897/mycokeys.5.4302

This paper contains a diagram that seems innocuous enough but which I find worrying:

The nodes in the graph are "biodiversity megascience platforms", the edges are "cross-linkages and data exchange". What bothers me is that if you view biodiversity informatics through this lens then the relationships among these projects become the focus. Not the data, not the users, nor the questions we are trying to tackle. It is all about relationships between projects.

I want a different view of the landscape. For example, below is a very crude graph of the kinds of things I think about, namely kinds of data and their interrelationship:

What tends to happen is that this data landscape gets carved up by different projects, so we get separate databases of taxonomic names, images, publications, and specimens (these are the "megascience platforms" such as CoL, EOL, GBIF). This takes care of the nodes, but what about the edges, the links between the data? Typically what happens is lots of energy is expended on what to call these links, in other words, the development of the vocabularies and ontologies such as those curated by TDWG. This is all valuable work, but this doesn't tackle what for me is the real obstacle to progress, which is creating the links themselves. Where are the "megascience platforms" devoted to linking stuff together?

When we do have links between different kinds of data these tend to be within databases. For example, GenBank explicitly links sequences to publications in PubMed, and taxa in the NCBI taxonomy database. All three (sequence, publication, taxon) have identifiers (accession number, PubMed id, taxon id, respectively) that are widely used outside GenBank (and, indeed, are the de facto identifiers for the bioinformatics community). Part of the reason these identifiers are so widely used is because GenBank is the only real "megascience platform" in the list studied by Triebel et al. It's the only one that we can readily do science with (think BLAST searches, think of the number of databases that have repurposed GenBank data, or build on NCBI services).

Many of the questions we might ask can be formulated as paths through a diagram like the one above. For example, if I want to do phylogeography, then I want the path phylogeny -> sequence -> specimen -> locality. If I'm lucky the phylogeny is in a database and all the sequences have been georeferenced, but often the phylogeny isn't readily available digitally, I need to map the OTUs in the tree to sequences, I then need to track down the vouchers for those sequences, and obtain the localities for those vouchers from, say, GBIF. Each step involves some degree of pain as we try and map identifiers from one database to those in another.
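To show what "questions as paths" means in practice, here is a Python sketch that finds the shortest chain of links between two kinds of data using a breadth-first search. The `links` dictionary is my own guess at a plausible set of edges, not a transcription of the diagram, and `find_path` is a made-up helper.

```python
from collections import deque

# Hypothetical links between kinds of data (an assumption for illustration).
links = {
    "phylogeny": ["sequence", "publication"],
    "sequence": ["specimen", "publication", "taxon"],
    "specimen": ["locality", "taxon", "image"],
    "publication": ["taxon"],
    "taxon": ["image"],
}


def find_path(graph, start, goal):
    """Breadth-first search for the shortest chain of links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = find_path(links, "phylogeny", "locality")
print(" -> ".join(path))
```

Finding the path in a diagram is the easy part. Each arrow stands for a mapping between identifiers in two databases, and a query only works if every arrow along the path has actually been populated.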

If I want to do classical alpha taxonomy I need information on taxonomic names, concepts, publications, attributes, and specimens. The digital links between these are tenuous at best (where are the links between GBIF specimen records and the publications that cite those specimens, for example?).

Focussing on so-called "platforms" is unfortunate, in my opinion, because it means that we focus on data and how we carve up responsibility for managing it (never mind what happens to data that lacks an obvious constituency). The platforms aren't what we should be focussing on, it is the relationships between data (and no, these are not the same as the relationships between the "platforms").

If I'd like to see one thing in biodiversity informatics in 2013 it is the emergence of a "platform" that makes the links the centre of its efforts. Because without the links we are not building "platforms", we are building silos.

iDigBio: You are putting identifiers on the wrong thing

2013-01-15T02:02:00.000-08:00

The Integrated Digitized Biocollections (iDigBio) project aims to advance digitising US biodiversity collections. They recently published a GUID Guide for Data Providers. In the PDF document I read this:

It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only. (emphasis added)

My heart sank. There's nothing wrong with having identifiers for metadata (apart from inviting the death spiral that is metadata about metadata), but surely the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.

Now, identifiers for metadata can be useful. For example, there is a specimen of Parathemisto japonica in the National Museum of Natural History, Smithsonian Institution with the label "USNM 100988". The NMNH web site has a picture of the index card for this specimen:

This is an image of the metadata, not the specimen itself. We could link the metadata to this image, but of course we also want to link it to the actual specimen.

Specimens are the things we collect, preserve, dissect, measure, sequence, photograph, and so on. I want to link a specimen to the sequences that have been obtained from that specimen; I want to list the publications that cite that specimen; I want to be able to aggregate data on a specimen from multiple sources; and I want to be able to add annotations, including misidentifications, simple typos, or missing georeferencing.

Key to this is having identifiers for specimens. Identifiers for metadata about those specimens are not good enough. By analogy with bibliographic citation, one of the important decisions CrossRef made was that DOIs for articles identify the article, not the metadata about the article, or any of the different formats (HTML, PDF, print) an article may occur in. This means we can build databases about things and relationships (this article cites that one, these articles were authored by this person, etc.).

As it stands, if we don't have identifiers for specimens then we can't link data together. For example, the frog specimen "USNM 195785" is depicted in the image below (from EOL):

It is also listed in various papers in BioStor. In the absence of a globally unique identifier for this specimen how do I make these links? "USNM 195785" won't do because there are at least four specimens in the USNM with the catalogue number "195785". The GBIF occurrence id for this specimen (http://data.gbif.org/occurrences/244405570) would be an obvious candidate, were it not for the fact that GBIF has no concept of stable identifiers and its occurrence ids regularly change.

I confess I'm flabbergasted that iDigBio has avoided tackling the issue of specimen identifiers. If any museum wants to discover how its collection is being used to support science it will want to find the citations of its specimens in scientific papers and databases. This requires identifiers for specimens.

Elsevier articles have interactive phylogenies

2012-12-07T03:44:00.000-08:00

Say what you will about Elsevier, they are certainly exploring ways to re-imagine the scientific article. In a comment on an earlier post Fabian Schreiber pointed out that Elsevier have released an app to display phylogenies in articles they publish. The app is based on jsPhyloSVG and is described here. You can see live examples in these articles:

Matos-Maraví, P. F., Peña, C., Willmott, K. R., Freitas, A. V. L., & Wahlberg, N. (2013). Systematics and evolutionary history of butterflies in the “Taygetis clade” (Nymphalidae: Satyrinae: Euptychiina): Towards a better understanding of Neotropical biogeography. Molecular Phylogenetics and Evolution, 66(1), 54–68. doi:10.1016/j.ympev.2012.09.005

Poćwierz-Kotus, A., Burzyński, A., & Wenne, R. (2010). Identification of a Tc1-like transposon integration site in the genome of the flounder (Platichthys flesus): A novel use of an inverse PCR method. Marine Genomics, 3(1), 45–50. doi:10.1016/j.margen.2010.03.001

NEXUS parser and tree viewer in Javascript

2012-12-06T15:00:00.000-08:00

Following on from the SVG experiments I've started to put some of the Javascript code for displaying phylogenies on Github. Not a repository yet, but as gists, little snippets of code. Mike Bostock has created http://bl.ocks.org/ which makes it possible to host gists as working examples, so you can play with the code "live".

The first gist takes a Newick tree, parses it and displays a tree. You can try it at https://bl.ocks.org/d/4224658/.
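For readers who want to see what parsing a Newick tree involves, here is a minimal sketch in Python. It is not the code in the gist (which is Javascript); the function names parse_newick and leaf_labels and the dictionary layout are assumptions made for this example. It handles nested parentheses, labels and branch lengths, but not quoted labels or comments.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with 'label', 'length', 'children'."""
    text = "".join(text.split())  # whitespace and line breaks carry no meaning here
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"label": "", "length": None, "children": []}
        if text[pos:pos + 1] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(parse_node())
                if text[pos:pos + 1] == ",":
                    pos += 1  # another sibling follows
                elif text[pos:pos + 1] == ")":
                    pos += 1  # end of this node's children
                    break
                else:
                    raise ValueError(f"unexpected {text[pos:pos + 1]!r} at position {pos}")
        # Optional label (leaf name, or internal node label after ")")
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        node["label"] = text[start:pos]
        # Optional branch length after ":"
        if text[pos:pos + 1] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


def leaf_labels(node):
    """Return leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["label"]]
    return [label for child in node["children"] for label in leaf_labels(child)]


tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
print(leaf_labels(tree))
```

Because branch lengths are read with float(), values in scientific notation (6.195e-05) and negative lengths are accepted as they are.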

The second gist takes a basic NEXUS file containing a TREES block and displays a tree (try it at http://bl.ocks.org/d/4229068/). You can grab example NEXUS tree files from TreeBASE such as tree Tr57874.
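As a rough illustration of what reading a TREES block means, here is a small Python sketch. It is an assumption-laden simplification, not the gist's code: the function name trees_from_nexus and the sample file are made up, it uses regular expressions rather than a real NEXUS tokenizer, and it will not cope with quoted labels or comments that contain semicolons.

```python
import re


def trees_from_nexus(text):
    """Return ({tree name: Newick string}, {translate token: label}) for the first TREES block."""
    block = re.search(r"begin\s+trees\s*;(.*?)end\s*;", text, re.IGNORECASE | re.DOTALL)
    if block is None:
        return {}, {}
    body = block.group(1)

    # An optional TRANSLATE command maps short tokens to taxon labels.
    translate = {}
    match = re.search(r"translate\s+(.*?);", body, re.IGNORECASE | re.DOTALL)
    if match:
        for entry in match.group(1).split(","):
            if entry.strip():
                token, label = entry.split(None, 1)
                translate[token] = label.strip()

    # Each TREE command looks like: TREE name = [optional comment] (newick);
    trees = {}
    pattern = r"\btree\s+(\S+)\s*=\s*(?:\[[^\]]*\]\s*)?(\([^;]*;)"
    for name, newick in re.findall(pattern, body, re.IGNORECASE):
        trees[name] = newick
    return trees, translate


# A made-up NEXUS file for illustration only.
sample = """#NEXUS
BEGIN TREES;
  TRANSLATE
    1 Homo_sapiens,
    2 Pan_troglodytes,
    3 Gorilla_gorilla;
  TREE example = [&R] ((1:0.1,2:0.1):0.05,3:0.2);
END;
"""

trees, translate = trees_from_nexus(sample)
print(trees)
print(translate)
```

The Newick strings that come out still use the TRANSLATE tokens as leaf labels, so a viewer would swap them for the taxon labels before drawing.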

Why am I doing this?
Apart from "because it's fun" there are two reasons. The first is that I want a simple way to display phylogenetic trees in web pages, and doing this entirely in the web browser (Javascript parses the tree and renders it in SVG) saves me having to code this on my server. Being able to do this in the browser opens up the opportunity to embed tree descriptions in HTML, for example, and have the browser render the tree. This means the same web page can have machine-readable data (the tree description) but also generate a nice tree for the reader. As an aside, it also shows that TreeBASE could display perfectly good, interactive trees without resorting to a Java applet.

The other reason is that the web seems to be moving to Javascript as the default language, and JSON as the standard data format. Instead of large chunks of "middleware" (written in a scripting language such as Perl, PHP, or, gack, Java) which is responsible for talking to databases on the server and sending static HTML to the web browser, we now have browsers that can support sophisticated, interactive interfaces built using HTML and Javascript. On the server side we have databases that speak HTTP (essentially removing the need for middleware), store JSON, and use Javascript as their programming language (e.g., CouchDB). In short, it's Javascript, Javascript, everywhere.

The Tree of Life

2012-12-05T12:51:00.000-08:00

The following poem by David Maddison was published in Systematic Biology (doi:10.1093/sysbio/sys057) under a CC-BY-NC license.

I think that I shall never see
A thing so awesome as the Tree
That links us all in paths of genes
Down into depths of time unseen;

Whose many branches spreading wide
House wondrous creatures of the tide,
Ocean deep and mountain tall,
Darkened cave and waterfall.

Among the branches we may find
Creatures there of every kind,
From microbe small to redwood vast,
From fungus slow to cheetah fast.

As glaciers move, strikes asteroid
A branch may vanish in the void:
At Permian's end and Tertiary's door,
The Tree was shaken to its core.

The leaves that fall are trapped in time
Beneath cold sheets of sand and lime;
But new leaves sprout as mountains rise,
Breathing life anew 'neath future skies.

On one branch the leaves burst forth:
A jointed limb of firework growth.
With inordinate fondness for splitting lines,
Armored beetles formed myriad kinds.

Wandering there among the leaves,
In awe of variants Time conceived,
We ponder the shape of branching fates,
And elusive origins of their traits.

Three billion years the Tree has grown
From replicators' first seed sown
To branches rich with progeny:
The wonder of phylogeny.

Viewing phylogenies on the web: Javascript conversion of Newick tree to SVG

2012-12-04T08:49:00.000-08:00

Quick test of an idea I'm playing with. By embedding a Newick-format tree description in HTML and adding some Javascript I can go from this:

...to this (you will need an SVG-capable browser to see anything). The Javascript parses the Newick tree, generates SVG, then replaces the Newick tree in the HTML with the corresponding picture. No need for server-side graphics, the diagram is generated by your web browser based on the Newick tree description.

((((((((219923430:0.046474,219923429:0.009145):0.037428,219923426:0.038397):0.015434,(219923419:0.022612,219923420:0.015561):0.050529):0.004828,(207366059:0.020922,207366058:0.016958):0.038734):0.003901,219923422:0.072942):0.005414,((219923443:0.038239,219923444:0.025617):0.037592,(219923423:0.056081,219923421:0.055808):0.003788):0.009743):0.001299,(219923469:0.072965,125629132:0.044638):0.012516):0.011647,(((((219923464:0.069894,((((((125628927:0.021470,219923456:0.021406):0.003083,219923455:0.021625):0.029147,219923428:0.042785):0.001234,225685777:0.037478):0.016027,((((56549933:0.003265,219923453:-0.000859):0.015462,150371743:0.009558):0.004969,219923452:0.014401):0.024398,((((((150371732:0.001735,((150371733:0,150371736:0):6.195e-05,150371735:-6.195e-05):7.410e-05):0.000580,150371734:0.001196):0.000767,(150371737:0.001274,(150371738:0,150371740:0):0.000551):0.000498):0.000905,70608555:0.003205):0.004807,150371741:0.010751):8.979e-05,150371739:0.006647):0.022090):0.012809):0.011838,219923427:0.057366):0.009364):0.004238,((219923450:0.022699,125628925:0.012519):0.048088,219923466:0.046514):0.003608):0.007025,((56549930:0.067920,219923440:0.059754):0.002384,((219923438:0.044329,219923439:0.038470):0.014514,(219923442:0.038021,(((207366060:0,207366061:0):0.001859,125628920:0.001806):0.024716,((((125628921:0.005610,207366057:0.003531):0.001354,(207366055:0.003311,207366056:0.002174):0.003225):0.011836,207366062:0.019303):0.003741,((((((207366047:0,207366048:0):0,207366049:0):0.001563,207366050:0.000272):0.002214,(207366051:0.000818,125628919:0.001017):0.000675):0.003916,207366054:0.007924):0.004138,((219923441:0.000975,207366052:-0.000975):0.000494,207366053:-0.000494):0.012373):0.010040):0.003349):0.017594):0.011029):-0.003134):0.011235):0.004149,((((219923435:0.064354,219923424:0.067340):0.002972,219923454:0.045087):0.002092,((219923460:0.027282,219923465:0.025756):0.031269,(219923462:0.017555,219923425:-0.009591):0.047358):0.006198):0.004242,(((219923463:0.031885,
(219923459:0.000452,219923458:-0.000452):0.029292):0.005200,225685776:0.024691):0.020131,219923461:0.042563):0.004673):0.009128):0.001452,((56549934:0.088142,56549929:0.066475):0.004212,(219923437:0.048313,219923436:0.044997):0.014553):0.008927):0);
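The steps described above (parse the Newick string, compute coordinates, emit SVG) can be sketched in a few lines. The version below is in Python rather than the Javascript used on this page, and it is only an illustration under stated assumptions: the function names, the fixed 120-pixel label margin and the layout rule (leaves aligned on the right, internal nodes placed by their height above the deepest leaf, branch lengths ignored) are all choices made for this example, not a description of the actual script.

```python
from xml.sax.saxutils import escape


def parse_newick(text):
    """Tiny Newick reader (no quoted labels); branch lengths are discarded."""
    text = "".join(text.split())
    pos = 0

    def node():
        nonlocal pos
        n = {"label": "", "children": []}
        if text[pos:pos + 1] == "(":
            while text[pos:pos + 1] in ("(", ","):  # each "(" or "," starts a child
                pos += 1
                n["children"].append(node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        n["label"] = text[start:pos].split(":")[0]
        return n

    return node()


def layout(n, next_row):
    """Set n['row'] (vertical slot) and n['height'] (edges above the deepest leaf)."""
    if not n["children"]:
        n["row"], n["height"] = next_row[0], 0
        next_row[0] += 1
    else:
        for child in n["children"]:
            layout(child, next_row)
        rows = [c["row"] for c in n["children"]]
        n["row"] = (min(rows) + max(rows)) / 2
        n["height"] = 1 + max(c["height"] for c in n["children"])


def newick_to_svg(newick, width=400, row_height=14):
    """Return an SVG string drawing the tree as a rectangular cladogram."""
    root = parse_newick(newick)
    counter = [0]
    layout(root, counter)
    step = (width - 120) / max(root["height"], 1)  # assumed 120px margin for labels
    paths, labels = [], []

    def draw(n):
        x = width - 120 - n["height"] * step
        y = (n["row"] + 1) * row_height
        for child in n["children"]:
            cx, cy = draw(child)
            # Rectangular branch: vertical at the parent's x, then horizontal to the child.
            paths.append(f'<path d="M{x:.1f} {y:.1f} V{cy:.1f} H{cx:.1f}" fill="none" stroke="black"/>')
        if not n["children"]:
            labels.append(f'<text x="{x + 4:.1f}" y="{y + 4:.1f}" font-size="10">{escape(n["label"])}</text>')
        return x, y

    draw(root)
    height = (counter[0] + 1) * row_height
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
            + "".join(paths + labels) + "</svg>")


svg = newick_to_svg("((A:0.1,B:0.2):0.05,C:0.3);")
print(svg)
```

Saving the printed string to a file with an .svg extension and opening it in a browser should show a three-leaf tree. The same approach works on the Newick string above, which is split over two lines; the parser discards the line break along with any other whitespace.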

Here's the same tree as a phylogram: