
BioNames ideas - automatically finding synonyms from the literature

One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence of this is any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.

Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms.

Reading literature extracted from the Biodiversity Heritage Library I'm struck by how often I come across papers, such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms?

One quick and dirty way to do this is look for sets of names that have the same species name but different generic names, e.g.

  • Atlantoxerus getulus
  • Sciurus getulus
  • Xerus getulus

If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.
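The quick and dirty approach can be sketched in a few lines of Python. This is a minimal illustration, not the BioNames code: it assumes the names on a page have already been found by some name-finding step and arrive as a plain list of strings, and the function name `candidate_synonyms` is made up for the example.

```python
from collections import defaultdict

def candidate_synonyms(names_on_page):
    """Group binomials found on one page by their species name.

    Returns {species name: sorted list of full names} for every species
    name that appears with more than one genus on the page.
    """
    genera_by_species = defaultdict(set)
    for name in names_on_page:
        parts = name.split()
        if len(parts) != 2:
            continue  # this sketch only handles plain "Genus species" names
        genus, species = parts
        genera_by_species[species.lower()].add(genus)
    return {
        species: sorted(f"{genus} {species}" for genus in genera)
        for species, genera in genera_by_species.items()
        if len(genera) > 1
    }

page = ["Atlantoxerus getulus", "Sciurus getulus", "Xerus getulus", "Sciurus vulgaris"]
print(candidate_synonyms(page))
```

Names that share a species name with no other genus on the page (here Sciurus vulgaris) are simply dropped, so what is left is the list of possible synonyms to show alongside a name.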

There's a lot more that could be done with this sort of approach. For example, this approach only works if the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species).
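One crude way to cope with gender endings and changes of rank is to compare a "stem" of the last word in each name rather than the species name itself. The sketch below is only an assumption about how this might be done: the list of endings is deliberately tiny, the helper names `epithet_stem` and `match_key` are invented for the example, and "Genus getula" is a made-up name used purely to show the behaviour.

```python
# Two-letter endings are tried before one-letter endings.
GENDER_ENDINGS = ("us", "um", "is", "a", "e")

def epithet_stem(epithet):
    """Strip a common Latin gender ending, keeping a stem of at least 3 letters."""
    epithet = epithet.lower()
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending) and len(epithet) > len(ending) + 2:
            return epithet[: -len(ending)]
    return epithet

def match_key(name):
    """Key for grouping names: the stem of the last word of the name.

    Using the last word means a species and a subspecies carrying the same
    name ("Genus species" and "Genus other species") get the same key.
    """
    return epithet_stem(name.split()[-1])

print(match_key("Xerus getulus"))   # getul
print(match_key("Genus getula"))    # getul (hypothetical name)
```

This will miss pairs such as niger/nigra and will occasionally lump together unrelated names, so it is best treated as a way of generating candidates rather than as evidence of synonymy.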

If we were even cleverer we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump; PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that list taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature. Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.
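Counting co-occurrences is easy to sketch. The version below assumes each page is available as a list of the names found on it, and it uses a simple cap on the number of names per page as a stand-in for detecting indexes and tables of contents. The cap, the function name `cooccurrence_counts`, and the placeholder names in the example are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(pages, max_names_per_page=50):
    """Count how many pages each pair of names shares.

    Pages with more than max_names_per_page distinct names are skipped,
    on the guess that they are indexes or tables of contents.
    """
    counts = Counter()
    for names in pages:
        unique = sorted(set(names))
        if len(unique) > max_names_per_page:
            continue
        # Sorting first means each pair is always counted under the same key.
        counts.update(combinations(unique, 2))
    return counts

# Placeholder names, not real taxa.
pages = [
    ["Host alpha", "Parasite beta"],
    ["Parasite beta", "Host alpha", "Other gamma"],
    ["Host alpha", "Other gamma"],
]
for pair, n in cooccurrence_counts(pages).most_common(3):
    print(pair, n)
```

Pairs that keep turning up together across many pages are the ones worth displaying; raw counts will favour very common names, so some normalisation would probably be needed before treating them as associations.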

Figuring Out an Accounting Career



What You Can Count On: Job Security

For the 2007 fiscal year, Microsoft reported an annual revenue of $51.2 billion. Behind any company's revenue numbers--big or small--are accountants and financial managers who balance the books. In 2002, the Sarbanes-Oxley Act added further scrutiny to corporate procedures. Between government regulations and the thousands of companies that need to manage finances, the immediate benefit of a career in accounting is a reasonable amount of job security. Additionally, the Bureau of Labor Statistics (BLS) predicts strong growth for accountants and auditors through 2016.

What You Can Take to the Bank: Strong Earnings

Another benefit for an accountant is that the median annual salary for accounting, tax preparation, bookkeeping, and payroll services is $57,020. Going further into the financial services industry, you could become a financial manager for a major corporation and earn in the neighborhood of $105,410 a year according to the BLS. You can also work your way up the corporate ladder to financial director, corporate controller, or even chief financial officer (CFO).

What Education You Need: Accounting Degree and Certification

A college degree and certification are almost essential for advancement and a long-term career in accounting. A bachelor's degree in accounting or a finance-related topic is a solid start, and earning a Certified Public Accountant (CPA) credential furthers your employability prospects. You can even take it a step further by earning a specialized certification such as a Certified Management Accountant (CMA), Certified Internal Auditor (CIA), Accredited Tax Advisor (ATA), or other credential. The American Institute of Certified Public Accountants reports on a survey finding that candidates with a professional certification can earn 10% more than other accountants. A graduate degree can also help you stand out from the crowd.

Questions and Answers About Starting an Accounting Career



An accountant plays a very important role in the functioning and efficiency of a corporation. They provide a number of vital business services to clients including the management of financial matters, auditing, and handling tax issues. However, the specific duties performed in an accounting career will differ depending on what field the practitioner works in, be it public accounting, management accounting, government accounting, or internal auditing.

Accountants generally use computers and specialized accounting programs to assist them in their duties. These programs let accountants summarize and organize data in formats better suited to storage or analysis, and they remove much of the tedious manual work from the job. For this reason, accountants generally have a high level of competence with computers, and many employers require proficiency in these programs to help keep the work accurate.

The environment in which an accountant works varies with the field of accounting and the type of company or organization. The vast majority of accountants work in an office setting, often with many coworkers and colleagues, although some are self-employed and may be able to do part of their job from home. Most accountants work a standard 40-hour week, though there are exceptions, especially tax specialists and self-employed accountants, who may work longer hours during certain times of the year.

Public accounting firms often send their accountants to their clients' place of work or residence to perform audits. In this scenario, there can also be a lot of traveling involved. Accountants who travel often will most likely use a laptop to allow for the increased mobility of their accounting programs, data, and other information needed on the job.

Accountants, regardless of their chosen field, require a proficiency in mathematics as well as business. Many accountants are unlicensed, especially in the fields of government accounting, management accounting, and internal auditing. A bachelor's degree in accounting or a related field is required to become licensed as a Certified Public Accountant (CPA), Public Accountant (PA), Registered Public Accountant (RPA), or Accounting Practitioner (AP). Some companies will require their accountants to hold master's degrees as well.

There is a large demand for accountants, and as more businesses are created in the coming years, the demand is expected to increase. The rapid expansion of business is also expected to have a large effect on the types of responsibilities accountants will have. Nevertheless, these jobs can be very competitive, and many businesses are raising their hiring standards and the qualifications they demand.

Accountants who have a strong knowledge of computers and of several accounting software packages will have a better chance of employment. Those who have more education, training, and experience will also have an edge in the job market. It is also important for accountants to demonstrate interpersonal skills, as these help them perform their job more effectively and get along better with clients.

How to Start Your Accounting Career



You want to be an accountant. You love numbers, maths and money. So, how do you get started? Where do you go to get certified so that your services will be in demand? If you do not have any recognised qualifications, your clients will not be able to know if your standards meet their requirements. People hire Chartered Certified Accountants with a full practising certificate because they know that they can trust in their expertise.

Any old accountancy certification will not do. You need an internationally recognised global qualification to compete in today's industry, and the ACCA qualification fits this demand perfectly. The Association of Chartered Certified Accountants (ACCA) is the world's largest international accountancy body, with over 300,000 members and students in more than 160 countries. Founded in 1904, ACCA has over 100 years of history as a leader in the development of the global accountancy profession. The United Nations has chosen the ACCA syllabus as the basis of its global accountancy curriculum and the ACCA qualification is well recognised in an ever-growing list of countries including the USA, Canada, the United Kingdom, the European Union, Australia, New Zealand, South Africa, China, Singapore, Malaysia and Pakistan. Many ACCA graduates work in premier companies such as British Airways and PricewaterhouseCoopers.

The syllabus spans 16 topics, each with its own examination to test your competency in that subject. It usually takes 2 years for a student to obtain the ACCA qualification. This is broken up into the "Fundamentals" stage that consists of 9 papers and the "Professional" stage that consists of 3 papers and a choice of 4 options. With so many examinations to pass, self-study can be difficult. The good news is that there are many professional accounting and finance schools such as FTMS Global that offer ACCA courses. The better ones have a cast of highly qualified and experienced lecturers who are ACCA-certified. These teachers know what the ACCA syllabus requires and can dramatically increase your chances of passing the examinations. It is highly recommended that you enlist the aid of a mentor who can show you the ropes.

After qualifying as a Chartered Certified Accountant by passing the examinations, to obtain the practising certificate you must have had sufficient experience in a practising accountant's office. On top of all that, you must continue to keep yourself updated by attending courses on a regular basis. The ACCA is the only accountancy body that provides a disciplinary system which offers remedy if any ACCA member breaches its high standards.

Once you obtain the ACCA certification, your clients know with certainty that they can depend on:

- Your integrity

- Your absolute respect for the confidentiality of your client's affairs

- Your knowledge and expertise

- The fact that there is a regulatory body who will ensure that standards are maintained

- The fact that you must operate within a strict framework of rules and ethics

The entry requirements of the ACCA qualification are 2 A-Level passes or a bachelor's degree from a recognised university. If you do not meet this requirement, you may opt to go for the open-entry route by taking the Certified Accounting Technician (CAT) qualification first. Upon completion of the CAT course, you may progress to take the ACCA qualification.

BioNames: yet another taxonomic database

Yet another taxonomic database, this time I can't blame anyone else because I'm the one building it (with some help, as I'll explain below).

BioNames was my entry in EOL's Computable Data Challenge (you can see the proposal here: http://dx.doi.org/10.6084/m9.figshare.92091). In that proposal I outlined my goal:
BioNames aims to create a biodiversity “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon, and that information is seamlessly linked together in one place. It combines classifications from EOL with animal taxonomic names from ION, and bibliographic data from multiple sources including BHL, CrossRef, and Mendeley. The goal is to create a database where the user can drill down from a taxonomic name to see the original description, track the fate of that name through successive revisions, and see other related literature. Publications that are freely available will be displayed in situ. If the taxon has been sequenced, the user can see one or more phylogenetic trees for those sequences, where each sequence is in turn linked to the publication that made those sequences available. For a biologist the site provides a quick answer to the basic question “what is this taxon?”, coupled with graphical displays of the relevant bibliographic and genomic information.

The bulk of the funding from EOL is going into interface work by Ryan Schenk (@ryanschenk), author of synynyms among other cool things. EOL's Chief Scientist Cyndy Parr (@cydparr) is providing adult supervision ("Chief Scientist", why can't I have a title like that?).

Development of BioNames is taking place in the open as much as we can, so there are some places you can see things unfold:



I've lots of terrible code scattered around which I am in the process of organising into something usable, which I'll then post on GitHub. Working with Ryan is forcing me to be a lot more thoughtful about coding this project, which is a good thing. Currently I'm focussing on building an API that will support the kinds of things we want to do. I'm hoping to make this public shortly.

The original proposal was a tad ambitious (no, really). Most of what I hope to do exists in one form or another, but making it robust and usable is a whole other matter.

As the project takes shape I hope to post updates here. If you have any suggestions feel free to make them. The current target is to have this "out the door" by the end of May.

In defence of OpenURL: making bibliographic metadata hackable

This is not a post I'd thought I'd write, because OpenURL is an awful spec. But last week I ended up in vigorous debate on Twitter after I posted what I thought was a casual remark:



This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS).

This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users.

Ed wrote:

I prefer to encourage publishers to use HTML's metadata facilities using the <meta> tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done.

That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines:
Tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy).
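To make "reading" a web page concrete, here is a minimal sketch of pulling Google Scholar-style `citation_*` meta tags out of a page using only the Python standard library. It is not how Google, Mendeley, or Zotero actually do it; the class and function names are invented for the example, and the sample HTML is hand-written around a reference cited elsewhere on this blog.

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        content = attrs.get("content")
        if name.startswith("citation_") and content is not None:
            # Tags such as citation_author can repeat, so keep a list per name.
            self.metadata.setdefault(name, []).append(content)

def read_citation_meta(html):
    parser = CitationMetaParser()
    parser.feed(html)
    parser.close()
    return parser.metadata

sample_page = """
<html><head>
<meta name="citation_title" content="Charting taxonomic knowledge through ontologies and ranking algorithms">
<meta name="citation_author" content="Huber, R.">
<meta name="citation_author" content="Klump, J.">
<meta name="citation_doi" content="10.1016/j.cageo.2008.02.016">
</head><body></body></html>
"""
print(read_citation_meta(sample_page))
```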

But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helped me find those links.

If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do. Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself.

This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI.

So, what I found frustrating about the conversation with Ed was that it seemed to me that his insistence on following certain standards was at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user and accept that users may want to do stuff that you don't (indeed can't) anticipate then COinS are useful. This is the "genius of and", why not support both approaches?

Now, COinS are not the only way to implement what I want to do, we could imagine other ways to do this. But to support the functionality that they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now and work. I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet:
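The three pieces (metadata encoded in the page, a way to extract it and form a query URL, and a service that understands that URL) can be sketched briefly. A COinS is an empty `<span>` with class `Z3988` whose `title` attribute holds the URL-encoded OpenURL context object, so turning it into a link is mostly a matter of appending it to a resolver's address. In the sketch below the resolver URL is a placeholder, the class and function names are invented, and the sample span is hand-written for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs

class CoinsExtractor(HTMLParser):
    """Collect the title attribute of every <span class="Z3988">."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attrs.get("title"):
            # HTMLParser has already turned &amp; back into & for us.
            self.context_objects.append(attrs["title"])

def coins_to_openurls(html, resolver):
    """Return one OpenURL per COinS found in the page."""
    parser = CoinsExtractor()
    parser.feed(html)
    parser.close()
    return [f"{resolver}?{context}" for context in parser.context_objects]

sample_html = (
    '<p>Huber &amp; Klump (2009) <span class="Z3988" '
    'title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal'
    '&amp;rft.jtitle=Computers+%26+Geosciences&amp;rft.volume=35&amp;rft.issue=4'
    '&amp;rft.spage=862&amp;rft.date=2009"></span></p>'
)

# Placeholder resolver address; substitute the OpenURL service you want to query.
urls = coins_to_openurls(sample_html, "https://resolver.example.org/openurl")
print(urls[0])
print(parse_qs(urls[0].split("?", 1)[1])["rft.jtitle"])
```

The point of the sketch is that the page author does not need to know which resolver I will use: the same span can be sent to any service that speaks OpenURL.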

If you publish bibliographic data and don't use COinS you are doing it wrong

Bibliographic metadata pollution

I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:

[Image: Google search results]

If I search for Die cestoden der Vogel in Google I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, other booksellers, etc.). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself.

One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications).

As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But in the meantime, if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff".
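Even very naive title matching would collapse a lot of this repetition. The sketch below clusters records on a normalised title (lower case, accents stripped, punctuation removed). It is an illustration only: the record structure, the function names, and the source labels in the example are assumptions, and a real system would presumably also use ISBNs and fuzzier matching.

```python
import re
import unicodedata
from collections import defaultdict

def normalise_title(title):
    """Lower-case a title, strip accents, and collapse punctuation to spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()

def cluster_by_title(records):
    """Group records (dicts with a 'title' key) that share a normalised title."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalise_title(record["title"])].append(record)
    return dict(clusters)

# Illustrative records; the 'source' values are just labels.
hits = [
    {"title": "Die cestoden der Vogel", "source": "bookseller A"},
    {"title": "Die Cestoden der Vögel.", "source": "bookseller B"},
    {"title": "Die Cestoden der Vögel", "source": "library catalogue"},
]
for key, group in cluster_by_title(hits).items():
    print(key, len(group))
```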