Where is the "crowd" in crowdsourcing? Mapping EOL Flickr photos

In any discussion of data gathering or data cleaning the term "crowdsourcing" inevitably comes up. An example where this approach has been successful is the Encyclopedia of Life's Flickr pool, where Flickr users upload images that are harvested by EOL.

Given that many Flickr photos are taken with cameras that have built-in GPS (such as the iPhone, the most common camera on Flickr) we could potentially use the Flickr photos not only as a source of images of living things, but to supplement existing distributional data. For example, Flickr has enough data to fairly accurately construct outlines of countries, cities, and neighbourhoods, see The Shape of Alpha, so what about organismal distribution?

This question is part of a Masters project by Jonathan McLatchie here at Glasgow, comparing distributions of taxa in GBIF with those based on Flickr photos. As part of that project the question arose "where are the Flickr photos being taken?" If most of the photos are being taken in the developed world, then there are at least two problems. The first is the obvious bias against organisms that live elsewhere (i.e., typically many photos won't be taken in those regions where you'd actually like to get more data). Secondly, the presence of zoos, wildlife parks, and botanical gardens means you are likely to get images of organisms well outside their natural range.

Jonathan suggested a "heatmap" of the Flickr photos would help, so to create this I wrote a script to grab metadata for the photos from the Encyclopedia of Life's Flickr pool, extract latitude and longitude, and draw the resulting locations on a map. I aggregated the points into 1°×1° squares, and generated a GBIF-style map of the photos:

Screenshot
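
The aggregation step described above can be sketched roughly as follows. This is a minimal Python illustration, not the PHP script used for the map; the function names `cell_for` and `aggregate` are hypothetical, and clamping points that sit exactly on the 90° or 180° edge is an assumption about how such points should be handled.

```python
import math
from collections import Counter


def cell_for(lat, lon):
    """Return the south-west corner of the 1-degree square containing a point."""
    # math.floor (rather than int) so that negative coordinates round
    # towards the south-west corner instead of towards zero.
    lat_cell = math.floor(lat)
    lon_cell = math.floor(lon)
    # Assumption: points exactly on the northern or eastern edge of the
    # map are folded into the last row or column of squares.
    lat_cell = min(lat_cell, 89)
    lon_cell = min(lon_cell, 179)
    return lat_cell, lon_cell


def aggregate(points):
    """Count how many (lat, lon) points fall in each 1-degree square."""
    return Counter(cell_for(lat, lon) for lat, lon in points)


if __name__ == "__main__":
    # Illustrative coordinates only (two near Glasgow, one near Sydney).
    sample = [(55.87, -4.29), (55.10, -4.90), (-33.87, 151.21)]
    for cell, count in sorted(aggregate(sample).items()):
        print(cell, count)
```

The per-square counts can then be mapped to a colour scale when drawing each square, which is what gives the map its "heatmap" appearance.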

Lots of photos from North America, Europe, and Australasia, as one might expect. Coverage of the rest of the globe is somewhat patchy. I guess the key question to ask is to what extent the "crowd" (Flickr users in this case) is essentially replicating the sampling biases already in projects like GBIF that are aggregating data from museum collections (most of which are in the developed world).

The PHP code to fetch the photo data and create the map is available on GitHub. You'll need a Flickr API key to run the script. The GitHub repository has an SVG version of the map (with a bitmap background). A bitmap copy of the map is available on FigShare: http://dx.doi.org/10.6084/m9.figshare.92668.
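
For readers who want a feel for the fetching step without reading the PHP, here is a hedged Python sketch. It assumes the photos are requested through the Flickr API method `flickr.groups.pools.getPhotos` with `extras=geo`, and that photos without a location come back with latitude and longitude of 0; both are assumptions about how one might do this, not a description of the script in the repository. The function names, the API key, and the group ID are placeholders.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def pool_photos_url(api_key, group_id, page=1):
    """Build a request URL for one page of photos from a Flickr group pool."""
    params = {
        "method": "flickr.groups.pools.getPhotos",
        "api_key": api_key,    # your own Flickr API key
        "group_id": group_id,  # placeholder: the ID of the pool to harvest
        "extras": "geo",       # ask for latitude and longitude
        "per_page": 100,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


def extract_points(response):
    """Pull (lat, lon) pairs out of a decoded JSON response."""
    points = []
    for photo in response.get("photos", {}).get("photo", []):
        try:
            lat = float(photo.get("latitude", 0))
            lon = float(photo.get("longitude", 0))
        except (TypeError, ValueError):
            continue
        # Assumption: 0,0 means "no location recorded", so skip it.
        if lat == 0 and lon == 0:
            continue
        points.append((lat, lon))
    return points
```

Each page of results would be fetched from the URL, decoded as JSON, and passed to `extract_points`; the resulting coordinates are what get drawn on the map.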

UUIDs

Just for future reference:

More fictional taxa and the myth of the expert taxonomic database

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of Life, which states that it is a "quality-assured checklist", have records that are patently wrong. Here's yet another example.

If you search for the genus Raymondia in the Catalogue of Life you get multiple occurrences of the same species names, e.g.:

Raymondionymus fossor Hustache, A., 1930
Raymondionymus fossor Ganglebauer, L., 1906



Both of these are listed as "provisionally accepted names", supplied by WTaxa: Electronic Catalogue of Weevil names (Curculionoidea). Clearly we can't have two species with the same name, so what's happening?

Firstly, Hustache, A., 1930 is:

Hustache A (1930) Curculionidae Gallo-Rhénans. Annales de la Société entomologique de France 99: 81-272. http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3

On p. 246 Hustache refers to Raymondionymus fossor Aubé, 1864 (see below).

F168 highres

So, Raymondionymus fossor Hustache, A., 1930 is not a new species but simply the citation of a previously published one (it's a chresonym). Hustache cites the author of the name as Aubé, 1864, and you can see the original description by Aubé in BioStor (Description de six espèces nouvelles de Coléoptères d'Europe dont deux appartenant a deux genres nouveaux et aveugles, http://biostor.org/reference/104589). So, if the taxonomic authority should be Aubé, 1864, what about Raymondionymus fossor Ganglebauer, L., 1906? Again, if we track down the original publication (Revision der Blindrüsslergattungen Alaocyba und Raymondionymus, http://biostor.org/reference/104591) it's simply Ganglebauer citing (on p. 142) Aubé's paper, not describing a new species.

Note that the nomenclature of this weevil species is further complicated because Aubé originally described the species as Raymondia fossor, but Raymondia was already in use for a fly (see Über eine neue Fliegengattung: Raymondia, aus der Familie der Coriaceen, nebst Beschreibung zweier Arten derselben, http://biostor.org/reference/104588). To resolve this homonymy Wollaston proposed the name Raymondionymus:

Wollaston, T. V. (1873). XVIII. On the Genera of the Cossonidae. Transactions of the Royal Entomological Society of London, 21(4), 427–652. doi:10.1111/j.1365-2311.1873.tb00645.x http://biostor.org/reference/51301

So, we have a bit of a mess. Unfortunately this mess percolates up through other databases, for example EOL has three different pages for Raymondionymus fossor.
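
Duplicates of this kind are easy to flag mechanically, even if deciding which authority is correct still means going back to the literature. The following is a minimal Python sketch, assuming a checklist can be exported as (name, authority) pairs; the function name is hypothetical and the two records are the ones discussed above.

```python
from collections import defaultdict


def names_with_multiple_authorities(records):
    """Return names that appear in the list with more than one authority."""
    authorities = defaultdict(set)
    for name, authority in records:
        authorities[name].add(authority)
    return {
        name: sorted(auths)
        for name, auths in authorities.items()
        if len(auths) > 1
    }


if __name__ == "__main__":
    records = [
        ("Raymondionymus fossor", "Hustache, A., 1930"),
        ("Raymondionymus fossor", "Ganglebauer, L., 1906"),
    ]
    for name, auths in names_with_multiple_authorities(records).items():
        print(name, "->", "; ".join(auths))
```

A check like this only finds candidates: it cannot tell a chresonym from a genuine homonym, which is exactly why access to the original publications matters.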

For me the lesson here is that relying on acquiring data from "trusted" sources, curated by "experts" is simply not a tenable strategy for building lists of taxa. If names are essential bits of biodiversity infrastructure upon which we hang other data, then these lists need to be cleaned, which means exposing them to scrutiny, and providing an easy means for errors to be flagged and corrected. Trust is something that is earned, not asserted, and it's time taxonomic databases stop claiming to be authoritative simply because they rely on expert sources. Expertise is no guarantee that you won't make errors.

For me this is one of the key reasons projects like BHL are so important. As more and more of the original literature becomes available, we lessen our reliance on "expertise". We can start to see for ourselves. In other words, "Nullius in verba" ("take nobody's word for it").