Search this keyword

Google Knowledge Graph using data from BBC and Wikipedia

Google's Knowledge Graph can enhance search results by display some structured information about a hit in your list of results. It's available in the US (i.e., you need to use www.google.com, although I have seen it occasionally appear for google.co.uk.

Fruitbat
Here is what Google displays for Eidolon helvum (the straw-coloured fruit bat). You get a snippet of text from Wikipedia, and also a map from the BBC Nature Wildlife site. Wikipedia is a well-known source of structured data (in that you can mine the infoboxes for information). The BBC site has some embedded RDFa and structured HTML, and you can also get RDF (just append ".rdf" to the URL, i.e., http://www.bbc.co.uk/nature/life/Straw-coloured_Fruit_Bat.rdf). There doesn't seem to be anything in the RDF about the distribution map, so presumably Google are extracting that information from the HTML.

It would be interesting to think about what other biodiversity data providers, such as GBIF and EOL could do to get their data incorporated into Google's Knowledge Graph, and eventually into these search result snippets.

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recent added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994) http://biostor.org/reference/105389

This paper describes a new frog (i>Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for Queensland Museum), which according to the local copy of GBIF data that I have, corresponds to http://data.gbif.org/occurrences/363089399/. Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discover that specimen is now in GBIF as http://data.gbif.org/occurrences/478001337/. At some point GBIF has updated its data and the old occurrenceID for QM J58650 (363089399) has been deleted. Noooo!

Looking at the old record I have there is an additional identifier:
urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid as it contains a space. This is why URNs are useless. There's no expectation they will be resolvable hence there's no incentive to make sure they are correct. It's as much use as writing software code but not bothering to run it (because surely it will work, no?).

The GBIF record http://data.gbif.org/occurrences/478001337/ contains a UUID as an alternative identifier:
bc58ce6b-3cc3-459a-9f5b-4a70a026afbe

If you Google this you discover a record in the Atlas of Living Australia http://biocache.ala.org.au/occurrences/bc58ce6b-3cc3-459a-9f5b-4a70a026afbe, which also lists the URN from the now deleted GBIF record http://data.gbif.org/occurrences/363089399/.

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. Meantime I need to think about mechanisms to handle links that disappear on a whim.

Microbiome as climate, macrobiome as weather, and a global model of biodiversity

Lp attenboroughHalf-baked idea time. Thinking about projects such as the Earth Microbiome Project and Genomic Observatories, the recent GBIC2012 meeting (I'm still digesting that meeting), and mulling over the book A Vast Machine I keep thinking about the possible parallels between climate science and biodiversity science.

One metaphor from "A Vast Machine" is the difference between "global data" and "making data global". Getting data from around the world ("global data") is one challenge, but then comes making that data global:
building complete, coherent, and consistent global data sets from incomplete, inconsistent, and heterogeneous datasources

The focus of GBIF's data portal is global data, bringing together specimen records and observations from all around the world. This is global data, but one could argue that it's not yet ready to be used for most applications. For example, GBIF doesn't give you the geographic distribution of a given species, merely where it's been recorded from (based on that subset of records that have been digitised). That's a very important start, but if we had for each species an estimated distribution based on museum records, observations, published maps, together with habitat modelling, then we'd be closer to a dataset that we could use to tackle key questions about the distribution of biodiversity.

EMP green smallBut if we continue with the theme that microbiology is the dark matter of biology, and if we look at projects like the Earth Microbiome Project, then we could argue that focussing on eukaryote, particularly macro-eukaryote such as plants, fungi, and animals, may be a mistake. To use a crude analogy, perhaps we have been focussing on the big phenomena (equivalent to thunder storms, flash floods, tornados, etc.) rather than the underlying drivers (equivalent to climatic processes such as those captured in global climate models). Certainly, any attempt to model the biosphere is going to have to include the microbiome, and indeed perhaps the microbiome would be enough to have a working model of the biosphere?

I'm simply waving my arms around here (no, really?), but it's worth thinking about whether the macroecology that conservation and biodiversity focusses on is actually the important thing to consider if you want to model fundamental biological processes. Might macro-organisms be like the weather, and the microbiome is like the climate. As humans we notice the weather, because it is at a scale that affects us directly. But if the weather is a (not entirely predictable) consequence of the climate, what is the equivalent of global climate model for biodiversity?