iDigBio API Hackathon Research Group Report

Thu, 06/18/2015 - 1:57pm -- mcollins

June 3-5, 2015 (iDigBio API Hackathon) – Team Research blog, by Scott Chamberlain (rOpenSci), Matthew Collins (UF/iDigBio), Brian Franzone (Harvard University Herbaria & Libraries), Ronny M. Leder (FLMNH), François Michonneau (FLMNH/iDigBio), Tianhong Song (UC Davis, Kurator), and Mike Trizna (Smithsonian/Barcode of Life)


During the first day and a half, each of us worked independently on our own projects. Then, after lunch on the second day, we re-grouped and started discussing, and writing exploratory code for, detecting duplicate specimen records both within iDigBio and between iDigBio and VertNet. (We used VertNet as an example of another data provider because Scott was familiar with its data set and could work with it quickly.)

Scott Chamberlain worked on adding iDigBio as a data source to the spocc (SPecies OCCurrence data) R package in the rOpenSci project. This package fetches occurrence data from GBIF, BISON, Vertnet, and other major data providers with a single interface. Initial support for iDigBio has been added. The development status can be viewed on GitHub.

The ridigbio package, which provides access to the iDigBio API, was improved and documented by Matthew Collins. Additional work after the hackathon resulted in this package being made available on the Comprehensive R Archive Network (CRAN), so it can now be installed in R with the command install.packages("ridigbio"). The source code and issue tracker for this package are also on GitHub.
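For readers who don't use R, the same kind of search can be expressed directly against the public iDigBio search API, which ridigbio wraps. The sketch below builds the JSON payload a client would POST to the records endpoint; the genus value and the limit are illustrative, and the "rq" (record query) structure follows the public API documentation.

```python
import json

# Public iDigBio search API endpoint for specimen records.
SEARCH_URL = "https://search.idigbio.org/v2/search/records/"

def build_query(genus, limit=10):
    """Build the JSON payload for a records search filtered to one genus.

    "rq" holds the record query, expressed with Darwin Core field names.
    """
    return {"rq": {"genus": genus}, "limit": limit}

payload = build_query("acer", limit=5)
body = json.dumps(payload)  # the request body a client would POST
```

A real client would send `body` with an HTTP POST (for example via the requests library) and page through the "items" array in the response.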

Brian Franzone worked on integrating his existing Climate Data Fetcher with location data from the iDigBio API. We set up a temporary site running an example of a web interface showing iDigBio specimens and the weather data from nearby stations around the time they were collected. The site's HTML is currently generated statically, but it could be turned into a dynamic web application.

François Michonneau started development of Blaschka, an R Shiny application designed to visualize search results from iDigBio. A demonstration install of this application is served from a temporary machine. You can view histograms of when specimens were collected and bar plots of the percentage of records with specific fields filled out. This application is a template for how others can build extensions on top of iDigBio data using the API without needing to interact with iDigBio staff directly.
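The fill-rate bar plots boil down to a simple per-field calculation. Here is a minimal sketch of that calculation in Python (Blaschka itself is written in R/Shiny); the sample records and field names are made up for illustration.

```python
def field_fill_rates(records, fields):
    """Percentage of records with each field present and non-empty."""
    n = len(records)
    return {f: 100.0 * sum(1 for r in records if r.get(f)) / n
            for f in fields}

# Hypothetical records using Darwin Core-style field names.
sample = [
    {"scientificname": "Acer rubrum", "country": "United States"},
    {"scientificname": "Acer saccharum", "country": ""},
    {"scientificname": "", "country": "Canada"},
]
rates = field_fill_rates(sample, ["scientificname", "country"])
# Both fields are filled in 2 of 3 records, so each rate is ~66.7%.
```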

The Kurator project aims to provide a workflow and tools to curate and correct biodiversity data. It features tools to read occurrence data from SQL databases, CSV files, and Excel files. During the hackathon, Tianhong Song worked on adding a data reader that reads data from the iDigBio API. Kurator generates spreadsheets containing correction information for the input files and can push correction annotations into the FilteredPush network. This initial code only generates JSON output, but it is ready to hook into an annotation system.

Mike Trizna investigated the quality and quantity of data in iDigBio that can be used to link specimen records to GenBank genetic sequences. This data is stored in the dwc:associatedSequences field, but the formatting of the data is not specified. Using regular expressions, Mike was able to quantify how many sequences were stored in this field. His work is summarized in an IPython notebook that is viewable from the GitHub repository.
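To give a flavor of the approach, the sketch below counts accession-like tokens with one plausible pattern for GenBank nucleotide accessions (one letter plus five digits, or two letters plus six digits, e.g. U12345 or AB123456). This is an illustrative pattern and sample, not the exact expression or data from Mike's notebook.

```python
import re

# One common GenBank nucleotide accession shape: 1 letter + 5 digits,
# or 2 letters + 6 digits. Real data also contains URLs and free text.
ACCESSION_RE = re.compile(r"\b[A-Z]\d{5}\b|\b[A-Z]{2}\d{6}\b")

def count_accessions(values):
    """Count accession-like tokens across a list of field values."""
    return sum(len(ACCESSION_RE.findall(v or "")) for v in values)

# Hypothetical dwc:associatedSequences values.
sample = [
    "http://www.ncbi.nlm.nih.gov/nuccore/AB123456",
    "GenBank: U12345; U12346",
    "no sequence data",
]
count_accessions(sample)  # → 3
```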

During the second half of the hackathon, Scott, Matt, Brian, and Mike discussed approaches to finding duplicate specimen records within and across providers. We acknowledged that such duplicates were going to be common, particularly when using federating tools like the spocc R package that Scott is developing. We wrote some test code to explore the computational feasibility of record-to-record comparisons for large data sets and quickly realized this is an O(n²) problem: the time the comparisons take grows with the square of the number of records. We still wrote some code to try it out and found that even a few hundred records took about 30 seconds with a single computation thread. We took notes and documented ideas in the GitHub repository.
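The shape of that pairwise comparison can be sketched as follows. The similarity function and threshold here are toy stand-ins, not the metric we experimented with; the point is that n records require n·(n-1)/2 comparisons, so 100 records mean 4,950 comparisons and 10,000 records mean nearly 50 million.

```python
from itertools import combinations

def similarity(a, b):
    """Toy similarity: fraction of shared fields with identical values."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)

def find_duplicates(records, threshold=0.9):
    """Compare every pair of records: n*(n-1)/2 comparisons, hence O(n^2)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]

# Hypothetical records; the first two are exact duplicates.
recs = [
    {"genus": "Acer", "specificepithet": "rubrum", "country": "United States"},
    {"genus": "Acer", "specificepithet": "rubrum", "country": "United States"},
    {"genus": "Acer", "specificepithet": "saccharum", "country": "Canada"},
]
find_duplicates(recs)  # → [(0, 1)]
```

Blocking or indexing strategies (comparing only records that share, say, a collector name or locality) are the usual way to cut the quadratic cost down, which is why this stayed exploratory during the hackathon.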

Go back to read the other reports.