January 5, 2017
At the 2016 meeting of the Biodiversity Information Standards organization
(TDWG), Matthew Collins, Jen Hammock (Encyclopedia of Life), and Alexander Thompson chaired a symposium titled “Big Data Analysis Methods and Techniques as Applied to Biocollections”. The TDWG community is at a turning point that reflects the success of the work it has done over the past decades. Large biodiversity datasets and the computing resources to use them are now commonly and freely available, and the community is ready to start doing analyses that cross taxonomy, space, genetics, and time. This symposium was an opportunity to bring those kinds of analyses in front of the whole TDWG group.
Dr. Kelling from the Cornell Lab of Ornithology presented a talk titled “Taking a Big Data Approach to Estimating Species Abundance”, which provided a higher-level description of the challenges and costs of analyzing large datasets with computational models. The eBird project has addressed the computational scale issues by buying resources in Microsoft’s Azure cloud, which has cut the time to produce a species distribution visualization from days to hours, but each visualization still has a cost in the range of hundreds of dollars.
The last three talks related to how having an infrastructure for analyzing big data might affect the field. While it is important to introduce data science concepts to TDWG, it is equally important to reduce the barriers to working with the large and heterogeneous datasets required to apply them. As Dr. Kelling pointed out, this type of science currently requires technical engineering and money.
Jen Hammock from the Encyclopedia of Life (EOL) gave a talk titled “Fresh Data: what's new and what's interesting?”. She presented a web application called Fresh Data (http://gimmefreshdata.github.io/), which generates notifications of new species observations based on this infrastructure. This is not possible without both big data processing techniques and a re-envisioning of data sources: not as silos that sit with individual projects, but as resources to be aggregated and processed in real time. This ability to do responsive processing on data in place is important for the community's move into the big data era.
The final talk, by Alexander Thompson, titled “Data Quality at Scale: Bridging the Gap between Datum and Data”, covered the integration of the concepts discussed earlier in the conference during the data quality sessions with the idea of big data workflows. Data quality is a good application for these workflows because individual measures of quality are weak, and quality is often relative to the rest of the data, as with outlier detection and clustering. By applying clustering and classification techniques as part of the process of bringing data into aggregators, data can be improved beyond what a single provider can do on its own.
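To make the idea concrete, here is a minimal sketch (not the implementation discussed in the talk) of why quality checks are often only possible in aggregate: a single provider's records may all look plausible on their own, but an outlier becomes obvious when compared against the pooled data for a species. The records, field names, and z-score threshold below are all hypothetical.

```python
# Hypothetical illustration: flag specimen records whose latitude deviates
# strongly from the per-species mean across ALL aggregated records --
# a quality signal a single data provider cannot compute alone.
from statistics import mean, stdev

def flag_outliers(records, threshold=3.0):
    """Return records whose latitude z-score exceeds the threshold."""
    lats = [r["lat"] for r in records]
    mu, sigma = mean(lats), stdev(lats)
    return [r for r in records if sigma and abs(r["lat"] - mu) / sigma > threshold]

# Twenty plausible records plus one likely sign-flip transcription error.
records = [{"id": i, "lat": 28.0 + 0.1 * i} for i in range(20)]
records.append({"id": 99, "lat": -28.9})
print([r["id"] for r in flag_outliers(records)])  # → [99]
```

In practice an aggregator would use richer features (date, elevation, taxonomy) and proper clustering or classification models, but the principle is the same: quality is judged relative to the rest of the data.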
Thank you to all of our presenters for making this a great session. The symposium, like nearly all others at TDWG 2016, was recorded. You can watch videos of talks you missed and read more about iDigBio's participation at TDWG 2016 at https://www.idigbio.org/wiki/index.php/TDWG_2016_Annual_Conference
. Direct links to each talk hosted in iDigBio's Vimeo channel:
Clustering botanical collections data with a minimised set of features drawn from aggregated specimen data
Taking a Big Data Approach to Estimating Species Abundance
Large-scale Evaluation of Multimedia Analysis Techniques for the Monitoring of Biodiversity
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data
Fresh Data: what's new and what's interesting?
Data Quality at Scale: Bridging the Gap between Datum and Data