Poster Title: Whole-Dataset Analyses using Apache Spark
Authors: Matthew Collins, Jorrit Poelen, Alexander Thompson
Abstract: Processing snapshots of biodiversity data providers’ entire datasets locally is an important capability. It allows broad questions to be asked across multiple data providers without waiting for providers to develop integrations or interfaces with each other; it frees data from the presentation constraints of providers’ web interfaces and application programming interfaces (APIs); and it allows data to be processed locally at a much higher rate than through APIs.
Historically, analyzing whole datasets required more storage and memory than most researchers had available. Applications like Excel and R prefer to keep data in memory, and working with 30 gigabytes or more of data is difficult. Languages like Python and Java are more flexible, but processing this much data efficiently requires writing complicated parallel processing code. This is one reason most large data projects use a central data store on their own servers, with a web portal and APIs providing data access to researchers. Examples include the Global Biodiversity Information Facility (GBIF), iDigBio, the Biodiversity Heritage Library (BHL), and the Encyclopedia of Life (EOL).
In 2014, Spark became an Apache Software Foundation top-level project, and its popularity as a big data processing engine has taken off. This implementation of the map-reduce pattern of data processing is much simpler to install and use than its industry-favorite predecessor, Hadoop. With Spark, arbitrary querying, joining, and reducing operations on and between entire biodiversity datasets can be done with very little code on a desktop computer or on commonly available cloud computing resources.
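To illustrate, below is a minimal sketch of such a cross-provider job written against Spark's Scala RDD API. The file paths, column positions, and name-matching logic are hypothetical examples, not a description of any tool mentioned in this abstract; it assumes each provider's snapshot has been exported as tab-separated text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeDatasetJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-dataset-join"))

    // Parse a tab-separated snapshot into (scientificName, full record) pairs.
    def byName(path: String, nameCol: Int) =
      sc.textFile(path)
        .map(_.split("\t", -1))
        .filter(_.length > nameCol)
        .map(fields => (fields(nameCol).trim.toLowerCase, fields.mkString("\t")))

    // Hypothetical paths and column positions for two providers' snapshots.
    val idigbio = byName("idigbio-occurrences.tsv", 4)
    val gbif    = byName("gbif-occurrences.tsv", 7)

    // Join the two snapshots on scientific name and count matches per name.
    val matchesPerName = idigbio.join(gbif)
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)

    matchesPerName.saveAsTextFile("name-match-counts")
    sc.stop()
  }
}
```

Because Spark distributes each transformation across available cores or cluster nodes automatically, the same code runs unchanged on a laptop or on rented cloud machines; only the cluster configuration differs.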
During the iDigBio API hack-a-thon, members of the Global Biotic Interactions (GloBI) and iDigBio technical teams met and started an effort to provide infrastructure, examples, and web services that use Spark to answer expensive biodiversity questions with freely available datasets. A quick name-linking script was developed during the hack-a-thon. As the effort progressed, a Spark cluster was constructed and run overnight to determine all the unique values in every data field of the 44-million-record iDigBio dataset. Sparkonomy, an iDigBio tool, was developed to join tokenized taxon names from iDigBio to GBIF’s backbone taxonomy in a few minutes on a desktop computer. Effechecka, from EOL, is an early-phase web application that uses Spark jobs to construct checklists for taxon and spatial queries from iDigBio occurrence information. From these initial efforts, GloBI and iDigBio are seeking feedback and requirements for further work.
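As an illustration of the unique-values job described above, here is a sketch under the assumption that the snapshot is tab-separated text with a header row of field names; the input path and output format are hypothetical, and the actual overnight job's details may differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UniqueValuesPerField {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unique-values"))

    val lines      = sc.textFile("idigbio-snapshot.tsv") // hypothetical path
    val headerLine = lines.first()
    val header     = headerLine.split("\t", -1)

    // Explode every record into (fieldName, value) pairs, then deduplicate.
    val uniquePairs = lines
      .filter(_ != headerLine)
      .flatMap { line =>
        val fields = line.split("\t", -1)
        fields.zip(header).collect {
          case (value, name) if value.nonEmpty => (name, value)
        }
      }
      .distinct()

    // Summarize: how many distinct values appear in each field.
    val countsPerField = uniquePairs.mapValues(_ => 1L).reduceByKey(_ + _)
    countsPerField.collect().foreach { case (field, n) => println(s"$field\t$n") }

    sc.stop()
  }
}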
Where: Biodiversity Information Standards (TDWG) Conference 2015