From Deb Paul, @idbdeb
This 4-day hands-on short course in March investigated current trends in collecting, and focused on best practices and skills development for supporting the collection and sharing of robust, fit-for-research-use data.
What can we do to facilitate stakeholders’ access to quality data? High-quality data starts with planning data collection before anyone heads to the field. Fixing data errors “after the fact” is expensive, and it gets more expensive the further we get from the original specimen collecting event. Starting with richer, more standardized data should also mean faster access to the data for everyone.
This Field to Database (F2DB) course was our third in a series of four biodiversity informatics workshops*, each focusing on different stakeholders’ needs and the relevant collections-data and computational literacy skills. On our first day and a half, 22 participants heard from several different collectors about their collecting and data management practices and then headed to the field to put them into practice. After this, we spent three days learning how to use R for data cleaning, research, and visualization. All the course materials, links to necessary software, and workshop recordings are available on the wiki. What follows is an overview of our four days.
In the classroom, Charlotte Germain-Aubrey (Botanist, iDigBio PostDoc) and Katja Seltmann (Entomologist, TTD-TCN Project Manager) presented Why a Field-to-Database Biodiversity Informatics Workshop? They kick-started our specimen data conversation with examples of challenges researchers face when compiling data from museum legacy records across many collections. These include (summarized from slides):
- standardizing datasets
- the need to georeference the material
- transforming lat / lon values to a standard format
- missing uncertainty values for any given georeference
- assumptions forced on us by ambiguous date formats
- taxon name resolution / reconciliation needed to merge datasets
- learning to manage the resulting very large datasets – very large files
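Two of these cleanup steps can be sketched in code. The course's hands-on work used R; as a minimal, language-neutral illustration (with made-up coordinate and date values, not data from the workshop), here is what coordinate conversion and unambiguous date formatting look like:

```python
import re
from datetime import date

def dms_to_decimal(dms):
    """Convert a degrees-minutes-seconds string like 29°38'07"N
    to decimal degrees (negative for the S and W hemispheres)."""
    m = re.match(r"(\d+)\D+(\d+)\D+([\d.]+)\D*([NSEW])", dms)
    if not m:
        raise ValueError(f"Unrecognized coordinate: {dms!r}")
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemi in "SW" else value

print(round(dms_to_decimal('29°38\'07"N'), 5))  # 29.63528

# "03/04/2015" is ambiguous (March 4 or April 3?); recording dates
# as ISO 8601 (YYYY-MM-DD) removes the guesswork later.
print(date(2015, 3, 4).isoformat())  # 2015-03-04
```

Doing these conversions once, at capture time, is far cheaper than untangling mixed formats across merged legacy datasets years later.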
Only after these issues are dealt with is legacy data fit for use. Charlotte showed one example of how plant collections data are being used to model the impact of climate change and hinted at future research plans to investigate what is likely to happen to Florida plants when considering species clusters, movement analysis, and sea-level rise. Katja and Charlotte showed us both the challenges and the potential of collections data.
Emilio Bruna, Ecology Professor at the University of Florida, shared insights into the realities of field work with Let's go to the field! Where the best places are wet, isolated, and without internet. A story of the trials of typical fieldwork. (Hear his talk in this recording.) Next up, we heard from Andrew Short (Entomologist, University of Kansas Biodiversity Institute) with Tips and Workflows for Managing Field Data: Field templates, workflow, and planning ahead for better results. Then Grant Godden (Botanist, Rancho Santa Ana Botanic Garden Post Doc) gave us his take on Using Digital Resources to Plan Field Expeditions, offering hints on how to prioritize where you collect, how to plan a collecting trip, and what resources to bring into the field. Just back from a recent field trip to Colombia, he also talked about Standards for Collection of Genomic Resources and documenting flower color.
In this recording, you can listen to Mike Webster, Ornithologist at Cornell, talking about Data and metadata standards for biodiversity media: the past, present and future and Emilio Bruna talking about the Top 10 mobile applications every biologist should know about. Are you using apps in the field? Which ones? What apps do you need that don’t yet exist? How have they facilitated your research efforts?
After all these lectures, we moved to the Natural Area Teaching Laboratory for lunch and some field work. Deb Paul (that’s me) gave a quick introduction to some relevant data standards to use when collecting and using field data, such as Ecological Metadata Language (EML), Darwin Core (DwC), Audubon Core (AC), and the new Global Genome Biodiversity Network (GGBN) genomics data standard.
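To make the standards point concrete: Darwin Core is essentially an agreed vocabulary of term names for occurrence data. The term names below (recordedBy, decimalLatitude, eventDate, and so on) are real Darwin Core terms; the field-notebook record and its values are invented for illustration:

```python
# A raw field-notebook record (hypothetical values), mapped onto
# standard Darwin Core terms so any downstream database or
# aggregator can interpret it without guessing.
field_record = {
    "collector": "D. Paul",
    "field_no": "DP-0042",
    "lat": 29.6353,
    "lon": -82.3700,
    "date": "2015-03-10",
    "where": "Natural Area Teaching Lab, Gainesville, FL, USA",
}

dwc_terms = {
    "recordedBy": field_record["collector"],
    "recordNumber": field_record["field_no"],
    "decimalLatitude": field_record["lat"],
    "decimalLongitude": field_record["lon"],
    "eventDate": field_record["date"],  # ISO 8601, unambiguous
    "locality": field_record["where"],  # prose locality, not just a georeference
}

for term, value in dwc_terms.items():
    print(f"dwc:{term} = {value}")
```

Capturing data under shared term names in the field is exactly the kind of prior planning that makes the later merge-and-clean stage cheaper.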
Then it was time for some hands-on collecting and animal sound recording. Andy and Grant set up two collecting experiences to illustrate the need for prior planning. We learned about the challenges of keeping track of specimen identifiers, how to be sure we know which insect was found on which plant once back in the lab, why we need to be careful when using abbreviations, and that writing a good locality description is vital (a georeference is not enough). (See Andy's Sample Field Data Collection sheet and sample field labels.) Andy, we look forward to hearing about your upcoming field course at the University of Kansas. Let us know how it goes.
Using a shotgun microphone and a recorder with a headset, Mike gave us some hands-on experience capturing the sounds of nature. Have you done this? It’s amazing, and quite challenging, to then capture the particular specimen one has been listening to. When trying some of the field apps, we also noticed a lot of variability in the georeferences our phone GPS apps returned. What’s your experience? Do you have a favorite GPS app? Have you compared it to a GPS unit?
Upon return to the iDigBio classroom space, we discovered what it’s like to plan for and collect paleontological specimens from Justin Wood's presentation and video. And for marine invertebrates, Francois Michonneau (Zoologist and iDigBio Post Doc) illustrated issues with collecting data and specimens in a marine setting. I think everyone wanted to study marine invertebrates after we saw Francois’ video and heard his talk Efficient workflow from collection to cataloging for marine invertebrates.
Common themes about planning for field data collection, and for subsequent data research and management, emerged from the lectures, field experiences, and videos. We included coverage of Symbiota, Specify, Biocode’s Field Information Management System (FIMS), Arthropod Easy Capture (AEC), Silver Biology, and Arctos. Our summary group discussion helped to reveal themes such as:
- The use of standards such as Darwin Core and Audubon Core to support reproducible research
- Data Validation – the importance of planning for and creating tidy, standardized data
- Specimen Identifiers – we need to use them, store and share them
- Online resources – available to enhance one’s data, given the skills to use them
- Publishing – getting the data out there is important
- Planning ahead - for what data to collect, and how to collect and document it
See the custom videos made by community remote participants just for this workshop. Thank you Ed Gilbert (Symbiota), Andy Bentley (Specify), John Deck (FIMS), Amy Smith (KML files), Katja Seltmann (AEC), and Shelley James (Bishop Museum). Using remote participation and their recordings, we were able to cover even more software, methods, tools, and ideas for capturing specimen collection data than would otherwise have fit in four days.
After covering why it’s important to plan ahead for what data to collect, and how we might do that, we switched to hands-on skills that can make collecting, standardizing, and sharing data easier. These skills support best practices for reproducible research over a lifetime. Whether you’re a collection manager or a collector, a botanist or a zoologist, these skills can make your data easier to collect, to keep track of, to query for your research questions, to disseminate, to discover, and to cite! Most of our participants were collectors; a few were collection managers who also collect or work closely with collectors.
So, from day two through day four, our course emphasis shifted to using the scripting language R (and RStudio) for data cleaning, standardization, enhancement, and visualization. Francois enticed us with Intro to R. Derek Masaki (Developer, USGS-BISON) gave us a rationale and a workflow using R that supports reproducible research (see participant Rick Levy’s blog post). We needed to learn how to clean, standardize, and transform our data, so Derek put together a hands-on R tutorial using a bee dataset from the Smithsonian. Now that we had learned a bit about R vectors, dataframes, and functions, we were ready on day four to learn about Application Programming Interfaces, affectionately known as APIs. Thanks Matt Collins (iDigBio Systems Administrator) for a fun, interactive introduction to the power of APIs and Using APIs in R.
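The cleaning steps the tutorial covered were of the classic tidy-data kind: trim stray whitespace, normalize name capitalization, deduplicate. The course did this in R on the bee dataset; here is a hedged Python sketch of the same idea, using invented rows rather than the actual Smithsonian data:

```python
# Invented raw rows with the kinds of inconsistencies legacy
# specimen data actually has: stray spaces and mixed capitalization.
raw_rows = [
    {"scientificName": "  bombus impatiens ", "eventDate": "2015-03-10"},
    {"scientificName": "Bombus impatiens",    "eventDate": "2015-03-10"},
    {"scientificName": "APIS MELLIFERA",      "eventDate": "2015-03-11"},
]

def clean(row):
    """Collapse whitespace and apply Genus species capitalization."""
    name = " ".join(row["scientificName"].split())
    genus, *rest = name.split(" ")
    name = " ".join([genus.capitalize()] + [w.lower() for w in rest])
    return {"scientificName": name, "eventDate": row["eventDate"]}

cleaned = [clean(r) for r in raw_rows]
unique_names = sorted({r["scientificName"] for r in cleaned})
print(unique_names)  # ['Apis mellifera', 'Bombus impatiens']
```

Before cleaning, the three rows look like three taxa; after standardization they resolve to two, which is exactly the taxon-name reconciliation problem from day one in miniature.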
We had a little extra time, and Francois jumped in to give a brief overview of two topics we don’t usually have time for in beginner courses: GitHub (versioning) and Rmarkdown. See course participant Rick Levy’s blog post to learn more!
To complete the data life-cycle picture, Molly Phillips (iDigBio Information Specialist) stepped in to give us an overview of how collection data gets to iDigBio in her talk Getting your data out there: publishing & standards with iDigBio and Todd Vision from Data Dryad joined us remotely with an in-depth talk about Publishing data on Dryad.
What is compelling from every part of this workshop is that with these 21st century skills, a scientist can do more research, faster, and in a manner that supports reproducibility and collaboration. Scientists recognize they need these skills and are asking for them.
Links to find out more:
Related blog post by Rick Levy: Rmarkdown + GitHub = Reproducible Research
Relevant links: please add yours!
NOTE: The 4th short course in the biodiversity informatics series, Managing NHC Data for Global Discoverability, is coming up in September. By the time this is posted there may be only a few spots left, so don't wait to apply.