Wet Collections Digitization Workshop Report

Mon, 2013-03-11 20:13 -- gnelson

 

When iDigBio announced an upcoming wet collections digitization workshop to be held in cooperation with the Biodiversity Institute at the University of Kansas, we had no idea what to expect. As it turns out, we must have hit a nerve. By the close of the application period, well over 50 people had responded, far exceeding our expectations. Delighted with the overwhelming response, we began the difficult task of making selections from an extensive list of applicants. We wished we could invite them all. After expanding the budget and working with Andy Bentley at KU to ensure we had enough space, we ended up with nearly 50 participants representing 32 institutions for a full two days of activities.

Our group was welcomed to the University of Kansas Biodiversity Institute by director Kris Krishtalka at a Monday evening reception in the museum’s main hall, surrounded by the well-known panorama of the earth's ecosystems. Kris reinforced the importance of specimen digitization as well as the urgency with which it must proceed. His recognition of our work served as an inspiring send-off. Following Kris' comments, iDigBio's Gil Nelson, workshop coordinator, introduced and recognized workshop planning team members Andy Bentley (KU), Rob Robins (FLMNH), and Nelson Rios (Tulane) for their outstanding contributions and the numerous hours each devoted to bringing the workshop to fruition.

The first full day (see the full agenda at https://www.idigbio.org/wiki/images/e/ee/Agenda_WetCollections.pdf) began promptly at 8:15 a.m. with an introduction to iDigBio, followed by Gil Nelson's summary of the meta-decisions to be made and continually re-assessed for ensuring an effective digitization program. This overview was followed by excellent presentations from Christine Johnson of the American Museum of Natural History, Sally Bjork on the University of Michigan’s interdisciplinary exchange initiative across literature, sciences, and the arts, and Sandy Brantley’s (University of New Mexico) introduction to SCAN (Southwest Collections of Arthropods Network), one of the newest Thematic Collections Networks (TCN). Together, these presentations highlighted important issues and challenges for beginning and sustaining a digitization program. They also encouraged attendees to develop collaborations with other museum or institutional resources, including particular admonitions for forging partnerships with colleagues in the information and library sciences. Though the experience level of workshop participants ran the gamut, from those new to digitization to those well-experienced, the common thread that permeated casual conversation underscored the reality that none of us has yet arrived at the perfect digitization implementation. There is still much to learn from each other, and workshops like the one at KU can serve as important catalysts for fostering collaboration and improvement.

A pre-workshop survey distributed several weeks prior to the workshop indicated strong interests in imaging techniques for wet collections, the elements of digitization workflow design, and tools for data capture, management, and enrichment. The survey also indicated a preference for the inclusion of hands-on demonstrations among workshop activities.

Following this lead, Chris Taylor, co-PI on the InvertNet TCN, demonstrated by video the new InvertNet robot, originally conceived and designed for capturing an entire drawer of pinned insects in a single high-resolution image. Acknowledging the robot's original purpose, Chris suggested ways its functionality could be harnessed for fast, efficient, and high-quality capture of specimens in wet collections. Chris' proposed expansion of InvertNet technology is reminiscent of the numerous ways collections managers in various disciplines are co-opting successful strategies from each other. Laura Halverson Monahan (University of Wisconsin) outlined workflows for scanning and processing 35mm transparencies, and Mark Sabaj Pérez of the Academy of Natural Sciences at Philadelphia offered an entertaining perspective on capturing high-quality field and lab images of fishes. Kyle Luckenbill, Mark's colleague at the Academy, highlighted tools and strategies for image processing, improvement, and disposition. The contrasts between these presentations led to spirited discussion underscoring the importance of establishing a clearly defined purpose for specimen images well before launching an imaging program. Are images primarily for distribution on the web? Archiving? Publication? Is high quality more important than high quantity? How much image enhancement is acceptable? Is it better to image everything or to only image exemplar specimens? Is it necessary to capture specimen images at all? Are there robotic or other rapid throughput imaging technologies suitable to the needs of wet collections?

During the afternoon of Day 1, workshop participants were treated to a series of demonstrations and tours (see https://www.idigbio.org/wiki/images/2/26/IDigBio_wet_collection_workshop_demonstration_stations.pdf) at the Biodiversity Institute that covered workflows, imaging equipment, label and specimen imaging, image processing tools, georeferencing tools, and Specify database strategies, and included tours of the wet collections at KU.

The first day ended with presentations on workflow design by Sally Bjork and Gil Nelson, and the second day began with practical presentations of specific workflow implementations. Adam Cohen (University of Texas) outlined workflows pioneered and refined by the Fishes of Texas initiative, followed by Andrew Short (KU) on the processes of digitizing field books and collecting event data. Andrew emphasized the importance of recording locality and collecting-event data even when no collections are made, underscoring the value of negative data. Brian Sidlauskas rounded out the workflow topics with an overview of the recently rejuvenated Oregon State University ichthyology collection, his recent migration to Specify 6, and a detailed overview of OSU's workflows for entering data from collections cards. Brian's documentation reinforced the necessity of producing detailed written protocols to ensure technician compliance with well-thought-out digitization workflows, which turned out to be a major theme among presenters.

Deb Paul (iDigBio) presented a wide-ranging talk that began with the importance of, and strategies for, creating globally unique identifiers for specimens and continued with the Darwin Core Standard, collaboration with iDigBio and TCNs, ways to share data through data aggregators, guidelines for selecting a collections database, and methods for assessing an institution's digitization maturity. Deb's comments, especially her advice about unique identifiers, generated a lively discussion about the importance of identifiers, how they should be assigned and by whom, how they should be tracked in an increasingly cloud-based environment, and suggestions for best practices. The discussion raised many questions and produced lots of opinions, but yielded little consensus. Deb and Gil pointed out that iDigBio is working diligently on this issue, with an intent to provide leadership to the biological collections community.
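For readers new to the idea, here is one way a globally unique identifier might be minted and attached to a specimen record. This is only a minimal sketch, not the approach recommended at the workshop (which reached no consensus). It assumes random version 4 UUIDs are an acceptable identifier scheme and uses the Darwin Core term names occurrenceID, institutionCode, collectionCode, and catalogNumber. The institution and collection codes below are made-up placeholders.

```python
import uuid


def new_occurrence_id() -> str:
    """Return a random (version 4) UUID written as a URN string."""
    return uuid.uuid4().urn


def make_record(institution_code: str, collection_code: str, catalog_number: str) -> dict:
    """Build a small record keyed by Darwin Core term names.

    The identifier is minted once, at record creation, and should then be
    stored and reused rather than regenerated.
    """
    return {
        "occurrenceID": new_occurrence_id(),
        "institutionCode": institution_code,
        "collectionCode": collection_code,
        "catalogNumber": catalog_number,
    }


# Hypothetical codes and catalog numbers, for illustration only.
records = [make_record("EXAMPLE", "Fishes", str(n)) for n in (1001, 1002, 1003)]

for record in records:
    print(record["catalogNumber"], record["occurrenceID"])
```

The catalog number stays human-readable and local to the collection, while the occurrenceID carries no meaning of its own and so can remain stable if the specimen is renumbered or moved.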

Deb continued with a well-received presentation about OpenRefine, a software tool pioneered by Freebase, acquired by Google, and now released as open source. OpenRefine offers sophisticated tools for cleaning and enhancing data within existing spreadsheets and other documents (e.g., XML, RDF). Participants recounted numerous frustrations with using spreadsheets for data management and cleaning, and were demonstrably appreciative to learn about OpenRefine. Judging from comments made during and following the workshop, OpenRefine ranked among the workshop's most appreciated technologies.
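To give a flavor of the kind of cleanup involved, the sketch below groups spelling variants of the same value by reducing each to a normalized key, in the spirit of the key-collision clustering that OpenRefine offers. It is a simplified illustration written for this report, not OpenRefine's own code, and the function names and sample locality strings are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key so near-duplicates collide."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim, lowercase, and remove ASCII punctuation.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, and sort the tokens.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Return groups of distinct values that share a fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented locality strings, as they might appear in a spreadsheet column.
localities = [
    "Lawrence, Kansas",
    "kansas  lawrence",
    "Lawrence Kansas ",
    "Topeka, Kansas",
]

clusters = cluster(localities)
for group in clusters:
    print(group)
```

A person would still review each group before merging, since values that share a key are only candidates for being the same thing.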

Workshop participants recognized that migrating specimen data from simple spreadsheets to enterprise-level relational databases is a key factor in improving data management and ensuring compatibility with data aggregators and repositories. Since many participants are using, migrating to, or adopting Specify 6.0 as their collections management software, Andy Bentley fashioned a well-rounded presentation that highlighted existing and projected advances in the software, including methods for moving data into Specify from other sources. About 435 institutions have adopted Specify as their collections management database, placing the software among the most-used collections management systems. Perhaps less well known among wet collections managers is Ed Gilbert's Symbiota portal software, which is the adopted platform for numerous participants in the SCAN project. Ed's overview of the portal software, its integrated georeferencing and annotation services, and its ability to check for duplicate collections and collecting event data across a network complemented Andy's Specify presentation and offered new insights into integrating the network-based Specify with a cloud-based Symbiota portal. Ed also included an introduction to FilteredPush, software developed largely at Harvard. FilteredPush is now integrated with both Specify and Symbiota and is designed to track and filter remote virtual annotations, return them to the providing collection, and push accepted annotations back across the network to all data sources holding a copy of the record.

Both Andy and Ed also demonstrated tools for integrating Geolocate into data entry workflows and providing methods for generating geographic coordinate data for specimen records. Many in the vertebrate collections community are familiar with georeferencing through their participation with VertNet, FishNet, Manis, or Ornis, and most have some familiarity with Geolocate, the online georeferencing software built and hosted at Tulane University. Nelson Rios, lead developer of Geolocate, followed Andy and Ed with in-depth techniques for using Geolocate as a tool for individual and collaborative georeferencing. Nelson highlighted a host of improvements and enhancements continually being added to Geolocate's functionality. He demonstrated a typical georeferencing workflow, outlined differences between web and desktop georeferencing, offered strategies for measuring uncertainty, and explained how to incorporate Geolocate's API web services into georeferencing activities.

Pre-digitization curation, beginning at the field level, remains a favored activity among most collectors and collections managers. Rob Robins outlined the workflows for fishes, herps, and marine invertebrates that are practiced at the Florida Museum of Natural History, University of Florida, with emphasis on georeferencing, preservation, and specimen imaging. He pointed out that curation is not always a pre-digitization activity and sometimes happens in concert with rather than before digitization. Rob offered numbered, step-by-step workflows and concluded that fluid collections of fishes, herps, and invertebrates follow similar methods but markedly different processes, that specimens in these groups are in the best condition for photography prior to preservation (though the conditions for pre-preservation photography are not always ideal), and that digitization in the field often comes with tradeoffs.

The workshop wrapped up with preparation-specific discussion groups, each considering three primary topics: new discoveries or revelations gleaned from the workshop, reinforcements for existing practices, and questions left unanswered. The full texts of answers to these questions are (or in some cases will be) available on the workshop wiki. In summary, new revelations included OpenRefine, the impending availability of Specify for iPad, the potential use of robotic cameras and photo arrays for specimen imaging, the potential uses of squeeze tanks for invertebrate imaging, the increasing interoperability among digitization tools, the importance of event-based cataloguing, and the recognition that workflows must be tailored to the special challenges specific to each institution. The workshop reinforced the need for clear and consistent documentation and the use of metadata, the importance of intra-institutional collaboration, and the recognition that sometimes low-tech is just as good as high-tech. Questions yet to be answered included how to facilitate and use crowd-sourcing and citizen science, how to pay for digitization, what NSF really wants from digitization, how to overcome "institutional malaise," which imaging strategies suit vial and slide-mounted specimens (especially invertebrates), how to handle bulk and/or desiccated specimens effectively, how to fit additional work into an already overloaded schedule, how to leverage accreditation standards to encourage digitization practices, and, most importantly, where to start.

For those unable to attend this workshop, all workshop presentations and notes are available on the workshop wiki at https://www.idigbio.org/wiki/index.php/Wet_Collections_Digitization_Workshop. We used two collaborative notes documents, one for each day. These documents will remain active and available for review, and they contain a complete set of notes and discussions from on-site and remote participants. All presentations were recorded via AdobeConnect and videoed by staff at KU. All of the recordings and videos can be found at the iDigBio Vimeo page. A selection of still images from the workshop is also available. You are encouraged to visit and contribute to this wiki and its parent wiki https://www.idigbio.org/wiki/index.php/Digitization_Resources and to participate in the growing collection of community-contributed digitization workflows, protocols, equipment specifications, papers, reviews, and opinions.