iDigBio at BIS-TDWG 2014: some digitization, crowd-sourcing, and data use too

by Deb Paul, with Libby Ellwood and Greg Riccardi as Eds.

Four iDigBio staff members, Greg Riccardi, Pam Soltis, Libby Ellwood, and Deborah Paul (that’s me) headed to Jönköping, Sweden this year for the biodiversity informatics standards meeting of the year, BIS (TDWG) 2014, October 27 - 31. The timely conference theme: Applications and Data Standards for Sustaining Biodiversity suited iDigBio well.

Data Use.  Pam attended both the iDigBio Summit and headed to TDWG 2014 later for her presentation: Linking Diverse Data in Studies of Plant Evolution: Case Studies in Progress, in the Phylogenetics Symposium: Biodiversity data synthesis and discovery at a Tree of Life scale convened by Nico Cellinese and Hilmar Lapp. See more talks in their symposium to get a glimpse of how big data and software are making novel biodiversity research possible, and insights into what's needed to support this work into the future.

Collaboration and Identifiers.  Greg and I headed over to Stockholm for two ancillary, pre-TDWG 2014 meetings: 1) the Distributed Open-Source Development of Collection Management Systems (DINA Consortium) meeting, and 2) a Persistent Identifier Summit convened by Nico Cellinese, John Deck, and Rob Guralnick to foster the development of stakeholder-derived "community guidelines for the adoption of persistent identifiers relevant to biodiversity informatics." Stay tuned for more about this initiative. Jump over to the Dina Project to learn more about their initiative to build “an open-source Web-based information management system for natural history data.”
TDWG 2014 plenary speaker, Arturo H. Ariño, presented Shades of Grey: (Yet) Another Role for BIS-TDWG, a compelling story about finding out that you don't necessarily have what you think you have (in your collection), and what you learn about dark and grey data along the journey to digitize specimen and related data. Will we capture the data before it's not digitizable anymore? Arturo compels us to work together and sees a key role for TDWG in this effort.
Citizen Science and Crowd Sourcing. Libby moderated this session and presented a paper: Building the historical biodiversity baseline: Emerging iDigBio resources for onsite and online public participation in the digitization of biodiversity research specimens. This gave Libby a chance to talk about BIOSPEX (click the link to learn more), imaging and transcription blitzes and more - working toward a world-wide community engaged in crowd-sourced digitization facilitating biodiversity research and public engagement in science.
Digitization.  Elspeth Haston (RBGE), Deb Paul (iDigBio), and Vince Smith (NHM) collaborated to convene a TDWG symposium: Access to Digtisation Tools and Methods. Our talks focused on software and strategies open to everyone (open source, open access). We are always looking for ways to encourage international collaboration efforts for digitization. While originally planned with the Nairobi, Kenya TDWG meeting location in mind, talks were timely in Jönköping too. If you could not attend, or would like to hear a talk again or point one out to your colleagues, you can find out more about all the talks, and access the recordings and presentations on the Symposium Wiki. We were thrilled to see the momentum and collaborative development of so many strategies for capturing data and images faster, and more accurately. Projects highlighted not only use of the data, and data visualization, but also important key issues about public participation in our efforts, and the challenges of storage of the digitized materials, the national resources, we are creating.
Sentiment expressed by a digitisation symposium participant: At the national level, countries need to recognize that collections specimens, and collections data are national resources with intrinsic value deserving of sustainable support for the benefit of the nation and its citizens.
Highlights from our digitization symposium.  Do you need easy access to high-resolution images of insect drawers? If you'd like to find out more, check out Falko Glöckler's talk: The Open Drawer Project - Providing free access to high resolution images of entomological collection drawers. Or how about, automated image processing and optical character recognition in a state-of-the-art leading edge digitization project at the Botanischer Garten und Botanisches Museum Berlin-Dahlemerlin Botanic Garden (BGBM). It was fantastic to see sophisticated image analysis algorithms being put to work in the new StanDAP-Herb system. We first saw the potential of this at iDigBio two years ago when Karl-Heinz Steinke presented the Herbar-Digital’s work on automated image analysis methods for recognizing features in images. A large collaborative is developing a suite of tools that will be available to all of us to mine data being compiled. Read about it in the presentation: Enriching the legacy literature with OCR corrections and text-mined semantic metadata. And, Digitarium is putting OCR output to good use when digitizing to link entomological specimen labels with field notebook data
And speaking of Digitarium, they've been busy working on how to speed up transcription and improve the resulting data quality. Discover more about their services and DigiWeb, their new dynamic web-based environment for involving all transcription participants in the management and cleaning of transcribed data for best data quality. At the same time, they've been experimenting with a conveyor belt ("imaging line") strategy to capture insect (beetles, in case you were wondering) specimen images and data. Want to know their results? (Spoiler alert: 5 to 10 times faster than other current practices!) Check out these brand new papers for details:

High-performance digitization of natural history collections: Automated imaging lines for herbarium and insect specimens. Riitta Tegelberg, Tero Mononen, and Hannu Saarenmaa. Taxon 5 Dec 2014. NOTE: this is a preliminary hot-off-the press version that Taxon states: "will no longer be available online once replaced by the final version." So, depending on when you read this, you might have to look up the paper. DOI

DigiWeb - a workflow environment for quality assurance of transcription in digitization of natural history collections. Tero Mononen, Riitta Tegelberg, Mira Sääskilahti, Markku A. Huttunen, Marko Tähtinen, Hannu Saarenmaa. Biodiversity Infomatics Vol 9, 2014, pp. 18-29
Our experience in streamlining transcription is that much can be gained by using shared lookup services and big pools of data that are already available...Transcription in isolation is a waste of time, and thus more web services are needed.
From Vince, we learned more about Moving beyond the box: automating the digitisation of insect collections and the potential for image segmentation software + globally unique identifiers to revolutionize insect-drawer digitization. We got a reality check from Sharon Grant at the Field Museum when it comes to space (digital space) that is, and what to do when you are running out and there’s no more. Thanks for the tips Sharon - information we can use - right now! Are you wondering how we can tie the work of researchers together with the specimens, data, images, and gaming to engage the public in science? Jiri Frank enticed us with his presentation about purposeful gaming titled: Case study of reuse of digitised content by creative industry in games: Europeana Creative. Two games coming soon – including research – how it’s done – incorporated into the game. Jiri, we can’t wait to play! Jiri and his colleague also share the observation that industry is ready and waiting - and that developing these games is not necessarily an expensive proposition as we might suppose.
If you'd like to continue the digitization conversation, you can find Elspeth, Vince, and myself on twitter @emhaston @vsmithuk @idbdeb
Snippets.  More talks covering more topics - you can find out more on the wiki about:

Capturing Inventory level information about collections as a step in object to image to data workflows.

ZooSphere - Development of a software for automated spheric image capturing and interactive 3D visualization of biological collection objects.

Data Discovery and Doer Happiness: (Many) Uses for Optical Character Recognition (OCR) Output.

Managing Digitization Projects with BIOSPEX. Find out how you, a collection or data manager, might easily manage crowd-sourcing projects in the near future.

ENVIRONMENTS-EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life.

Some other sessions and workshops at TDWG included Biodiversity Data Quality: issues, methods, and tools, and two meetings that both discussed Biodiversity Informatics Skills needed in the research community. One was a workshop: Effective Biodiversity Data Management Training that gave Deb a chance to talk about Data Carpentry, and the second was a session focused on the formation of a TDWG Biodiversity Informatics (BI) Training interest group. With big data, the skills needed to manage so much data, so many different data types, across many platforms, also change. How do we effectively figure out what to teach, at what level to teach it, and how to impart the importance of remaining agile and open to learning new skills? Lots of groups across the planet are asking these questions now, and developing methods to address data and computational literacy gaps.

Where's next year’s TDWG? – Nairobi, Kenya. Organizers plan a week-long biodiversity informatics training before TDWG 2015. Possible dates being considered are for late September -- but we'll have to wait and see -- they are not set yet. Of course, lots more went on at BIS (TDWG) 2014! RDF, Bioinformatics, Cyberinfrastructure, Scientific Workflows, Annotations, Darwin Core tutorials and discussions (new terms coming soon, and one or two going away),... You can check out the complete program, and see the talks for yourself.


See more photos from the event at the TDWG 2014 Facebook photo album.