Six iConference2013 Talks


Introducing iDigBio and the Augmenting OCR Working Group

Deborah Paul
Highlights of this introduction include: What is iDigBio? Who is the AOCR wg? Why are we here at iConference2013? Where did we come from and what makes us unique? Where is our data coming from? How can the iSchools community get involved? The AOCR wg is working to find ways to speed up digitization of natural history museum specimen data and improve access to it, and we think the Information Science community can help. Five talks follow, each explaining a key part of our story. Note that the iDigBio* Augmenting Optical Character Recognition Working Group (AOCR wg) put together four submissions for the iSchools iConference2013: this workshop, a paper, a poster, and an alternative event. All of these are concurrent with a Hackathon at the Botanical Research Institute of Texas (BRIT) as part of a strategic outreach effort. Interested parties are encouraged to participate in our Hackathon, join an existing iDigBio working group, propose and host a workshop, and contribute to our forums and online materials.
* Integrated Digitized Biocollections (iDigBio) is a National Science Foundation (NSF) project funded under the Advancing Digitization of Biological Collections (ADBC) program. Thematic Collection (Museum) Networks (TCNs) are NSF-funded to digitize specimen data needed to answer grand challenge questions and provide that data to iDigBio. iDigBio is building a cyberinfrastructure to integrate data from museums across the USA, making it accessible to everyone.

Digitization of biocollections -- a grand challenge in scope, scale, and significance

Amanda Neill
The world’s natural history museums, herbaria, and other specialized collections documenting our planet’s biodiversity, collectively known as ‘biocollections,’ are estimated to contain between 2 billion and 3 billion specimens. These biocollections are truly global in scope, in terms of the taxa, regions, and periods of time represented by their holdings. Hundreds of years of expense and effort contributed to their collection, preservation, and study. These specimens hold clues that help us understand evolution, species distributions, the introduction of pests and diseases, and the movement of invasive species, and they may help us predict the effects of extinction events and climate change. By creating and sharing a digital record of a specimen object, we can increase its discoverability and use in research. Yet despite technological advances and supporting grant initiatives, digitization workflow bottlenecks continue to impede the flow of data needed for Big Science, and possibly for the future of humanity. Advancements in automation of optical character recognition, natural language processing, and interfaces to support these are necessary for a transformative breakthrough in high-throughput digitization of biocollections.

The Apiary Project -- a workflow for text extraction and parsing for herbarium specimens

Jason Best
The Apiary Project combines OCR technology, OCR output from herbarium specimen images or other images containing museum specimen data, well-developed regular expressions for parsing that output into Darwin Core fields, and humans in the digitization loop, all behind an elegant, sophisticated user interface. The workflow is designed to maximize the value of the human interaction, minimize steps, and speed data throughput.
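As a rough illustration of parsing OCR output into Darwin Core fields with regular expressions, the sketch below applies a few patterns to a label's text and reports which fields still need a person's attention. The patterns, the sample label, and the function names `parse_label` and `needs_review` are assumptions made for this example and are not taken from the Apiary Project; only the Darwin Core term names (`recordNumber`, `eventDate`, `recordedBy`) come from the standard.

```python
import re

# Hypothetical patterns for illustration only; real label formats vary widely.
PATTERNS = {
    "recordNumber": re.compile(r"\bNo\.?\s*(\d+)", re.IGNORECASE),
    "eventDate": re.compile(r"\b(\d{1,2}\s+[A-Z][a-z]{2,8}\.?\s+\d{4})\b"),
    "recordedBy": re.compile(r"\bColl(?:ector|\.)?:?\s*([A-Z][\w.\- ]+)"),
}


def parse_label(ocr_text):
    """Return a dict of the Darwin Core terms whose pattern matched the text."""
    record = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            record[term] = match.group(1).strip()
    return record


def needs_review(record):
    """List the terms no pattern matched, so a person can fill them in."""
    return [term for term in PATTERNS if term not in record]


# Made-up label text standing in for OCR output.
sample = "Quercus alba L.\nColl. J. Smith\nNo. 1234\n12 May 1932"
record = parse_label(sample)
print(record)
print(needs_review(record))
```

Reporting the unmatched terms separately is one simple way to keep a human in the loop: the interface only has to ask about the fields the patterns could not fill.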

Symbiota -- Creating an OCR and NLP enabled user interface and workflow to efficiently digitize 2.3 million lichen and bryophyte specimens

Edward Gilbert
This talk focuses on how the Lichen and Bryophyte Climate Change TCN (LBCC) project plans to make use of Optical Character Recognition (OCR) and Natural Language Processing (NLP) to help process over 2.3 million North American lichen and bryophyte specimens held within over 60 U.S. herbaria. A strong component of the LBCC image processing workflow is a pair of separate automated OCR and NLP processing steps intended to capture a significant portion of the metadata found on the specimen labels. After specimen images are converted to text fragments using the open source Tesseract OCR engine, previously developed and custom-designed NLP algorithms will be used to parse the results into the appropriate Darwin Core fields. The resulting data will then be imported into a central database and made public through a unified user interface. The final steps will include transferring the results back to the management database systems used by each collection.
The presentation will begin with a brief overview of the LBCC project along with a simple workflow diagram. The OCR/NLP components of the workflow will be emphasized in greater detail. The Symbiota user interface will be displayed with a short explanation of how it will be used to digitize the label metadata. Current limitations and challenges with OCR and NLP processing will be explained as they pertain to this project. Several solutions to these issues that are being explored by the project will be explained in detail along with their preliminary results.
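The stages named in this abstract (OCR, parsing into Darwin Core fields, import into a central database) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the LBCC or Symbiota implementation: the OCR step is passed in as a function so the example runs without Tesseract, the parsing rules are deliberately naive, and the `occurrence` table layout and all function names are hypothetical.

```python
import sqlite3


def parse_step(ocr_text):
    """Naive stand-in for NLP parsing: first line is the name, 'Coll.' marks the collector."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    record = {"scientificName": None, "recordedBy": None}
    if lines:
        record["scientificName"] = lines[0]
    for ln in lines[1:]:
        if ln.lower().startswith("coll."):
            record["recordedBy"] = ln[5:].strip()
    return record


def store_step(conn, image_id, record):
    """Insert one parsed record into the (hypothetical) central table."""
    conn.execute(
        "INSERT INTO occurrence (image_id, scientificName, recordedBy) "
        "VALUES (?, ?, ?)",
        (image_id, record["scientificName"], record["recordedBy"]),
    )


def run_pipeline(conn, image_ids, ocr):
    """Run OCR, parsing, and storage for each image; `ocr` maps an image id to text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS occurrence ("
        "image_id TEXT PRIMARY KEY, scientificName TEXT, recordedBy TEXT)"
    )
    for image_id in image_ids:
        store_step(conn, image_id, parse_step(ocr(image_id)))
    conn.commit()


def fake_ocr(image_id):
    """Canned text so the sketch runs without an OCR engine (assumption)."""
    return "Cladonia rangiferina\nColl. A. Collector\n"


conn = sqlite3.connect(":memory:")
run_pipeline(conn, ["img-001"], fake_ocr)
print(conn.execute("SELECT * FROM occurrence").fetchall())
```

Keeping the OCR step behind a plain function argument means a real engine could be swapped in later without touching the parsing or storage code.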

HERBIS/LABELX -- Machine Learning Approach to Parsing OCR Text

Bryan Heidorn
A central goal of the digitization of museum specimen labels is to parse the content of the labels into standard fields such as those identified in the Darwin Core (DwC). Museum specimen labels number in the billions and may be hundreds of years old or created recently. They have some minimal layout consistency within subsets of labels but overall have a very high layout variability. The text generated via OCR from these labels tends to have a high rate of errors. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms to classify the sub-elements of the labels into the DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models and N-Gramming can be combined with human supervision and authority files to successfully parse OCR output into proper fields and correct some OCR errors based on context.
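To make the statistical approach more concrete, here is a minimal sketch of one of the methods named above, Naive Bayes, used to assign a line of label text to a Darwin Core field. It is not the HERBIS/LABELX code: the training lines, the feature scheme (lowercased word tokens, with digit runs replaced by a length token), and the class and function names are all assumptions chosen for illustration. `locality`, `eventDate`, and `recordedBy` are Darwin Core terms.

```python
import math
import re
from collections import Counter, defaultdict


def features(line):
    """Tokenize a label line, replacing each digit run with a shape token."""
    tokens = re.findall(r"[A-Za-z]+|\d+", line.lower())
    return ["<num%d>" % len(t) if t.isdigit() else t for t in tokens]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for line, label in examples:
            self.class_counts[label] += 1
            for tok in features(line):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, line):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in features(line):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labeled training lines (the human supervision step).
TRAINING = [
    ("Coll. J. Smith", "recordedBy"),
    ("collected by A. Jones", "recordedBy"),
    ("12 May 1932", "eventDate"),
    ("3 June 1901", "eventDate"),
    ("Travis County, Texas", "locality"),
    ("near Austin, Texas", "locality"),
]

model = NaiveBayes()
model.train(TRAINING)
print(model.predict("Hays County, Texas"))
```

Because the classifier scores tokens rather than matching a fixed layout, it can still make a guess when a label's layout is unfamiliar; a realistic system would need far more training data and would likely combine this with the other methods and authority files mentioned above.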

Linking Data -- Biodiversity Heritage Library -- supporting knowledge discovery from digitized content

John Mignault
The Biodiversity Heritage Library (BHL), a global consortium of natural history and botanical libraries, runs an ongoing project to digitize the legacy literature in its members' collections for open access. In partnership with the Internet Archive and through its own portal, BHL has made 40 million pages openly available to the global research community. The rapid growth of the text corpus has led to challenges in identifying and extracting semantic information from it, many of them similar to the challenges faced in OCR workflow and extraction from specimen labels. We will discuss the possible improvements in knowledge extraction that could result from improvements in OCR workflow and accuracy, as well as the implications for more intelligent data integration in biodiversity informatics.
