Update from the iDigBio Augmenting OCR working group


From Deb and the iDigBio Augmenting OCR working group (aOCR wg).

Over the past 16 weeks, the aOCR wg has successfully orchestrated multiple initiatives intended to address some key issues on the working group's Wish List. These include Education and Outreach (broadly) as well as plans to establish a baseline of “what’s possible” with parsing of OCR, to develop a strategy for sharing parsing algorithms, to broadcast what we learn and our plans for the future, to publish our hackathon results and to publish widely to connect with communities beyond our own.

Why the name Augmenting OCR?

First, key members of the working group agree that OCR technology itself is well developed and not likely to improve significantly in the upcoming years. Thus, we are not focusing on improving OCR itself; instead, we will work on improving the OCR output and the analysis of that output, and then disseminate what we develop and learn.

The group recognizes that there is much room for improvement in methods to analyze and use OCR output effectively, including the need for better Machine Learning (ML) and Natural Language Processing (NLP) algorithms. Established ML/NLP tools need to be integrated into existing user interfaces for use by everyone: data transcription has been shown to be 30% faster with these tools in place. Also, web services need to be developed to offer the community a central place to access these tools. Additionally, we need to get information out to our community about what is possible with OCR output: we have begun to do this on the iDigBio Wiki. Lastly, we know that involving subject-matter experts from areas such as computer vision, image processing and information science will help us speed up digitization: humans-in-the-loop are going to be key to these strategies.

Starting with a working group meeting in October 2012, the aOCR wg began planning a hackathon, to be held February 13-14, 2013 in Fort Worth, Texas, at the Botanical Research Institute of Texas (BRIT). We also put together a "BioBlitz," a series of submissions to the iSchools iConference 2013, which was held concurrently (February 12-15) in Fort Worth.

The goal of the Hackathon was to work on improving parsing algorithms to extract data from OCR output and get it into databases. It was a successful and productive venture, and we are now in the process of putting together the results of that Hackathon for publication. We are planning a white paper that will discuss the Hackathon process as a way to move forward with our key needs and share what we learned in organizing it. You can read more about it on participant Ben W. Brumfield's blog at: iDigBio Augmenting OCR Hackathon or see us at work on Facebook: iDigBio aOCR – BRIT Hackathon.

For the BioBlitz, the aOCR wg put together four submissions for iSchools iConference 2013, where we presented the following items:

· A short paper intended to introduce the Information Science community to the aOCR wg, iDigBio, TCNs, and the natural history collections community's digitization efforts.

· A poster highlighting some key efforts in the aOCR working group.

· A workshop to foster a dialogue between the Information Science community and the Biodiversity Collections groups. The goal of this interaction is to help iDigBio find strategies to digitize faster and more efficiently.

· An alternative event on the final day of iConference 2013 to tell the iConference group about the outcomes from our first Hackathon.

We are very pleased to report on these initiatives and declare them worthwhile and highly productive! We are looking forward to increasing our resource base, to continued interaction with the broader information science community, and to ongoing discussion and dissemination of our ideas, challenges, progress and solutions!