Augmenting OCR

From iDigBio
Revision as of 18:38, 21 September 2012 by Dpaul

Augmenting OCR Working Group (A-OCR) Overview

This is a community-derived encyclopedia of information about the Augmenting OCR working group. Members of the working group and the wider community are encouraged to work together to develop content and deliverables that serve the broader digitization community.

A-OCR Goals

We are focusing the working group's efforts on assembling materials that help the community get more from their OCR strategies. Topics we are gathering material on include:

  • known effective practices for getting the most from any OCR software.
  • known issues that hinder good (useful) OCR output.
  • findings from working with real image data and programmers to improve parsing of OCR output.
  • lists of OCR software currently used by the natural history collections community.

OCR Related Materials

Check out the following pages. We welcome your input!


  • iDigBio Augmenting OCR Workshop - Our working group is meeting in Gainesville, Florida, October 1-2, 2012. We've put together an exciting and challenging meeting agenda.

First, we will develop wiki content for the community on the effective use of OCR and OCR output, including collected examples of what our working group has learned does (and does not) work.

We'll also be learning about the latest in biodiversity informatics and OCR from our working group and invited guests. Our guest from Hannover, Germany, and the Herbar Digital project is Karl-Heinz Steinke. For the last 5 years, Karl-Heinz's group has been working on improving OCR algorithms, both for recognizing handwriting and in general, as part of making the digitization of herbarium specimens more efficient. See Feature recognition for herbarium specimens (Herbar-Digital) to learn more about this project's work.

A hackathon is also on our list. We plan to head to the Botanical Research Institute of Texas (BRIT) in February 2013 to make strides in what OCR, machine learning (ML), and natural language processing (NLP) can do to make our digitization efforts more efficient, both by producing data faster and by producing data that is fit for use. We'll be choosing our hackathon focus and designing the hackathon together with the iDigBio IT staff.

As part of our outreach efforts, our working group is participating in the upcoming iSchools Conference in Fort Worth, Texas, in three ways: submitting a poster, submitting a notes paper, and hosting a half-day workshop to showcase our work and seek out potential collaborators. The iSchools theme is Data-Innovation-Wisdom, which lines up perfectly with the goals of the ADBC, iDigBio, and the TCNs. This conference is concurrent with our hackathon at BRIT.

Please let us know if you need assistance modifying this page: iDigBio Help Desk

Also, if you would like to learn more about wiki syntax: Mediawiki Wikitext Examples