Seeking participants for “iDigBio Augmenting OCR” workshop, October 1-2
Digitizing and databasing biological specimens requires a different process from digitizing and extracting information from written labels and notes. Broadly, the digitization of notes and labels can be described in four main steps:
- create an image
- process the image to text using Optical Character Recognition (OCR) and/or human typists
- break the content of the text into semantically useful fields such as family, scientific name, collector, date collected, location, habitat, or growth habit
- format this information for injection into a database.
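As a rough illustration of the last two steps, the sketch below splits OCR text into named fields and serializes the result for loading into a database. It is only a minimal example under stated assumptions: the sample label text, the "Label: value" line layout, and the names parse_label and to_database_row are all hypothetical, and real specimen labels are rarely this regular.

```python
import json
import re

# Hypothetical OCR output for one specimen label (invented for illustration).
OCR_TEXT = """Family: Asteraceae
Scientific name: Helianthus annuus
Collector: J. Smith
Date collected: 1931-07-14
Location: Alachua County, Florida
Habitat: roadside"""

# Maps label text to field names. The fields follow the examples in the list
# above; the "Label: value" layout itself is an assumption.
FIELD_LABELS = {
    "family": "family",
    "scientific name": "scientific_name",
    "collector": "collector",
    "date collected": "date_collected",
    "location": "location",
    "habitat": "habitat",
    "growth habit": "growth_habit",
}

# One "Label: value" pair per line, with optional surrounding whitespace.
LINE_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_label(text):
    """Step 3: break OCR text into semantically useful fields."""
    record = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not look like "Label: value"
        key = FIELD_LABELS.get(match.group(1).lower())
        if key:
            record[key] = match.group(2)
    return record


def to_database_row(record):
    """Step 4: format the fields as JSON, one possible database input format."""
    return json.dumps(record, sort_keys=True)
```

Calling to_database_row(parse_label(OCR_TEXT)) yields a JSON object with one key per recognized field. Lines that do not match the assumed layout are silently skipped, which is exactly the kind of loss a real tool would need to measure.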
To identify best practices and to develop tools for this process, iDigBio (https://www.idigbio.org/) will hold an “iDigBio Augmenting OCR” workshop October 1-2, 2012, in Gainesville, Florida. The workshop will be followed by a Hack-a-thon for software developers to be held in February 2013.
The objective of the October workshop is to improve OCR (Optical Character Recognition) output and extraction of the content of biological collection specimen labels and notes so that they may be efficiently and accurately inserted into a database for future use.
Participants in the October workshop will identify OCR output products that will be useful for the community, as well as metrics for evaluating how well different automated approaches produce these products. These metrics may include the accuracy of the OCR itself, the accuracy of automated error correction, and the effectiveness of breaking text into meaningful semantic units (e.g., precision, recall and F-Score). Participants will also determine the specific focus of the February Hack-a-thon in order to provide well-defined programmatic goals for software developers.
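To make the metrics mentioned above concrete, here is a minimal sketch of precision, recall and F-Score computed for one label's extracted fields against a hand-checked reference. The function name, the exact-match scoring rule, and the sample records are assumptions made for illustration; the workshop itself is meant to decide which metrics and matching rules are appropriate.

```python
def precision_recall_f1(predicted, expected):
    """Score extracted fields against a hand-checked reference record.

    A field counts as correct only if both its name and its value match
    exactly. This strict rule is an assumption, not a workshop decision.
    """
    predicted_items = set(predicted.items())
    expected_items = set(expected.items())
    true_positives = len(predicted_items & expected_items)

    # Precision: share of extracted fields that are correct.
    precision = true_positives / len(predicted_items) if predicted_items else 0.0
    # Recall: share of reference fields that were recovered.
    recall = true_positives / len(expected_items) if expected_items else 0.0
    # F-Score (F1): harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference record and automated output for the same label.
reference = {
    "family": "Asteraceae",
    "scientific_name": "Helianthus annuus",
    "collector": "J. Smith",
    "habitat": "roadside",
}
extracted = {
    "family": "Asteraceae",
    "scientific_name": "Helianthus annuus",
    "collector": "J. Srnith",  # typical OCR confusion of "m" with "rn"
}

precision, recall, f1 = precision_recall_f1(extracted, reference)
```

Here two of the three extracted fields are correct and two of the four reference fields are recovered, so precision is 2/3, recall is 1/2, and the F-Score is 4/7.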
The October workshop participants will help to identify and collect images representative of those needed by the biology community. This collection of images will serve as the working set for software developers in the February Hack-a-thon.
We are seeking biologists, programmers and others involved in the digitization process to participate in this October workshop, to help plan the February Hack-a-thon, and to participate in the Hack-a-thon itself.
The wish list is available at
The wish list includes ideas and goals for optimizing machine and natural language processing algorithms used on OCR output from specimen labels.
If you are interested in participating and would like to know more, please email as soon as possible to:
The deadline to sign up for the October 1-2 workshop is Thursday, August 30th.
We are looking forward to your participation,
The iDigBio Augmenting OCR Working Group