The 2013 AOCR Challenge
One of the most significant areas of interest for improving the use of OCR output is parsing. Digitization, curation, and dissemination of specimen data from biodiversity museum collections can be sped up if OCR output can be parsed faster and more accurately, then packaged into semantically meaningful units for insertion into a database.
The Specific Task
Given a set of images, either parse the existing OCR output or repeat the OCR with the software of your choice and parse the new output. The goal is to populate as many of the selected Darwin Core (and other) data elements as possible in a CSV file. These participant-generated CSV files will be compared against human hand-parsed gold and silver CSV files.
For each of the three image data sets, 200 images were hand-picked to create a human hand-parsed standard for the metrics. Three different files have been created for each of these selected images.
- Perfect OCR text files
- Hand-transcribed from each image, these text files reproduce exactly what is in the image. They show what the output would look like if the OCR engine recognized all the data in the image, including the handwriting.
- Gold CSV files
- These Gold CSV files have Darwin Core element column headers, with the data parsed into the appropriate columns. The data in these Gold CSV files comes from the hand-transcribed gold (perfect OCR) text files.
- Silver CSV files
- These Silver CSV files have the same Darwin Core element column headers, with the data parsed into the appropriate columns. Here, however, the data comes from the OCR output "as is": the same data from the same images, including any OCR errors, is captured in each Silver CSV file.
- Parsers should produce, at a minimum, CSV output whose column headers are Darwin Core (http://rs.tdwg.org/dwc/terms/) elements, with some extended element names.
- The full set of valid categories is defined in a definition document in the parsing directory of the A-OCR virtual machine.
- All of the information on the label needs to be classified so that it can be imported into a database and shared with others over the Internet. The input to the parsing process is OCR text.
- For the hackathon there will be at least 600 examples of OCR text, in 3 groups of 200, that have already been classified/parsed by humans.
- This parsed text may be used to train learning algorithms.
- This set will also be used to evaluate the performance of parsing algorithms.
- Overfitting is a potential problem, so at the time of the hackathon we may provide additional test records for evaluation.
- There are several potential types of input to the parsing algorithms.
- The most basic form of input is OCR text in UTF-8 format from multiple engines.
- There may optionally be OCR with exact spatial information about the location of characters on the original image.
- This will allow some algorithms to exploit spatial information to identify elements. This format is, however, not a main focus for this hackathon.
- Some data dictionaries and authority files may be provided (or you may use those you have access to) to help clean up the OCR output before parsing.
- Those wishing to pursue other goals, such as image segmentation, finding specific elements, or improving the usability and user interfaces of the OCR output and parsing tools, are encouraged to do so and report back to the group at the hackathon.
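As a rough illustration of the expected output format, the sketch below writes parsed label records to CSV text with one column per Darwin Core element. The column list is only an illustrative subset of Darwin Core terms (the authoritative list is the definition document mentioned above), and the function name `records_to_csv` and the sample records are hypothetical.

```python
import csv
import io

# Illustrative subset of Darwin Core terms (an assumption). The authoritative
# column list is the definition document in the parsing directory.
DWC_COLUMNS = ["catalogNumber", "scientificName", "recordedBy", "eventDate", "country"]


def records_to_csv(records, columns=DWC_COLUMNS):
    """Serialize parsed label records to CSV text, one column per element."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Elements the parser could not find are written as empty cells.
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


# Hypothetical parser output for two labels (values invented for illustration).
parsed = [
    {"catalogNumber": "A-0001", "scientificName": "Cladonia rangiferina", "country": "Canada"},
    {"catalogNumber": "A-0002", "recordedBy": "J. Smith", "eventDate": "1923-06-14"},
]
csv_text = records_to_csv(parsed)
print(csv_text)
```

Leaving an unfilled element as an empty cell, rather than dropping the column, keeps every participant file aligned with the same headers as the gold and silver files.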
Metrics and Evaluation
- CSV files generated by participants will be compared with CSV files created by humans.
- A Presence-Absence matrix
- Confusion Matrix
- F-score (combines precision and recall to weigh correct against incorrect answers)
- Graphics may be created
- For example, with an F-score for each Darwin Core (dwc) element, we can generate a graph or histogram across all participants
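One possible way to compute a per-element F-score is sketched below. It is not the official scoring procedure. It assumes that gold and participant rows are already aligned record by record, that a value counts as correct only on an exact match after trimming whitespace, and that empty cells mean "no answer". The function name `f_scores` and the sample rows are hypothetical.

```python
def f_scores(gold_rows, participant_rows, columns):
    """Return {column: (precision, recall, f1)} for row-aligned records."""
    scores = {}
    for col in columns:
        correct = predicted = expected = 0
        for gold, part in zip(gold_rows, participant_rows):
            g = (gold.get(col) or "").strip()
            p = (part.get(col) or "").strip()
            if p:
                predicted += 1  # participant gave an answer
            if g:
                expected += 1   # gold standard has a value
            if p and p == g:
                correct += 1    # exact match (assumed matching rule)
        precision = correct / predicted if predicted else 0.0
        recall = correct / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[col] = (precision, recall, f1)
    return scores


# Invented example rows; real rows would be read from the gold and participant CSV files.
gold = [
    {"scientificName": "Cladonia rangiferina", "country": "Canada"},
    {"scientificName": "Usnea hirta", "country": "USA"},
]
participant = [
    {"scientificName": "Cladonia rangiferina", "country": "Canada"},
    {"scientificName": "Usnea hirta L.", "country": ""},
]
scores = f_scores(gold, participant, ["scientificName", "country"])
for element, (p, r, f) in scores.items():
    print(f"{element}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this invented example, `scientificName` scores 0.50 on all three measures because one of two answers matches, while `country` has perfect precision but only half the recall because one cell was left empty. Exact matching is strict. A real evaluation might instead normalize case or punctuation before comparing.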