The 2013 AOCR Challenge
Parsing is one of the most significant areas of interest for making better use of OCR output. Digitization and data curation of specimens in biodiversity museum collections can be sped up if OCR output can be parsed faster and more accurately, then packaged into semantically meaningful units for insertion into a database.
The Specific Task
Given a set of images, either parse the existing OCR output or repeat the OCR with the software of your choice and parse the new output, populating as many of the selected Darwin Core (and other) data elements as possible.
- Parsers should produce at least CSV output whose column headers are Darwin Core (http://rs.tdwg.org/dwc/terms/) elements, plus some extended element names.
- The full set of valid categories is defined in a definition document in the parsing directory of the A-OCR virtual machine.
- All of this information on the label needs to be classified so that it can be imported into a database and shared with others over the Internet. The input to the parsing process is OCR text.
- For the hackathon there will be at least 600 examples of OCR text, in 3 groups of 200, that have already been classified and parsed by humans.
- This parsed text may be used to train learning algorithms.
- This set will also be used to evaluate the performance of parsing algorithms. Overfitting is a potential problem, so at the time of the hackathon we may provide additional testing records for evaluation.
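
As a rough illustration of the output format and evaluation idea in the list above, the following Python sketch writes parsed records to CSV with Darwin Core term names as column headers, then scores them against human-parsed records. The four columns are real Darwin Core terms but only an assumed subset; the authoritative list of categories is the definition document in the parsing directory. The sample records are hypothetical, and per-field exact matching is an assumed metric, not necessarily the one used at the hackathon.

```python
import csv
import io

# Assumed subset of Darwin Core terms. The full set of valid categories is
# defined in the definition document, not here.
FIELDS = ["scientificName", "recordedBy", "eventDate", "country"]


def write_csv(records, handle):
    """Write parsed records as CSV with Darwin Core terms as column headers."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for record in records:
        # Fields the parser could not populate are written as empty cells.
        writer.writerow(record)


def field_accuracy(parsed, gold):
    """Per-field exact-match accuracy (an assumed metric, not the official one)."""
    scores = {}
    for field in FIELDS:
        matches = sum(
            p.get(field, "").strip() == g.get(field, "").strip()
            for p, g in zip(parsed, gold)
        )
        scores[field] = matches / len(gold) if gold else 0.0
    return scores


# Hypothetical human-parsed records and hypothetical parser output.
gold = [
    {"scientificName": "Cladonia rangiferina", "recordedBy": "J. Smith",
     "eventDate": "1932-07-14", "country": "Canada"},
    {"scientificName": "Usnea hirta", "recordedBy": "A. Jones",
     "eventDate": "1941-05-02", "country": "USA"},
]
parsed = [
    {"scientificName": "Cladonia rangiferina", "recordedBy": "J. Smith",
     "eventDate": "1932-07-14", "country": "Canada"},
    {"scientificName": "Usnea hirta", "recordedBy": "",
     "eventDate": "1941-05-02"},
]

buffer = io.StringIO()
write_csv(parsed, buffer)
csv_text = buffer.getvalue()
scores = field_accuracy(parsed, gold)
```

Scoring each field separately shows which data elements a parser handles poorly. Keeping some human-parsed records out of training and scoring only on those is one simple way to notice the overfitting mentioned above.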