The 2013 AOCR Challenge
One of the most significant areas of interest for improving the use of OCR output is parsing. Digitization, curation, and dissemination of specimen data from biodiversity museum collections can be sped up if OCR output can be parsed faster and more accurately and packaged into semantically meaningful units for insertion into a database.
The Specific Task
Given a set of images, either parse the existing OCR output or repeat the OCR with the software of your choice and parse the new output. The goal is to populate as many of the selected Darwin Core (and other) data elements as possible in a CSV file. These participant-generated CSV files will be compared against human hand-parsed gold and silver CSV files.
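As a rough illustration of the expected deliverable, the sketch below writes parsed records to CSV with Darwin Core column headers. The `FIELDS` list, the helper names, and the sample record are assumptions made for this example only; the actual list of target elements is the one defined for the hackathon.

```python
import csv
import io

# Hypothetical subset of target columns, for illustration only.
FIELDS = ["catalogNumber", "scientificName", "country", "stateProvince", "county"]


def write_records(records, handle):
    """Write parsed label records (dicts) as CSV with a Darwin Core header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=FIELDS,
        restval="",             # elements the parser could not find stay empty
        extrasaction="ignore",  # silently drop keys that are not target columns
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def to_csv(records):
    """Return the CSV text for a list of parsed records."""
    buffer = io.StringIO(newline="")
    write_records(records, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up record; a real parser would fill this from OCR text.
    print(to_csv([{"catalogNumber": "0107576", "country": "U.S.A."}]))
```

Leaving unfound elements empty rather than guessing keeps the output comparable column by column with the gold and silver files.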
Three Data Sets
There are three data sets, that is, three different sets of images of museum specimen labels. Participants, working alone or in groups, may work on one or more data sets as they choose. The sets are ranked easy, medium, and hard, as an estimate of how difficult it might be to get good parsed data from the OCR output of each data set.
- Set 1 (easy)
- 10,000 images of lichen and bryophyte packet labels from the Lichens, Bryophytes and Climate Change (LBCC) TCN. These are considered easy because the JPG images show the label only, and the data on the label is mostly typed or printed, with little or no handwriting.
- Set 2 (medium)
- 5,000 Botanical Research Institute of Texas (BRIT) Herbarium and 5,000 New York Botanical Garden Herbarium specimen sheets. These are full sheets; again, most have been pre-selected to focus on labels containing mostly printed or typed text and little handwriting. Note that some exceptions were included to make the data set more realistic (and more difficult).
- Set 3 (hard)
- Several thousand images from the Essig Museum and the CalBug project. The gold set has not yet been created for these (in progress). Silver set creation needs to be discussed.
For each of the three image data sets, 200 images were selected (hand-picked) for creating a human hand-parsed standard for metrics. Three different files have been created for each of these selected images.
- Perfect OCR text files
- Hand-transcribed from each image, these text files faithfully (exactly) represent what is in the image. They reflect what the output would look like if the OCR engine understood all the data in the image (including the handwriting).
- Gold CSV files
- These Gold CSV files have Darwin Core element column headers, with the data parsed into the appropriate columns. The data comes from the hand-transcribed gold text files.
- Silver CSV files
- These Silver CSV files have the same Darwin Core element column headers, with the data parsed into the appropriate columns. Here, however, the data comes from the OCR output "as is": the same data from the same images, including any OCR errors, is captured in each silver CSV.
Accessing the Data Sets
- An AOCR VM is set up for all participants.
- host server name: aocr1.acis.ufl.edu
- user name and password: given to you at our first meeting and via email.
- Software and configuration
- services: ftp, ssh, mysql, apache
- OCR software: tesseract, jocr (gocr), ocropus, imagemagick, zbar
- The MySQL username and password are the same as the login; the database is aocr.
- Apache root directory is /home/aocr/webroot
- Sample of what you will see there for Set 1 (LBCC TCN lichen bryophyte packet labels):
- human hand-types image data (no errors) into text file == gold.txt
- sample: ~/datasets/lichens/gold/outputs
- human parses data from gold.txt files into gold CSV file (Darwin Core fields) == gold.csv
- sample: ~/datasets/lichens/gold/parsed
- OCR engine of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) run on these images, output to text file == silver.txt
- sample: ~/datasets/lichens/silver/outputs
- human parses "dirty" OCR output from silver.txt files into the same Darwin Core fields == silver.csv
- sample: ~/datasets/lichens/silver/parsed
Image Data Sets on the AOCR VM
- Data set 1 Lichen Images
- Data set 1 Lichen OCR output text files for parsing
- Data set 1 Lichen Authority Files
- Data set 2 Herbarium Sheet Images
- 10000+ images in /home/aocr/datasets/herbs/inputs/raw
- 5000 are from NYBG in /home/aocr/sgottschalk_images.tar.gz
- Data set 2 Herbarium Sheet OCR output text files for parsing
- SAMPLE IMAGE parsed in the SAMPLE CSV next.
- SAMPLE PARSED CSV FILE to show column headers and values
- Data set 3 Entomology Images
- or see /home/aocr/oboyski_images.tar.gz
- Data set 3 Entomology OCR output ABBYY text files for parsing
- known / discovered errors in the .txt, .csv files as they are found.
Gold Parsing Errors
Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl)
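A minimal sketch of how a parser might derive decimalLatitude or decimalLongitude from a verbatim value is shown here. It assumes the two notations quoted elsewhere in these notes (degrees with decimal minutes, and degrees, minutes, seconds) followed by a hemisphere letter; the function name and the six-decimal rounding are choices made for this example.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*'\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*"\s*)?
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE,
)


def to_decimal(verbatim):
    """Convert a verbatim coordinate to signed decimal degrees.

    Returns None when the text does not match the assumed notation,
    so that unparsed values can be left blank instead of guessed.
    """
    match = _DMS.match(verbatim)
    if match is None:
        return None
    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    if match.group("hem") in "SW":
        value = -value
    return round(value, 6)


if __name__ == "__main__":
    for text in ["60° 33.579'N", "38°42'20\"N"]:
        print(text, "->", to_decimal(text))
```

The verbatim field would still be stored unchanged; the decimal value goes in its own column.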
This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather MinimumElevationInMeters, and MaximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl)
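To illustrate the split suggested above, the following sketch keeps the verbatim string untouched and extracts a numeric value in meters from it. The function name is hypothetical, and the sketch only accepts a single number with an optional meter unit; ranges, feet, and qualifiers such as "ca." are deliberately returned as None rather than guessed.

```python
import re

# A single number, optionally followed by a meter unit and a trailing period.
_ELEVATION = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:m|meters|metres)?\.?\s*$",
    re.IGNORECASE,
)


def elevation_in_meters(verbatim_elevation):
    """Return the numeric elevation in meters, or None if it cannot be read."""
    match = _ELEVATION.match(verbatim_elevation)
    if match is None:
        return None
    return float(match.group(1))


if __name__ == "__main__":
    verbatim = "750 m"
    print({"verbatimElevation": verbatim,
           "minimumElevationInMeters": elevation_in_meters(verbatim)})
```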
Inconsistency in the Gold Parsed labels for Country. If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA. Some Gold parsed results leave it blank, some fill it in with "USA" or "United States", though neither of these is on the label. I think it is valid to fill it in, but it should be consistent. (Daryl)
Many Gold Parse Tennessee lichen labels have country errors. Examples:
-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl)
-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl)
Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:
-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard Darwin Core format. Verbatim would be: Nov. 12, 1939. Darwin Core would be: 1939-11-12. Listed is: 1939-November-12.
-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963
-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl)
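The three examples above suggest a small normalization step from verbatim dates to the year-month-day form. The sketch below is one possible approach, assuming English month names or abbreviations and only the two orderings seen in these labels ("Nov. 12, 1939" and "8 Aug 1954"); the function name is made up for this example.

```python
import re
from datetime import date

_MONTH_NAMES = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"]
MONTHS = {name: number for number, name in enumerate(_MONTH_NAMES, start=1)}

# "8 Aug 1954" or "3 Feb. 1963"
_DAY_FIRST = re.compile(r"^(\d{1,2})\s+([A-Za-z]+)\.?,?\s+(\d{4})$")
# "Nov. 12, 1939" or "November 12 1939"
_MONTH_FIRST = re.compile(r"^([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})$")


def to_iso_date(verbatim):
    """Return YYYY-MM-DD for a verbatim label date, or None if unrecognized."""
    text = verbatim.strip()
    match = _DAY_FIRST.match(text)
    if match:
        day, month_name, year = match.groups()
    else:
        match = _MONTH_FIRST.match(text)
        if match is None:
            return None
        month_name, day, year = match.groups()
    # Only the first three letters are used, so "Nov" and "November" both work.
    month = MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        # Impossible calendar dates (for example day 31 in a 30-day month).
        return None


if __name__ == "__main__":
    for text in ["Nov. 12, 1939", "3 Feb. 1963", "8 Aug 1954"]:
        print(text, "->", to_iso_date(text))
```

Returning None for anything unrecognized keeps the verbatim value as the only record instead of emitting a hybrid such as 1939-November-12.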
Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.)
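On the doubled double quote: this is the usual CSV escaping for a quote inside a quoted field, and a CSV-aware reader undoes it. The short sketch below, using Python's standard csv module and the coordinate string as it appears on the label, shows the round trip; whether a given database loader does the same is something to check separately.

```python
import csv
import io


def round_trip(value):
    """Write one field to CSV text and read it back.

    Returns (encoded_line, decoded_value).
    """
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow([value])
    encoded = buffer.getvalue()
    decoded = next(csv.reader(io.StringIO(encoded, newline="")))[0]
    return encoded, decoded


if __name__ == "__main__":
    label_value = "38°42'20\"N, 83°08'25'W"
    encoded, decoded = round_trip(label_value)
    print(encoded.strip())         # the field is quoted and the inner quote doubled
    print(decoded == label_value)  # the reader restores the original text
```

Problems are more likely when a file is split on commas or quotes by hand, which is a reason to compare fields only after reading them with a CSV parser.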
Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates.
Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period.
Gold Parsed NY01075761_lg.csv corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until that is done, the parsing should be verbatim (see below under Gold OCR Errors).
Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file.
Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present on the .txt file.
Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio"
Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality.
Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...?
Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA"
Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label.
Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it occurs on some labels but not on others.
Gold OCR Errors
NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.
WIS-L-0012026_lg.txt has several errors: the "N" in Latitude is replaced with a "K"; there is a question mark instead of an apostrophe in Longitude; "Sandra Looman" is replaced with "Sandra Lcoman"; and there are two dots after the date.
TENN-L-0000029_lg.txt adds a "1" to the scientificName ("Actinogyra muhlenbergii 1 (Ach.) Schol.").
- Parsers should produce, at a minimum, CSV output whose column headers are Darwin Core (http://rs.tdwg.org/dwc/terms/) elements, plus some extended element names.
- List of target core data elements
- The full set of valid categories is defined in a definition document in the parsing directory of the A-OCR virtual machine.
- All of this information needs to be classified on the label so that it can be imported to a database and shared with others over the Internet. The input to the parsing process is OCR text.
- For the hackathon there will be at least 600 examples of OCR text, in 3 groups of 200, that have been previously properly classified/parsed by humans.
- This parsed text may be used for training some learning algorithms.
- This set will also be used for evaluation of performance of parsing algorithms.
- Overfitting is a potential problem so at the time of the hackathon we may provide additional testing records for evaluation.
- There are several potential types of input to the parsing algorithms.
- The most basic form of input is OCR text in UTF-8 format from multiple engines.
- There may optionally be OCR with exact spatial information about the location of characters on the original image.
- This will allow some algorithms to exploit spatial information to identify elements. This format is, however, not a main focus for this hackathon.
- Some data dictionaries and authority files may be provided (or you may use those you have access to) in efforts to have cleaner OCR output before parsing.
- Lichen authority files can be found in: ~/datasets/lichens/authorityfiles/
- Those wishing to pursue other goals such as image segmentation, finding specific elements, or improving usability & user interfaces to the OCR output and parsing tools are encouraged to do so and report back to the group at the hackathon.
Metrics and Evaluation
- CSV files generated by participants will be compared with CSV files created by humans.
- a Presence-Absence matrix
- Confusion Matrix
- F-Score (weighs correct / incorrect answers)
- Time needed to generate CSV output from running algorithms
- (MG) suggested adding this metric if possible, at our 11 Jan 2013 virtual meeting.
- Graphics may be created
- For example, with an F-score for each dwc element entry, we can generate a graph / histogram across all participants.
- We will attempt to provide services that can validate the outcomes of hackathon deliverables. This hackathon is not structured as a competition, but we felt it would be beneficial for participants to have some baseline to evaluate the effectiveness of their methods.
- OCR Text Evaluation
- Evaluation of OCR output will be based on a comparison to the Gold hand-typed outputs, using confusion-matrix-like criteria for evaluating word presence, word correctness, and avoidance of non-text garbage regions. We will attempt to avoid penalizing attempts at text recognition in barcode and handwritten regions.
- Parsed Field Evaluation
- Evaluation of the effectiveness of parsing will be calculated based on a confusion matrix. Rows are named with each of the possible element names for parts of a label. Columns are also these same names. Counts along the diagonal represent the number of items that were tagged correctly. For example, a county that is correctly labeled as a county will add one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides a count of correct classifications and counts of false positives and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
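The confusion-matrix bookkeeping described above can be sketched as follows. This is only an illustration of the idea, not the evaluation service itself: the function names are made up, the matrix is stored as counts keyed by (gold element, predicted element), and the F-score shown is the balanced F1.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (gold_element, predicted_element) pairs.

    Keys with gold == predicted are the diagonal of the matrix.
    """
    return Counter(pairs)


def scores(matrix, element):
    """Return (precision, recall, f1) for one element name."""
    true_pos = matrix[(element, element)]
    # Column total minus the diagonal: other items wrongly tagged as this element.
    false_pos = sum(n for (gold, pred), n in matrix.items()
                    if pred == element and gold != element)
    # Row total minus the diagonal: items of this element tagged as something else.
    false_neg = sum(n for (gold, pred), n in matrix.items()
                    if gold == element and pred != element)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up tags: two counties tagged correctly, one county tagged as stateProvince.
    matrix = confusion_counts([
        ("county", "county"),
        ("county", "county"),
        ("county", "stateProvince"),
        ("stateProvince", "stateProvince"),
    ])
    for name in ("county", "stateProvince"):
        print(name, scores(matrix, name))
```

Computing the scores per element makes it possible to plot one value per Darwin Core field across participants.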