Hackathon Challenge

== '''The Specific Task'''  ==
<blockquote>Given a set of images, parse existing OCR output or repeat the OCR with the software of choice and then parse the new OCR output attempting to successfully populate as many of the selected Darwin Core (and other) data elements as possible into a CSV file. These participant-generated CSV files will be compared against human hand-parsed ''gold'' and ''silver'' CSV files.</blockquote>
== '''The Process''' ==
For each of the three image data sets, 200 images were hand-picked to create a human hand-parsed standard for metrics. Three different files have been created for each of the selected images:
; ''Perfect OCR text files'' : Hand-transcribed from each image, these text files reproduce exactly what is in the image. They are meant to show what the output would look like if the OCR understood all the data in the image, including the handwriting.
; Gold CSV files : These files have Darwin Core element column headers, with the data parsed into the appropriate columns. The data in the Gold CSV files comes from the hand-transcribed gold text files.
; Silver CSV files : These files have the same Darwin Core element column headers, with the data parsed into the appropriate columns, but the data comes from the OCR output "as is." The same data from the same images, including any OCR errors, is captured in each Silver CSV file.
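
As a rough illustration of how a participant-generated CSV file might be checked against a Gold or Silver CSV file, the Python sketch below computes a per-column exact-match score. It is only a sketch under stated assumptions: this page does not define the scoring rule, so exact string matching is an assumption; the <code>filename</code> key column and the helper names <code>load_rows</code> and <code>field_accuracy</code> are hypothetical; and the sample rows are invented for the example. The column names <code>recordedBy</code>, <code>scientificName</code>, and <code>eventDate</code> are Darwin Core terms, but they are not necessarily the elements selected for the challenge.

```python
import csv
import io


def load_rows(csv_text, key="filename"):
    """Index CSV rows by the image key column (hypothetical column name)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def field_accuracy(participant, standard, key="filename"):
    """Per column, the fraction of non-empty standard values matched exactly.

    Exact matching after whitespace stripping is an assumption of this
    sketch, not the official hackathon metric.
    """
    if not standard:
        return {}
    columns = [c for c in next(iter(standard.values())) if c != key]
    scores = {}
    for col in columns:
        total = correct = 0
        for image, std_row in standard.items():
            expected = (std_row.get(col) or "").strip()
            if not expected:
                continue  # nothing to score when the standard is empty
            total += 1
            got = (participant.get(image, {}).get(col) or "").strip()
            correct += got == expected
        if total:
            scores[col] = correct / total
    return scores


# Invented sample data, for illustration only.
GOLD = """filename,recordedBy,scientificName,eventDate
img001.jpg,J. Smith,Cladonia rangiferina,1932-07-14
img002.jpg,A. Jones,Usnea hirta,1948-05-02
"""
PARTICIPANT = """filename,recordedBy,scientificName,eventDate
img001.jpg,J. Smith,Cladonia rangiferina,1932-07-14
img002.jpg,A. Jones,Usnea hirla,
"""

gold = load_rows(GOLD)
submitted = load_rows(PARTICIPANT)
scores = field_accuracy(submitted, gold)
for column, score in scores.items():
    print(f"{column}: {score:.2f}")
```

Running the same comparison against both the Gold and the Silver file would, under these assumptions, help separate parsing errors from errors that originate in the OCR itself.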


== Parameters ==