Hackathon Challenge: Difference between revisions

:::; Gold CSV files : These files have Darwin Core element column headers, with the data parsed into the appropriate columns. The data comes from the hand-transcribed gold text files.
:::; Silver CSV files : These files have the same Darwin Core element column headers, with the data parsed into the appropriate columns. Here, however, the data comes from the OCR output "as is": the same data from the same images, with any OCR errors included.
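
Because the gold and silver CSV files share the same column headers, they can be compared field by field. The following is a minimal sketch only: the column names are Darwin Core terms, but the sample values, the OCR errors, and the function names <code>read_rows</code> and <code>field_mismatches</code> are invented for illustration and are not part of the data set.

```python
import csv
import io

# Hypothetical sample data: one gold row and the matching silver row.
# The values are invented; only the header style follows Darwin Core.
GOLD_CSV = """catalogNumber,scientificName,recordedBy,country
0001,Cladonia rangiferina,A. Collector,United States
"""

SILVER_CSV = """catalogNumber,scientificName,recordedBy,country
0001,Cladonia rangiferlna,A. Collector,Unlted States
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


def field_mismatches(gold_text, silver_text):
    """Return (row_index, column, gold_value, silver_value) for each differing field.

    Rows are paired by position (0-based), which assumes both files list
    the same records in the same order.
    """
    mismatches = []
    pairs = zip(read_rows(gold_text), read_rows(silver_text))
    for index, (gold_row, silver_row) in enumerate(pairs):
        for column, gold_value in gold_row.items():
            silver_value = silver_row.get(column, "")
            if gold_value != silver_value:
                mismatches.append((index, column, gold_value, silver_value))
    return mismatches


if __name__ == "__main__":
    for mismatch in field_mismatches(GOLD_CSV, SILVER_CSV):
        print(mismatch)
```

Pairing rows by position is the simplest choice for a sketch; if the real files do not list records in the same order, a shared identifier column would be a safer key.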
== Accessing the Data Sets ==
*An AOCR VM is set up for all participants.
**host server name: aocr1.acis.ufl.edu
**user name and password: given to you at our first meeting and via email.
*Sample of what you will see there:
<pre>1. A human hand-transcribes the image (no errors) into a text file == gold.txt
    sample: /home/aocr/egilbert/dataset/gold/outputs
2. A human parses the data from the gold.txt files into a CSV file (Darwin Core fields) == gold.csv
    sample: /home/aocr/egilbert/dataset/gold/parsed
3. OCR software of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) is run on these images == silver.txt
    sample: /home/aocr/egilbert/dataset/gold/parsed
3a. A human parses the "dirty" OCR text from these silver.txt files into Darwin Core fields == silver.csv
    sample: /home/aocr/egilbert/dataset/silver/parsed</pre>
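
To work with the parsed directories listed above, it may help to pair each gold CSV with its silver counterpart. This sketch assumes that matching files share the same file name and use a <code>.csv</code> extension, neither of which is confirmed here; the function name <code>pair_csv_files</code> is hypothetical.

```python
from pathlib import Path

# Directory paths taken from the sample listing above.
GOLD_PARSED = Path("/home/aocr/egilbert/dataset/gold/parsed")
SILVER_PARSED = Path("/home/aocr/egilbert/dataset/silver/parsed")


def pair_csv_files(gold_dir, silver_dir):
    """Return (gold_path, silver_path) pairs for file names found in both directories.

    Assumption: a gold file and its silver counterpart have identical names.
    Files present in only one directory are skipped.
    """
    gold = {path.name: path for path in Path(gold_dir).glob("*.csv")}
    silver = {path.name: path for path in Path(silver_dir).glob("*.csv")}
    shared_names = sorted(gold.keys() & silver.keys())
    return [(gold[name], silver[name]) for name in shared_names]


if __name__ == "__main__":
    for gold_path, silver_path in pair_csv_files(GOLD_PARSED, SILVER_PARSED):
        print(gold_path.name, "<->", silver_path.name)
```

If the actual naming scheme differs, only the dictionary keys need to change, for example to a file stem or an image identifier.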


== Parameters ==