m (→The Process) |
m (→Parameters) |
||
Line 19: | Line 19: | ||
:::; Gold CSV files : These Gold CSV files have darwin core element column headers and the data parsed into the appropriate column. Data to populate these Gold CSV files comes from the hand-transcribed gold text files. | :::; Gold CSV files : These Gold CSV files have darwin core element column headers and the data parsed into the appropriate column. Data to populate these Gold CSV files comes from the hand-transcribed gold text files. | ||
:::; Silver CSV files : These Silver CSV files also have the same darwin core element column headers and the data parsed into the appropriate column. But, the data here is from the OCR output "as is." The same data, with any OCR errors, from the same images is now captured and put into each silver CSV. | :::; Silver CSV files : These Silver CSV files also have the same darwin core element column headers and the data parsed into the appropriate column. But, the data here is from the OCR output "as is." The same data, with any OCR errors, from the same images is now captured and put into each silver CSV. | ||
== Accessing the Data Sets == | |||
*An AOCR VM is set up for all participants. | |||
**Host server name: aocr1.acis.ufl.edu | |||
**User name and password: provided to you at our first meeting and via email. | |||
*Sample of what you will see there: | |||
<pre>1. A human hand-transcribes the image (no errors) into a text file == gold.txt | |||
   sample: /home/aocr/egilbert/dataset/gold/outputs | |||
2. A human parses the data from the gold.txt files into a CSV file (Darwin Core fields) == gold.csv | |||
   sample: /home/aocr/egilbert/dataset/gold/parsed | |||
3. The OCR software of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) is run on the same images == silver.txt | |||
   sample: /home/aocr/egilbert/dataset/gold/parsed | |||
3a. A human parses the "dirty" OCR output from the silver.txt files into Darwin Core fields == silver.csv | |||
   sample: /home/aocr/egilbert/dataset/silver/parsed</pre> | |||
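Because a gold CSV and its silver CSV share the same Darwin Core column headers, the two can be compared column by column to see where OCR errors ended up. The following is a minimal sketch of that idea, not a tool provided on the VM: the function name <code>field_mismatches</code>, the choice of columns, and the sample rows are all invented for illustration, and it assumes both files list the same records in the same order.

```python
import csv
import io


def field_mismatches(gold_file, silver_file):
    """Count, per column, the rows where the silver value differs from gold.

    Hypothetical helper: assumes both CSVs use the same Darwin Core
    headers and list the same records in the same order.
    """
    gold_rows = list(csv.DictReader(gold_file))
    silver_rows = list(csv.DictReader(silver_file))
    counts = {}
    # zip() stops at the shorter file, so unmatched trailing rows are ignored.
    for gold, silver in zip(gold_rows, silver_rows):
        for column, gold_value in gold.items():
            if silver.get(column, "") != gold_value:
                counts[column] = counts.get(column, 0) + 1
    return counts


# Invented sample data: one record, with an OCR error in scientificName.
gold_csv = io.StringIO(
    "catalogNumber,scientificName,recordedBy\n"
    "1001,Cladonia rangiferina,A. Collector\n"
)
silver_csv = io.StringIO(
    "catalogNumber,scientificName,recordedBy\n"
    "1001,Cladonia rangifcrina,A. Collector\n"
)

sample_result = field_mismatches(gold_csv, silver_csv)
print(sample_result)
```

To run it against real files, open a gold.csv and the matching silver.csv with <code>open(path, newline="")</code> and pass the two file objects in place of the in-memory samples.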
== Parameters == | == Parameters == |