Hackathon Challenge: Difference between revisions

(70 intermediate revisions by 5 users not shown)
Line 35: Line 35:
:::; OCR engine of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) run on these images, with output written to silver.txt files : sample: ~/datasets/lichens/silver/outputs
:::; Human parses the "dirty" OCR text from silver.txt into the same Darwin Core fields, producing silver.csv : sample: ~/datasets/lichens/silver/parsed
=== Image Data Sets on the AOCR VM ===
:::;Data set 1 Lichen Images: /home/aocr/datasets/lichens/inputs/raw
:::;Data set 1 Lichen OCR output text files for parsing:
::::/home/aocr/datasets/lichens/outputs/tesseract
::::/home/aocr/datasets/lichens/outputs/abby
::::/home/aocr/datasets/lichens/outputs/gocr
::::/home/aocr/datasets/lichens/outputs/ocrad
::::/home/aocr/datasets/lichens/outputs/ocropus
:::;Data set 1 Lichen Authority Files: /home/aocr/datasets/lichens/authorityfiles
:::;Data set 2 Herbarium Sheet Images: 10000+ images in /home/aocr/datasets/herbs/inputs/raw
::::5000 are from NYBG in /home/aocr/sgottschalk_images.tar.gz
:::;Data set 2 Herbarium Sheet OCR output text files for parsing:
::::/home/aocr/datasets/herbs/outputs/gocr
::::/home/aocr/datasets/herbs/outputs/ocrad
::::/home/aocr/datasets/herbs/outputs/ocropus
::::/home/aocr/datasets/herbs/outputs/tesseract
:::;'''[[Media:01498198.jpg|SAMPLE IMAGE]]''' parsed in the SAMPLE CSV next.
:::;'''[[Media:SampleCSV.jpg|SAMPLE PARSED CSV FILE]]''' to show '''column headers and values'''
:::;Data set 3 Entomology Images: /home/aocr/datasets/ent/inputs/raw
::::/home/aocr/oboyski
::::or see /home/aocr/oboyski_images.tar.gz
:::;Data set 3 Entomology OCR output ABBYY text files for parsing: /home/aocr/datasets/ent/outputs/abbyy
=== [[Dataset Errata]]  ===
* Known and newly discovered errors in the .txt and .csv files, listed as they are found.


== Parameters ==
Line 70: Line 100:
:::; Parsed Field Evaluation : The effectiveness of parsing will be evaluated with a confusion matrix. Each row is named for one of the possible element names for parts of a label (the correct element), and the columns carry the same names (the element the parser assigned). Counts along the diagonal are the number of items tagged correctly. For example, an item that is correctly labeled as a county adds one to the diagonal cell for county. If a county is incorrectly marked as a stateProvince, 1 is added to the “county” row under the stateProvince column. This format therefore provides counts of correct classifications, false positives, and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
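The confusion-matrix scoring described above can be sketched in a few lines of Python. This is only an illustration, not the official evaluation script: the function names <code>confusion_matrix</code> and <code>per_label_scores</code>, the sample labels, and the choice of the balanced F1 variant of the F-score are all assumptions made for the example. Rows are the correct (gold) element names and columns are the names the parser assigned.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count (gold_label, predicted_label) pairs.

    The first element of each key is the row (correct element name),
    the second is the column (element name assigned by the parser).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def per_label_scores(matrix):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    labels = sorted({g for g, _ in matrix} | {p for _, p in matrix})
    scores = {}
    for label in labels:
        # Diagonal cell: items tagged correctly.
        tp = matrix[(label, label)]
        # Same column, other rows: items wrongly given this label.
        fp = sum(n for (g, p), n in matrix.items() if p == label and g != label)
        # Same row, other columns: items of this label tagged as something else.
        fn = sum(n for (g, p), n in matrix.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical example: one county was mistakenly tagged as stateProvince.
gold = ["county", "county", "stateProvince", "country"]
predicted = ["county", "stateProvince", "stateProvince", "country"]

matrix = confusion_matrix(gold, predicted)
scores = per_label_scores(matrix)

for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this sample the mistagged county lowers recall for county (a false negative in the county row) and lowers precision for stateProvince (a false positive in the stateProvince column), which is the behaviour the description above implies.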
== Back to the [[2013 AOCR Hackathon Wiki| Hackathon Wiki]] ==