:::; OCR engine of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) run on these images = output to silver.txt files : sample: ~/datasets/lichens/silver/outputs
:::; human parses "dirty" OCR out of silver.txt into the same Darwin Core fields = silver.csv : sample: ~/datasets/lichens/silver/parsed
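The OCR step above can be scripted. The following is a minimal sketch, assuming Tesseract is the chosen engine and that its command line is invoked as <code>tesseract IMAGE OUTPUT_BASE</code>, which writes <code>OUTPUT_BASE.txt</code>. The function names <code>tesseract_command</code> and <code>plan_ocr_run</code> and the <code>*.jpg</code> pattern are illustrative assumptions, not part of the hackathon tooling.

```python
from pathlib import Path


def tesseract_command(image_path, output_dir):
    """Build the argument list for one Tesseract run (hypothetical helper).

    Tesseract appends ".txt" to the output base itself, so the base is the
    image file name without its extension.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base)]


def plan_ocr_run(image_dir, output_dir, pattern="*.jpg"):
    """Return one command per image in image_dir, in sorted order.

    The "*.jpg" default is an assumption about how the images are named.
    """
    images = sorted(Path(image_dir).glob(pattern))
    return [tesseract_command(image, output_dir) for image in images]
```

Each returned command can then be executed with <code>subprocess.run(command, check=True)</code> on a machine where Tesseract is installed. Building the commands separately from running them makes it easy to inspect the plan before a long batch starts.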
=== Image Data Sets on the AOCR VM ===
:::;Data set 1 Lichen Images: /home/aocr/datasets/lichens/inputs/raw
:::;Data set 1 Lichen OCR output text files for parsing:
::::/home/aocr/datasets/lichens/outputs/tesseract
::::/home/aocr/datasets/lichens/outputs/abby
::::/home/aocr/datasets/lichens/outputs/gocr
::::/home/aocr/datasets/lichens/outputs/ocrad
::::/home/aocr/datasets/lichens/outputs/ocropus
:::;Data set 1 Lichen Authority Files: /home/aocr/datasets/lichens/authorityfiles
:::;Data set 2 Herbarium Sheet Images: 10000+ images in /home/aocr/datasets/herbs/inputs/raw
::::5000 are from NYBG in /home/aocr/sgottschalk_images.tar.gz
:::;Data set 2 Herbarium Sheet OCR output text files for parsing:
::::/home/aocr/datasets/herbs/outputs/gocr
::::/home/aocr/datasets/herbs/outputs/ocrad
::::/home/aocr/datasets/herbs/outputs/ocropus
::::/home/aocr/datasets/herbs/outputs/tesseract
:::;'''[[Media:01498198.jpg|SAMPLE IMAGE]]''' parsed in the SAMPLE CSV next.
:::;'''[[Media:SampleCSV.jpg|SAMPLE PARSED CSV FILE]]''' to show '''column headers and values'''
:::;Data set 3 Entomology Images: /home/aocr/datasets/ent/inputs/raw
::::/home/aocr/oboyski
::::or see /home/aocr/oboyski_images.tar.gz
:::;Data set 3 Entomology OCR output ABBYY text files for parsing: /home/aocr/datasets/ent/outputs/abbyy
=== [[Dataset Errata]] ===
* Known and newly discovered errors in the .txt and .csv files, recorded as they are found.
== Parameters ==
:::; Parsed Field Evaluation : The effectiveness of parsing will be evaluated with a confusion matrix. Each row is named for one of the possible element names for parts of a label, and the columns carry the same names. Counts along the diagonal are the number of items that were tagged correctly. For example, a county that is correctly labeled as a county adds one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides counts of correct classifications, false positives, and false negatives. We will calculate precision, recall, f-score, and potentially other measures.
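The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the official scoring script. It assumes the parsed results are available as (true field, predicted field) pairs, and the function names <code>confusion_matrix</code> and <code>field_scores</code> are hypothetical. Rows are the true field names and columns are the predicted field names, as in the description.

```python
def confusion_matrix(pairs):
    """Count (true_field, predicted_field) pairs into a nested dict.

    matrix[true_field][predicted_field] is the number of items whose true
    field is true_field and that the parser tagged as predicted_field.
    """
    matrix = {}
    for true_field, predicted_field in pairs:
        row = matrix.setdefault(true_field, {})
        row[predicted_field] = row.get(predicted_field, 0) + 1
    return matrix


def field_scores(matrix, field):
    """Return (precision, recall, f_score) for one field name."""
    row = matrix.get(field, {})
    true_pos = row.get(field, 0)
    # False negatives: the rest of this field's row.
    false_neg = sum(row.values()) - true_pos
    # False positives: the rest of this field's column.
    false_pos = sum(
        other_row.get(field, 0)
        for name, other_row in matrix.items()
        if name != field
    )
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: one county was mistakenly tagged as stateProvince.
pairs = [
    ("county", "county"),
    ("county", "county"),
    ("county", "stateProvince"),
    ("stateProvince", "stateProvince"),
]
matrix = confusion_matrix(pairs)
county_scores = field_scores(matrix, "county")
state_scores = field_scores(matrix, "stateProvince")
```

In this toy data the mislabeled county lowers recall for county (a false negative in its row) and lowers precision for stateProvince (a false positive in its column).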
== Back to the [[2013 AOCR Hackathon Wiki|Hackathon Wiki]] ==