Hackathon Challenge: Difference between revisions

Hackathon Challenge (view source)

Revision as of 13:23, 6 January 2014

1,410 bytes added, 6 January 2014

→‎Image Data Sets on the AOCR VM

Joanna

@@ Line 35: / Line 35: @@
 :::; OCR engine of choice (ABBYY, Tesseract, GOCR/JOCR, OCRopus, OmniPage) run on these images; output goes to silver.txt files : sample: ~/datasets/lichens/silver/outputs
 :::; human parses the "dirty" OCR output in silver.txt into the same Darwin Core fields, producing silver.csv : sample: ~/datasets/lichens/silver/parsed
+=== Image Data Sets on the AOCR VM ===
+:::;Data set 1 Lichen Images: /home/aocr/datasets/lichens/inputs/raw
+:::;Data set 1 Lichen OCR output text files for parsing:
+::::/home/aocr/datasets/lichens/outputs/tesseract
+::::/home/aocr/datasets/lichens/outputs/abby
+::::/home/aocr/datasets/lichens/outputs/gocr
+::::/home/aocr/datasets/lichens/outputs/ocrad
+::::/home/aocr/datasets/lichens/outputs/ocropus
+:::;Data set 1 Lichen Authority Files: /home/aocr/datasets/lichens/authorityfiles
+:::;Data set 2 Herbarium Sheet Images: 10000+ images in /home/aocr/datasets/herbs/inputs/raw
+::::5000 of these are from NYBG; see /home/aocr/sgottschalk_images.tar.gz
+:::;Data set 2 Herbarium Sheet OCR output text files for parsing:
+::::/home/aocr/datasets/herbs/outputs/gocr
+::::/home/aocr/datasets/herbs/outputs/ocrad
+::::/home/aocr/datasets/herbs/outputs/ocropus
+::::/home/aocr/datasets/herbs/outputs/tesseract
+:::;'''[[Media:01498198.jpg|SAMPLE IMAGE]]''' parsed in the SAMPLE CSV next.
+:::;'''[[Media:SampleCSV.jpg|SAMPLE PARSED CSV FILE]]''' to show '''column headers and values'''
+:::;Data set 3 Entomology Images: /home/aocr/datasets/ent/inputs/raw
+::::/home/aocr/oboyski
+::::or see /home/aocr/oboyski_images.tar.gz
+:::;Data set 3 Entomology OCR output ABBYY text files for parsing: /home/aocr/datasets/ent/outputs/abbyy
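The directory layout listed above can be walked with a short script. The following is a minimal sketch, not part of the hackathon tooling: the helper name `count_ocr_outputs` is hypothetical, and it assumes each engine directory holds its OCR results as `.txt` files directly inside it. The directory names are copied verbatim from the list, including the `abby` spelling used for the lichen set.

```python
from pathlib import Path

# Engine sub-directories per data set, copied from the list on this page.
OCR_OUTPUT_DIRS = {
    "lichens": ["tesseract", "abby", "gocr", "ocrad", "ocropus"],
    "herbs": ["gocr", "ocrad", "ocropus", "tesseract"],
    "ent": ["abbyy"],
}


def count_ocr_outputs(root="/home/aocr/datasets", layout=OCR_OUTPUT_DIRS):
    """Return {(dataset, engine): number of .txt files found}.

    Assumption: OCR outputs are .txt files stored directly in
    <root>/<dataset>/outputs/<engine>. Missing directories count as 0.
    """
    root = Path(root)
    counts = {}
    for dataset, engines in layout.items():
        for engine in engines:
            out_dir = root / dataset / "outputs" / engine
            if out_dir.is_dir():
                counts[(dataset, engine)] = sum(1 for _ in out_dir.glob("*.txt"))
            else:
                counts[(dataset, engine)] = 0
    return counts
```

Calling `count_ocr_outputs()` on the VM would give a quick check of which engine outputs are present before parsing starts; a different `root` can be passed when working on a local copy.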
+=== [[Dataset Errata]] ===
+*Known and newly discovered errors in the .txt and .csv files are listed here as they are found.
 == Parameters ==
@@ Line 70: / Line 100: @@
 :::; Parsed Field Evaluation : The effectiveness of parsing will be evaluated with a confusion matrix. Rows are named for each possible element name for parts of a label, and columns carry the same names. Counts along the diagonal represent the number of items that were tagged correctly. For example, a county that is correctly labeled as a county adds one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides counts of correct classifications, false positives, and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
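The confusion-matrix scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official evaluation script: the function names are hypothetical, the gold and predicted labels are assumed to be aligned item by item, and the F-score shown is the balanced F1 variant. Rows are the true element names and columns are the assigned ones, as in the description.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold_label, predicted_label) pairs: rows are gold, columns predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def per_label_scores(matrix, label):
    """Return (precision, recall, f1) for one element name.

    Assumption: a score with a zero denominator is reported as 0.0.
    """
    tp = matrix[(label, label)]  # diagonal cell
    # Column total minus the diagonal: other items wrongly tagged as `label`.
    fp = sum(n for (g, p), n in matrix.items() if p == label and g != label)
    # Row total minus the diagonal: `label` items tagged as something else.
    fn = sum(n for (g, p), n in matrix.items() if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

For instance, with gold labels `["county", "county", "stateProvince"]` and predicted labels `["county", "stateProvince", "stateProvince"]`, the county row holds one correct item and one item under the stateProvince column, so county has precision 1.0 and recall 0.5, while stateProvince has precision 0.5 and recall 1.0.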
 == Back to the [[2013 AOCR Hackathon Wiki| Hackathon Wiki]] ==
+<br/>
