Hackathon FAQ

From iDigBio
Revision as of 17:19, 13 January 2013 by Dpaul (Talk | contribs)


Frequently Asked Questions

Please Add Questions, Answers and Clarifications As Needed

1. (DL) Evaluation Process: What is the format of the parsing evaluation? What is the test? 
Answer: The test is how closely your parsed output matches the human-parsed gold and silver standard CSV files.
2. (DL) Do we each generate output at the hackathon? Bring completed data with us? 
Answer: Yes, bring results. It will also be possible to generate new parsed output from existing OCR and rerun the evaluation software while at the hackathon, and to run partial sets of images back through OCR software and parse them again. Given the 2-day agenda, it's probably not feasible to run OCR and output algorithms on all 10,000 images in a dataset at the hackathon.
3. (DL) What if I parse and refine the data further than required, even parsing out more fields than set in the parameters? 
Answer: Any extra columns in the CSV files output by participants (not in the current specified set) are okay and don't affect metrics.
4. (DL) Hybrids: How should we treat hybrids? Omit? List both names in the scientificName field? 
Answer: For the scope of this hackathon, the goal is getting any taxon name from the OCR output into the CSV file in the field aocr:verbatimScientificName. This could include the author. Concentrate on capturing what's on the label. No further parsing is required, but individuals wanting to go further may certainly do so. There are inherent challenges here [more on this later] that require taxonomic intelligence in software, beyond this hackathon's scope.
5. (DL) Probable Corrections: When information is not certain (i.e. recordedBy, scientificName), is it better to guess or to omit? 
Answer: aocr:verbatimScientificName preserves the original text as captured from the OCR output; the field dwc:scientificName can contain corrections. Corrections are outside the scope of what we should focus on for evaluation in the hackathon, but they are very good topics for discussion there. The gold-standard CSV (human-parsed) will retain EXACTLY the characters as seen on the label (as best as a human can read the text), including any typos on the label. For the silver standard you can attempt corrections.
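To illustrate the verbatim/corrected split (the name and misspelling below are invented for the example), a single output row can carry both fields, with the label's typo preserved verbatim and the correction kept separate:

```python
import csv, io

# Hypothetical record: the label's misspelling stays verbatim in the
# aocr field, while the corrected spelling goes in dwc:scientificName.
row = {
    "aocr:verbatimScientificName": "Quercus vellutina Lam.",  # as seen on label
    "dwc:scientificName": "Quercus velutina Lam.",            # corrected
}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buf.getvalue()
```

Keeping both columns means the gold-standard comparison (which expects the label's exact characters) and any correction experiments can coexist in one file.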
6. (DL) Minimum Threshold of Results: What is the minimum data required for parsing results to be output? All the Priority 1 fields? I'm just suggesting that lack of a file might be better than an empty or inadequate file. What are the criteria? 
Answer: From a confusion-matrix perspective, bad answers (garbled output) are worse than no answers (blanks). If the parsed output would otherwise be garbled, a CSV file with a mostly blank data line, where perhaps only the barcode was readable, is a good strategy for getting a better score.
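A minimal sketch of that fallback strategy (the field names and barcode value are assumed for illustration, not taken from the hackathon spec):

```python
import csv, io

# Assumed subset of output fields for the example.
FIELDS = ["barcode", "aocr:verbatimScientificName", "recordedBy"]

def write_fallback_row(out, barcode):
    """When parsing fails, emit a row with only the barcode filled in:
    blank fields score better than garbled text in the confusion matrix."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({"barcode": barcode})  # unreadable fields stay empty

buf = io.StringIO()
write_fallback_row(buf, "FSU000123456")  # hypothetical barcode
```

Emitting the blank row (rather than no file at all) keeps the record countable in the evaluation while avoiding penalties for garbled field values.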
7. (DP) Hackathon Wiki: Where do I find out more about the overall hackathon? 
Answer: Go to the 2013 AOCR Hackathon Wiki pages.

Back to the 2013 AOCR Hackathon Wiki