Difference between revisions of "Hackathon FAQ"

From iDigBio

Revision as of 01:29, 11 January 2013

Frequently Asked Questions

Questions and Answers


1. Evaluation Process: What is the format of the parsing evaluation? What is the test? The test is how closely your output matches the human-parsed gold and silver standard CSV files.

2. Do we each generate output at the hackathon, or bring completed data with us? Bring completed results with you. It will still be possible to generate new parsed output from existing OCR and rerun the evaluation software while at the hackathon. It is also possible to run partial sets of images back through OCR software and parse them again. Given the 2-day agenda, it is probably not feasible to run the OCR and output algorithms on all 10,000 images in a dataset at the hackathon.

3. What if I parse and refine the data further than required, even parsing out more fields than set in the parameters? Any extra columns in the CSV files output by participants (those not in the currently specified set) are okay and do not affect the metrics.

4. Hybrids: How should we treat hybrids? Omit them? List both names in the scientificName field? For the scope of this hackathon, the goal is to get any taxon name from the OCR output into the aocr:verbatimScientificName field of the CSV file. This could include the author. Concentrate on capturing what is on the label. No further parsing is required, but individuals wanting to go further may certainly do so. There are inherent challenges here [more on this later] that require software taxonomic intelligence beyond the scope of this hackathon.

5. Probable Corrections: When information is not certain (e.g. recordedBy, scientificName), is it better to guess or to omit? The field aocr:verbatimScientificName preserves the original text as captured from the OCR output. The field dwc:scientificName can contain corrections. Corrections are outside the scope of what we should focus on for evaluation in the hackathon, but they are very good topics for discussion at the hackathon. The gold CSV standard (human parsed) will retain EXACTLY the characters as seen on the label (as best as a human can read the text), including any typos on the label. For the silver standard, you can attempt to correct.
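
To make the points above about matching the gold standard, extra columns, and retained typos more concrete, here is a minimal sketch in Python. It is not the hackathon's evaluation software, and the actual metrics may differ. It assumes exact, character-for-character comparison per field, and the `barcode` key column, the `extraNote` column, the helper name `field_accuracy`, and all sample values are hypothetical, made up for illustration only.

```python
import csv
import io


def field_accuracy(gold_rows, parsed_rows, key, fields):
    """Return, per field, the fraction of gold records matched exactly.

    Only the columns named in `fields` are compared, so any extra
    columns in the participant output are simply ignored.
    """
    parsed_by_key = {row[key]: row for row in parsed_rows}
    scores = {}
    for field in fields:
        matches = 0
        for gold in gold_rows:
            parsed = parsed_by_key.get(gold[key], {})
            # Exact character match: a "corrected" typo counts as a miss.
            if parsed.get(field, "") == gold.get(field, ""):
                matches += 1
        scores[field] = matches / len(gold_rows) if gold_rows else 0.0
    return scores


# Hypothetical gold data; the label typo "albba" is kept verbatim.
GOLD_CSV = (
    "barcode,aocr:verbatimScientificName,recordedBy\n"
    "A1,Quercus albba L.,J. Smith\n"
    "A2,Acer rubrum,M. Jones\n"
)

# Hypothetical participant output with one extra column and one
# silently corrected typo.
PARSED_CSV = (
    "barcode,aocr:verbatimScientificName,recordedBy,extraNote\n"
    "A1,Quercus alba L.,J. Smith,corrected typo\n"
    "A2,Acer rubrum,M. Jones,\n"
)

gold_rows = list(csv.DictReader(io.StringIO(GOLD_CSV)))
parsed_rows = list(csv.DictReader(io.StringIO(PARSED_CSV)))

scores = field_accuracy(
    gold_rows,
    parsed_rows,
    key="barcode",
    fields=["aocr:verbatimScientificName", "recordedBy"],
)
print(scores)
```

In this sketch the extra `extraNote` column has no effect on the scores, while the corrected spelling in record A1 no longer matches the gold value, which is why corrections belong in dwc:scientificName rather than in aocr:verbatimScientificName.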