2013 hackathon data elements

From iDigBio
Target Data Elements

    Primary scoring for critical items
        dwc:catalogNumber
        dwc:recordedBy
        dwc:recordNumber
        dwc:verbatimEventDate
        aocr:verbatimScientificName
    Secondary scoring for other key items
        aocr:verbatimInstitution
        dwc:datasetName
        dwc:verbatimLocality
        dwc:country
        dwc:stateProvince
        dwc:county
        dwc:verbatimLatitude
        dwc:verbatimLongitude
        dwc:verbatimElevation
    Lastly, scoring for optional items
        dwc:eventDate
        dwc:scientificName
        dwc:decimalLatitude
        dwc:decimalLongitude
        dwc:fieldNotes
        dwc:sex
        dwc:dateIdentified
        dwc:identifiedBy
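
To make the target concrete, here is one hypothetical parsed label keyed by a few of the terms above; every value is invented for illustration and no particular CSV layout is required.

<syntaxhighlight lang="python">
# Hypothetical example only: one parsed label as a dict keyed by the terms above.
# Every value here is invented for illustration.
record = {
    "dwc:catalogNumber": "00123456",
    "dwc:recordedBy": "J. Q. Botanist",
    "dwc:recordNumber": "4521",
    "dwc:verbatimEventDate": "12 May 1987",
    "aocr:verbatimScientificName": "Carex pensylvanica Lam.",
    "dwc:country": "United States",
    "dwc:stateProvince": "New York",
    "dwc:county": "Tompkins",
}

# One straightforward CSV layout: one column per term, one row per label.
print(",".join(record))           # header row (dict preserves insertion order)
print(",".join(record.values()))  # data row
</syntaxhighlight>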


Evaluation
Given the discussion from the broader community, we may yet change our minds about what belongs in the categories above. For now, the fields above should be seen as the ones of general interest, and we can be flexible and discuss our evaluation strategy further with regard to the primary / secondary / last groupings. Participants may decide what is more important. Clearly, who is using the data, and for what purpose, drives which fields are seen as having greater value. If you are trying to find duplicate voucher records, the "who" is very important. If you are an ecologist looking for evidence of an organism in nature, you are more interested in the "where" fields and less interested in who collected the physical object as a voucher.
 
We will attempt to provide services that can validate the outcomes of hackathon deliverables. This hackathon is not structured as a competition, but we felt it would be beneficial for participants to have some baseline for evaluating the effectiveness of their methods.
 
OCR Text Evaluation
 
Evaluation of OCR output will be based on a comparison to gold hand-typed outputs, using confusion-matrix-like criteria for evaluating word presence, word correctness, and avoidance of non-text garbage regions. We will attempt to avoid penalizing attempts at text recognition in barcode and handwritten regions.
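
As a rough illustration of this kind of word-level scoring (not the evaluation service we will provide), the sketch below compares OCR output to a gold transcription as bags of words and reports precision, recall, and F-score; the file names and the bag-of-words simplification are assumptions.

<syntaxhighlight lang="python">
# A rough word-level scorer for illustration only; not the official service.
# File names below are hypothetical placeholders.
from collections import Counter

def word_scores(ocr_text, gold_text):
    """Bag-of-words precision, recall, and F-score of OCR output vs. gold text."""
    ocr_words = Counter(ocr_text.lower().split())
    gold_words = Counter(gold_text.lower().split())
    tp = sum((ocr_words & gold_words).values())   # words the OCR got right
    fp = sum(ocr_words.values()) - tp             # OCR words not in the gold text
    fn = sum(gold_words.values()) - tp            # gold words the OCR missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

with open("ocr_output.txt") as ocr, open("gold_transcription.txt") as gold:
    p, r, f = word_scores(ocr.read(), gold.read())
    print(f"precision={p:.3f}  recall={r:.3f}  f-score={f:.3f}")
</syntaxhighlight>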
 
Parsed Field Evaluation
 
Evaluation of the effectiveness of parsing will be calculated based on a confusion matrix. Rows are named with each of the possible element names for parts of a label, and columns use the same names. Counts along the diagonal represent the number of items that were tagged correctly. For example, a value that is correctly labeled as a county adds one to the diagonal; if a county is incorrectly marked as a stateProvince, a 1 is added to the "county" row under the stateProvince column. This format therefore provides a count of correct classifications along with counts of false positives and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
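
For concreteness, here is a minimal sketch of that bookkeeping, assuming the scorer receives (gold field, predicted field) pairs for each parsed value. It is illustrative only, not the scoring code we will supply.

<syntaxhighlight lang="python">
# Illustrative only: per-field precision/recall/F-score from (gold, predicted) pairs.
from collections import defaultdict

def evaluate(pairs):
    """pairs: iterable of (gold_field, predicted_field), one per parsed value."""
    pairs = list(pairs)
    confusion = defaultdict(lambda: defaultdict(int))
    for gold, predicted in pairs:
        confusion[gold][predicted] += 1   # off-diagonal cells are misclassifications
    fields = sorted({g for g, _ in pairs} | {p for _, p in pairs})
    for field in fields:
        tp = confusion[field][field]
        fn = sum(confusion[field].values()) - tp                # this field labeled as something else
        fp = sum(confusion[g][field] for g in fields) - tp      # other fields labeled as this one
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        print(f"{field}: precision={precision:.2f} recall={recall:.2f} f-score={f_score:.2f}")

# Hypothetical counts: one county parsed correctly, one mistaken for a stateProvince.
evaluate([("dwc:county", "dwc:county"),
          ("dwc:county", "dwc:stateProvince"),
          ("dwc:stateProvince", "dwc:stateProvince")])
</syntaxhighlight>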
 


Note that extra credit will be figured in for those who manage to get their data from CSV to XML format. Extra credit may also be given to those who manage to order their CSV columns according to the order in which the fields appear in the image.
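
For those attempting the CSV-to-XML extra credit, the sketch below shows one straightforward conversion using Python's standard library. The file names, the <records>/<record> element names, and the assumption that the CSV column headers are the terms listed above are all illustrative, not a required schema.

<syntaxhighlight lang="python">
# Illustrative only: convert a parsed-label CSV to simple XML with the standard library.
# File names and element names are hypothetical; columns are assumed to be the terms above.
import csv
import xml.etree.ElementTree as ET

root = ET.Element("records")
with open("parsed_labels.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            if value:  # skip empty cells
                # "dwc:catalogNumber" -> element name "catalogNumber"
                ET.SubElement(record, field.split(":")[-1]).text = value

ET.ElementTree(root).write("parsed_labels.xml", encoding="utf-8", xml_declaration=True)
</syntaxhighlight>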


Back to [https://www.idigbio.org/wiki/index.php/2013_AOCR_Hackathon_Wiki 2013 Hackathon Wiki]
