Difference between revisions of "2013 hackathon data elements"

From iDigBio
Revision as of 18:07, 10 January 2013

Target Data Elements

   Primary scoring for critical items
       dwc:catalogNumber
       dwc:recordedBy
       dwc:recordNumber
       dwc:verbatimEventDate
       aocr:verbatimScientificName
   Secondary scoring for other key items
       aocr:verbatimInstitution
       dwc:datasetName
       dwc:verbatimLocality
       dwc:country
       dwc:stateProvince
       dwc:county
       dwc:verbatimLatitude
       dwc:verbatimLongitude
   Lastly, scoring for optional items
       dwc:eventDate
       dwc:scientificName
       dwc:decimalLatitude
       dwc:decimalLongitude
       dwc:fieldNotes
       dwc:sex
       dwc:dateIdentified
       dwc:identifiedBy

Parsed Field Evaluation

The effectiveness of parsing will be evaluated using a confusion matrix. Each row is named for one of the possible element names for parts of a label, and the columns carry the same names. Counts along the diagonal represent the number of items that were tagged correctly. For example, a county that is correctly labeled as a county adds one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides counts of correct classifications, false positives, and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
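
As an illustration only, the scoring described above could be sketched in Python as follows. The function names and the sample label pairs are hypothetical and are not part of any official hackathon tooling. The F-score shown is the balanced F1, which is an assumption, since the page does not say which variant will be used.

```python
from collections import Counter

def build_confusion_matrix(pairs):
    """Count (true element, tagged element) pairs.

    The first item of each key is the row (true name) and the second
    is the column (name assigned by the parser).
    """
    return Counter(pairs)

def precision_recall_f1(matrix, label):
    """Return (precision, recall, F1) for a single element name."""
    # Diagonal cell: items of this element that were tagged correctly.
    tp = matrix[(label, label)]
    # Same column, other rows: other elements wrongly tagged as `label`.
    fp = sum(n for (true, tagged), n in matrix.items()
             if tagged == label and true != label)
    # Same row, other columns: `label` items tagged as something else.
    fn = sum(n for (true, tagged), n in matrix.items()
             if true == label and tagged != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical sample: three counties (one mis-tagged as stateProvince),
# one stateProvince, and one country.
pairs = [
    ("dwc:county", "dwc:county"),
    ("dwc:county", "dwc:county"),
    ("dwc:county", "dwc:stateProvince"),
    ("dwc:stateProvince", "dwc:stateProvince"),
    ("dwc:country", "dwc:country"),
]
matrix = build_confusion_matrix(pairs)

for name in ("dwc:county", "dwc:stateProvince", "dwc:country"):
    p, r, f = precision_recall_f1(matrix, name)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this sample the mis-tagged county counts as a false negative for dwc:county and as a false positive for dwc:stateProvince, so a single error lowers the recall of one element and the precision of another.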

Given the discussion in the broader community, we may change our minds about which fields belong in the categories above. For now, the fields above should be seen as the ones of general interest; we can be flexible and discuss our evaluation strategy further with regard to the primary / secondary / last groupings.

Note that extra credit will be given to those who manage to convert their data from CSV to XML format. Extra credit may also be given to those who order their CSV columns according to the order in which the fields appear in the image.