Image Selection and Processing Protocols

From iDigBio

Revision as of 14:54, 14 June 2013

Decisions about Image Sets and How to Parse and Process the Data

OCR Images

  • Batch of 10,000 images per set.
  • 200 images selected from each set of 10,000 to serve as the Gold and Silver standards
  • Three distinct groups:
  • Herbarium labels (full sheets)
    • NYBG will supply 5000 and select 100 gold
      • The selection criteria for the 5000 images were:
      • The selection criteria for the 100 Gold were:
    • BRIT will supply 5000 and select 100 gold
      • The selection criteria for the 5000 images were:
      • The selection criteria for the 100 Gold were:
  • Packet labels
    • CNALH (Ed Gilbert) will supply 10,000 lichen images and select 200 gold
      • The selection criteria for the 10,000 images were:
      • The selection criteria for the 200 Gold were:
  • CalBug provided 523 Entomology labels
    • There were 523 initial images
      • The selection criteria for the 523 images were:
      • The selection criteria for the 199 Gold were:
  • Primarily typed labels should be the target. Some handwriting mixed in with the text is OK, and even preferable for a small portion of images. These images will produce “noise” that is more realistic for our situation.
  • Images should be JPGs. If you have TIFFs or another format, you can make those images available within another folder.
  • Compression: none to minor (as lossless as possible)
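
Before a set is handed over, the format rule above (JPGs in the main folder, TIFFs or other formats in another folder) can be checked with a short script. The following is only a minimal sketch: the function name split_by_format and the choice to accept both .jpg and .jpeg are assumptions, not part of the protocol.

```python
from pathlib import Path

# Assumption: both spellings of the extension count as JPG, in any letter case.
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def split_by_format(folder):
    """Return (jpegs, others): sorted file names grouped by extension."""
    jpegs, others = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip sub-folders, e.g. one holding the TIFF originals
        if path.suffix.lower() in JPEG_SUFFIXES:
            jpegs.append(path.name)
        else:
            others.append(path.name)
    return jpegs, others
```

Any names returned in the second list are candidates for moving to the separate folder; the length of the first list can be compared against the expected batch size.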

Processing for gold and silver images

  • 200 Hand Typed Transcriptions (Gold)
    • Transcribe the label text as closely as possible to what is on the label
    • Save each transcription in a separate text file (.txt) with a file name matching the image file name
    • Try to preserve the vertical and horizontal order of the text
    • If there is a large gap between text on a single line (e.g. left and right justification), add a single tab to represent the gap
    • Do not fix misspellings; type the words exactly as they appear on the label
    • Include transcription of all handwriting
    • Transcribe all text and handwriting on full sheets (e.g. accession number stamps, annotation slips, etc.)
  • 200 Hand Parsed Labels from Hand Transcription (Gold)
    • Perfectly parsed examples of the labels
    • Generated from hand typed transcriptions
    • Darwin Core terms to be used
    • CSV format (XML can be generated from this)
      • A .csv template has been supplied with examples. Not all Darwin Core terms are included in the template, just the more commonly used terms. If needed, add others from: http://rs.tdwg.org/dwc/terms/index.htm
      • CAUTION: If you use Excel to enter data, be aware that it auto-formats dates and converts the verbatim text to an Excel date, which is not wanted. If you type an apostrophe before a date (e.g. '5 Aug. 1990), Excel keeps the field as text exactly as typed and does not store the apostrophe in the value. If you reopen the CSV file in Excel, do not resave it unless you retype the dates with the apostrophe prefix.
    • The CSV file name should match the image file name, but with a .csv extension
  • 200 Additional Hand Parsed labels from Raw OCR Output (Silver)
    • Generated from raw Tesseract (?) OCR output of the same images used for the gold
    • Text should not be corrected
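
The file naming rules above (a .txt and a .csv whose names match each gold image) can be checked mechanically, and the parsed CSV can be written from a script so that Excel never touches the verbatim dates. The following is a minimal sketch only. The function names are hypothetical, and the layout of one header row of Darwin Core terms plus one data row per label is an assumption, since the supplied template is not reproduced here.

```python
import csv
from pathlib import Path

# Assumption: gold images are JPGs, with either spelling of the extension.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def missing_companions(folder, extensions=(".txt", ".csv")):
    """Map each image file name to the companion files that do not exist yet."""
    folder = Path(folder)
    missing = {}
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        absent = [image.stem + ext for ext in extensions
                  if not (folder / (image.stem + ext)).exists()]
        if absent:
            missing[image.name] = absent
    return missing


def write_parsed_label(csv_path, record):
    """Write one parsed label: a header row of terms, then one data row.

    Values are written exactly as given, so a verbatim date such as
    "5 Aug. 1990" stays plain text in the file.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record))
        writer.writeheader()
        writer.writerow(record)
```

For example, write_parsed_label("img001.csv", {"catalogNumber": "12345", "verbatimEventDate": "5 Aug. 1990"}) writes the date as typed; the file name and the catalog number here are made up for illustration. Opening and resaving that file in Excel can still reformat the date, so the caution above continues to apply.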
