Image Selection and Processing Protocols

From iDigBio
Revision as of 14:28, 11 January 2013 by Dpaul (Talk | contribs)

Decisions about Image Sets and How to Parse and Process the Data

OCR Images

  • Batch of 10,000 images per set.
  • 200 will be selected from the 10,000 to serve as the Gold and Silver standards
  • Three distinct groups
    • Herbarium labels (full sheets)
      • NYBG will supply 5,000 and select 100 gold
      • BRIT will supply 5,000 and select 100 gold
    • Packet labels
      • CNALH (Ed Gilbert) will supply 10,000 lichen images and select 200 gold
    • Entomology labels
  • Primarily typed labels should be the target. Some handwriting mixed in with the text is OK, and even preferable for a small portion of images. These images will produce “noise” that is more realistic for our situation.
  • Images should be JPGs. If you have TIFFs or another format, you can make those images available within another folder.
  • Compression: none to minor (as lossless as possible)
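
The format rule above (JPGs in the main set, other formats in a separate folder) can be checked before a batch is shared. The following is a minimal sketch, not part of the protocol: the function name and the choice to treat both .jpg and .jpeg as JPG are assumptions made for illustration.

```python
from pathlib import PurePath

# Assumption: both spellings of the extension count as JPG, in any letter case.
JPG_EXTENSIONS = {".jpg", ".jpeg"}

def split_by_format(file_names):
    """Split file names into (JPG files, files in any other format)."""
    jpgs, others = [], []
    for name in file_names:
        if PurePath(name).suffix.lower() in JPG_EXTENSIONS:
            jpgs.append(name)
        else:
            # TIFFs and other formats would go into the separate folder.
            others.append(name)
    return jpgs, others
```

Files returned in the second list are the ones that would be moved to the additional folder.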

Processing for gold and silver images

  • 200 Hand Typed Transcriptions (Gold)
    • Transcription of the label text, as close as possible to what is on the label
    • Saved in separate text files (.txt) with file name matching image file name
    • Try to preserve vertical and horizontal order of text
    • If there is a large gap between text on a single line (e.g. left and right justification), add a single tab to represent the gap
    • Do not fix misspellings; type the words exactly as they appear on the label
    • Include transcription of all handwriting
    • Transcribe all text and handwriting on full sheets (e.g. accession number stamps, annotation slips, etc.)
  • 200 Hand Parsed Labels from Hand Transcription (Gold)
    • Perfectly parsed examples of the labels
    • Generated from hand typed transcriptions
    • Darwin Core terms to be used
    • CSV format (XML can be generated from this)
      • A .csv template has been supplied with examples. Not all Darwin Core terms are included in the template, just the more commonly used terms. If needed, add others from: http://rs.tdwg.org/dwc/terms/index.htm
      • CAUTION: If you use Excel to enter data, be aware that it auto-formats dates, converting the verbatim text to an Excel date, which is not wanted. To prevent this, type an apostrophe before each date (e.g. '5 Aug. 1990). The apostrophe itself is removed, and it tells Excel to leave the field as text, exactly as it was typed. If you reopen the .csv file in Excel, do not resave it unless you retype the dates with the apostrophe prefix.
    • The CSV file name should match the image file name, but with a .csv extension
  • 200 Additional Hand Parsed labels from Raw OCR Output (Silver)
    • Generated from raw Tesseract (?) OCR output of the same images used for the gold
    • Text should not be corrected
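
The file naming and verbatim-date points above can also be handled without a spreadsheet. The following is a minimal sketch using Python's standard csv module, which writes values exactly as given and so avoids the date auto-formatting described in the caution. The function names, the example file name, and the short list of Darwin Core terms are assumptions for illustration; the supplied template defines the actual columns.

```python
import csv
from pathlib import Path

# Illustrative subset of Darwin Core terms; the supplied template may differ.
FIELDS = ["catalogNumber", "scientificName", "recordedBy",
          "verbatimEventDate", "locality"]

def companion_names(image_name):
    """Return the .txt and .csv names that share the image's base name."""
    stem = Path(image_name).stem
    return stem + ".txt", stem + ".csv"

def write_parsed_label(image_name, record, out_dir):
    """Write one parsed label to <image base name>.csv, values as typed."""
    _, csv_name = companion_names(image_name)
    out_path = Path(out_dir) / csv_name
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        # Terms missing from the record are written as empty fields.
        writer.writerow(record)
    return out_path

def read_parsed_label(csv_path):
    """Read the single data row back as a dict of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.DictReader(handle))
```

For a hypothetical image example_0001.jpg, `write_parsed_label("example_0001.jpg", {"verbatimEventDate": "5 Aug. 1990"}, "gold")` would write gold/example_0001.csv (the gold folder must already exist), and reading the file back returns the date string unchanged.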
