Image Selection and Processing Protocols

From iDigBio

Decisions about Image Sets and How to Parse and Process the Data

OCR Images

  • Batch of 10,000 images per set.
  • 200 selected from the 10,000 that will serve as the Gold and Silver standards
  • Three distinct groups
  • Herbarium labels (full sheets)
    • NYBG will supply 5000 and select 100 gold
      • The selection criteria for the 5000 images were:
      • The selection criteria for the 100 Gold were:
    • BRIT will supply 5000 and select 100 gold
      • The selection criteria for the 5000 images were:
      • The selection criteria for the 100 Gold were:
  • Packet labels
    • CNALH (Ed Gilbert) will supply 10,000 lichen images and select 200 gold
      • The selection criteria for the 10,000 images were:
      • The selection criteria for the 200 Gold were:
  • CalBug provided 523 entomology labels
    • There were 523 initial images
      • The selection criteria for the 523 images were:
      • The selection criteria for the 199 Gold were:
  • Primarily typed labels should be the target. Some handwriting mixed in with the text is OK, and even preferable for a small portion of images. These images will produce “noise” that is more realistic for our situation.
  • Images should be JPGs. If you have TIFFs or another format, you can make those images available within another folder.
  • Compression: none to minor (as lossless as possible)

Processing for gold and silver images

  • 200 Hand Typed Transcriptions (Gold)
    • Transcription of the label text as close as possible to what is on the label
    • Saved in separate text files (.txt) with file name matching image file name
    • Try to preserve vertical and horizontal order of text
    • If there is a large gap between text on a single line (e.g. left and right justification), add a single tab to represent the gap
    • Do not fix misspellings; type the words exactly as they appear on the label
    • Include transcription of all handwriting
    • Transcribe all text and handwriting on full sheets (e.g. accession number stamps, annotation slips, etc.)
  • 200 Hand Parsed Labels from Hand Transcription (Gold)
    • Perfectly parsed examples of the labels
    • Generated from hand typed transcriptions
    • Darwin Core terms to be used
    • CSV format (XML can be generated from this)
      • A .csv template has been supplied with examples. Not all Darwin Core terms are included in the template, just the more commonly used terms. If needed, add from: http://rs.tdwg.org/dwc/terms/index.htm
      • CAUTION: If you use Excel to enter data, be aware that it auto-formats dates, converting the verbatim text to an Excel date, which is not wanted. If you type an apostrophe before a date (e.g. '5 Aug. 1990), Excel leaves the field as text exactly as typed and does not save the apostrophe. If you reopen the CSV file, do not resave it unless you first retype the dates with the apostrophe prefix.
    • CSV file name should match the image file name, but with a .csv extension
  • 200 Additional Hand Parsed labels from Raw OCR Output (Silver)
    • Generated from raw Tesseract (?) OCR output of the same images used for the gold
    • Text should not be corrected
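
The file-naming rules above (a .txt transcription and a .csv parse for each image, sharing the image's base name) can be checked with a short script. The following is only a minimal sketch in Python: the column list FIELDS is a hypothetical subset and not the supplied template, the function names are made up for illustration, and it assumes the images use a lowercase .jpg extension. Writing the CSV from a script instead of Excel also keeps verbatim dates as plain text.

```python
import csv
from pathlib import Path

# Hypothetical subset of Darwin Core columns; the supplied template defines the real set.
FIELDS = ["catalogNumber", "verbatimEventDate", "eventDate", "habitat"]


def companion_paths(image_path):
    """Return the .txt and .csv paths that share the image's base name."""
    image_path = Path(image_path)
    return image_path.with_suffix(".txt"), image_path.with_suffix(".csv")


def write_parsed_label(image_path, record):
    """Write one parsed label next to its image, keeping every value as plain text."""
    _, csv_path = companion_paths(image_path)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow(record)  # columns missing from the record are left empty
    return csv_path


def missing_companions(folder):
    """List the .txt and .csv files that are missing for the .jpg images in a folder."""
    missing = []
    for image_path in sorted(Path(folder).glob("*.jpg")):
        for companion in companion_paths(image_path):
            if not companion.exists():
                missing.append(companion.name)
    return missing
```

Running missing_companions on the folder holding the gold images gives the list of transcriptions and parses that still need to be produced.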

Specific Standard Parsing Decisions

verbatimCoordinates
Do not include the words latitude and longitude, just the values with a space between them (do not add a comma).
verbatimEventDate
Enter the date exactly as it appears on the label.
eventDate and dateIdentified
Use yyyy-mm-dd format. Use yyyy if only the year is given, use yyyy-mm if only the month and year are given (e.g. Feb. 1990), ...
host and habitat
For our purposes, we are collecting only the habitat field. If host data is present on the label, parse it into the habitat field as well. Do not add a comma between the host and habitat information, just a space, and keep the host and habitat information in the same order as they appear on the label.
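
The parsing decisions above can be illustrated with a few small helpers. This is a sketch under assumptions and not part of the protocol: the function names are hypothetical, the coordinate helper only strips the full words "latitude" and "longitude" (not abbreviations) and only treats a comma followed by whitespace as a separator, and the date helper expects the year, month, and day to have already been read off the label as integers.

```python
import re
from datetime import date


def clean_verbatim_coordinates(text):
    """Drop the words latitude/longitude and separator commas; join values with one space."""
    text = re.sub(r"\b(latitude|longitude)\b[:.]?", " ", text, flags=re.IGNORECASE)
    text = re.sub(r",\s+", " ", text)  # a comma inside a number (e.g. 34,5) is left alone
    return " ".join(text.split())


def format_event_date(year, month=None, day=None):
    """Format a possibly partial date as yyyy, yyyy-mm, or yyyy-mm-dd."""
    if month is None:
        if day is not None:
            raise ValueError("a day without a month cannot be formatted")
        return f"{year:04d}"
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if day is None:
        return f"{year:04d}-{month:02d}"
    # datetime.date rejects impossible days such as 30 February
    return date(year, month, day).isoformat()


def merge_host_and_habitat(parts):
    """Join host and habitat fragments in label order, separated by single spaces."""
    return " ".join(part.strip() for part in parts if part.strip())
```

For example, format_event_date(1990, 2) covers the "Feb. 1990" case, and merge_host_and_habitat takes the fragments in the order they appear on the label so that the order is preserved in the habitat field.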


Back to the 2013 AOCR Hackathon Wiki