Fixed Dataset Errata

From iDigBio

Notes on Errata Fixed


Errors noted below are fixed


D. Lafferty Label NY01075759_lg.txt has authority (part of verbatimScientificName) as: "Kocourková & F. Berger". Gold Parsed NY01075759_lg.csv has "Kocourkova & F. Berger", without the accent on the "a". (Or should we convert foreign characters to English characters???)
(Bryan: All "special characters" should be preserved by using UTF-8.)
(Ed: Accented "á" fixed)
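
A check along the following lines could flag parsed values that lost their accents relative to the label text. This is only a sketch: the function names `strip_accents` and `lost_accents` are hypothetical and are not part of the hackathon tooling.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining marks removed (e.g. 'á' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def lost_accents(label_text: str, parsed_text: str) -> bool:
    """True when parsed_text equals label_text except for stripped accents."""
    label_nfc = unicodedata.normalize("NFC", label_text)
    parsed_nfc = unicodedata.normalize("NFC", parsed_text)
    return label_nfc != parsed_nfc and strip_accents(label_nfc) == parsed_nfc

# The NY01075759_lg case described above:
print(lost_accents("Kocourková & F. Berger", "Kocourkova & F. Berger"))  # True
```

Both inputs are normalized to NFC first so that a precomposed "á" and an "a" followed by a combining accent are not reported as a difference.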

Gold label NY01075763_lg.txt has Pyrenidium actinellurn, should be Pyrenidium actinellum. Gold Parsed copies the error verbatim (as it should) and needs to be corrected if the .txt file is corrected.
/home/aocr/datasets/lichens/gold/outputs/human/NY01075763_lg.txt fixed --Dpaul 17:28, 26 February 2013 (EST)
/home/aocr/datasets/lichens/gold/parsed/human/NY01075763_lg.csv fixed --Dpaul 17:28, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/ocr/NY01075763_lg.txt fixed --Dpaul 16:33, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075763_lg.csv fixed --Dpaul 16:33, 27 February 2013 (EST)
datasets/lichens/gold/ocr/WIS-L-0012040_lg.txt: Longitude recorded as l49, with a lowercase letter "l" (written here as L49, capitalized for clarity), instead of 149.
/webroot/datasets/lichens/gold/ocr/WIS-L-0012040_lg.txt fixed --Dpaul 16:39, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/WIS-L-0012040_lg.csv fixed --Dpaul 16:39, 27 February 2013 (EST)

Unicode Reserved character (single quote)

The following files use Unicode Character 'PRIVATE USE TWO' (U+0092) as a single quote mark

  • NY_00617142.txt
  • NY_01334334.txt
/webroot/datasets/herb/gold/ocr/NY_00617142.txt fixed --Dpaul 16:59, 27 February 2013 (EST)
/webroot/datasets/herb/gold/ocr/NY_01334334.txt fixed --Dpaul 16:59, 27 February 2013 (EST)
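
One plausible explanation, not confirmed by the notes above, is that the files were written in Windows-1252, where byte 0x92 is the right single quotation mark, and later decoded as Latin-1, where the same byte becomes U+0092. The sketch below shows that difference and a simple replacement. The helper name `fix_private_use_two` and the choice of an ASCII apostrophe as the replacement are assumptions; the notes do not say which character was substituted.

```python
def fix_private_use_two(text: str) -> str:
    """Replace U+0092 (PRIVATE USE TWO) with an ASCII apostrophe.

    The replacement character is an assumption made for illustration.
    """
    return text.replace("\u0092", "'")

# The same byte decodes differently under the two encodings.
raw = b"Berger\x92s"
print(raw.decode("cp1252") == "Berger\u2019s")   # True
print(raw.decode("latin-1") == "Berger\u0092s")  # True

print(fix_private_use_two(raw.decode("latin-1")))  # Berger's
```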

Right single Quote

The following files contain the Unicode character U+2019, RIGHT SINGLE QUOTATION MARK:

  • datasets/lichens/gold/ocr/NY01075760_lg.txt
  • datasets/lichens/gold/ocr/NY01075761_lg.txt
  • datasets/lichens/gold/ocr/NY01075761_lg.txt
  • datasets/lichens/gold/ocr/NY01075762_lg.txt
  • datasets/lichens/gold/ocr/NY01075764_lg.txt
  • datasets/lichens/gold/ocr/NY01075768_lg.txt
  • datasets/lichens/gold/ocr/NY01075768_lg.txt
  • datasets/lichens/gold/ocr/NY01075770_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075776_lg.txt
  • datasets/lichens/gold/ocr/NY01075777_lg.txt
  • datasets/lichens/gold/ocr/NY01075779_lg.txt
  • datasets/lichens/gold/ocr/NY01075779_lg.txt
  • datasets/lichens/gold/ocr/NY01075781_lg.txt
  • datasets/lichens/gold/ocr/NY01075785_lg.txt
  • datasets/lichens/gold/ocr/NY01075785_lg.txt
  • datasets/lichens/gold/ocr/NY01075786_lg.txt
  • datasets/lichens/gold/ocr/NY01075786_lg.txt
  • datasets/lichens/gold/ocr/NY01075787_lg.txt
  • datasets/lichens/gold/ocr/NY01075787_lg.txt
  • datasets/lichens/gold/ocr/NY01075788_lg.txt
  • datasets/lichens/gold/ocr/NY01075788_lg.txt
  • datasets/lichens/gold/ocr/NY01075789_lg.txt
  • datasets/lichens/gold/ocr/NY01075789_lg.txt
  • datasets/lichens/gold/ocr/NY01075797_lg.txt
  • datasets/lichens/gold/ocr/NY01075798_lg.txt
  • datasets/lichens/gold/ocr/NY01075812_lg.txt
  • datasets/lichens/gold/ocr/NY01075817_lg.txt
  • datasets/lichens/gold/ocr/NY01075818_lg.txt
  • datasets/lichens/gold/ocr/NY01075819_lg.txt
  • datasets/lichens/gold/ocr/NY01075820_lg.txt
  • datasets/lichens/gold/ocr/NY01075821_lg.txt
  • datasets/lichens/gold/ocr/NY01075821_lg.txt
  • datasets/lichens/gold/ocr/NY01075822_lg.txt
  • datasets/lichens/gold/ocr/NY01075828_lg.txt
  • datasets/lichens/gold/ocr/NY01075829_lg.txt
  • datasets/lichens/gold/ocr/NY01075830_lg.txt
  • datasets/lichens/gold/ocr/NY01075831_lg.txt
  • datasets/lichens/gold/ocr/TENN-L-0000059_lg.txt
  • datasets/lichens/gold/ocr/TENN-L-0000073_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011728_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011730_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011736_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012033_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012035_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012039_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012082_lg.txt
/webroot/datasets/lichens/gold/ocr: all of the above files in this directory fixed --Dpaul 17:22, 27 February 2013 (EST)
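
A list like the one above could be produced by scanning each gold OCR file for the characters in question. The following is a minimal sketch; the function name `find_suspect_chars` and the default set of characters are assumptions made for illustration, and this is not the script that was actually used.

```python
import unicodedata

# Characters mentioned on this page; extend the set as needed.
SUSPECT = {"\u0092", "\u2019", "\u201d"}

def find_suspect_chars(text: str, suspects=SUSPECT):
    """Return (line, column, code point, name) for each suspect character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in suspects:
                # Control characters such as U+0092 have no Unicode name.
                name = unicodedata.name(ch, "<unnamed control>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

sample = "(Ach.) Mil\u2019ll. Arg.\nplain line"
for hit in find_suspect_chars(sample):
    print(hit)  # (1, 11, 'U+2019', 'RIGHT SINGLE QUOTATION MARK')
```

To scan a file, read it with an explicit encoding, for example `open(path, encoding="utf-8").read()`, and pass the result to the function. A file appears once per hit, which may be why some file names are listed more than once above.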

Right Double Quote

The following files contain the Unicode character U+201D, RIGHT DOUBLE QUOTATION MARK:

  • datasets/lichens/gold/ocr/WIS-L-0012053_lg.txt
    • fixed --Dpaul 15:51, 27 February 2013 (EST)

Parse file errors

Inconsistency in Gold Parsed decimalLatitude and decimalLongitude in many labels. Both fields are omitted entirely from the NYBG lichens and the Tennessee lichens. Gold Parsed WIS-L-0011728_lg.csv has decimalLatitude and decimalLongitude rounded to 3 decimal digits (e.g. 60.467). WIS-L-0011729_lg.csv has decimalLatitude rounded to 2 decimal digits (60.15) and decimalLongitude rounded to 1 decimal digit (-152.6). These variations are typical of those found throughout the files. It's possible that trailing zeros were just stripped off, but the inconsistency makes it impossible for a parsing program to match all the labels exactly.
Alex will change the metrics to avoid counting off for stripped trailing zeroes. --Dpaul 15:36, 27 February 2013 (EST)
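
A comparison that ignores stripped trailing zeros might look like the sketch below. It is an illustration only, not the actual metrics code, and the function name `coords_match` is hypothetical. Values are compared numerically with `Decimal`, so "60.15" and "60.150" match, while values rounded to different precisions (60.467 versus 60.47) still differ.

```python
from decimal import Decimal, InvalidOperation

def coords_match(gold: str, parsed: str) -> bool:
    """Compare two coordinate strings numerically, ignoring trailing zeros."""
    gold, parsed = gold.strip(), parsed.strip()
    try:
        return Decimal(gold) == Decimal(parsed)
    except InvalidOperation:
        # Empty or non-numeric values fall back to plain string comparison.
        return gold == parsed

print(coords_match("60.15", "60.150"))    # True
print(coords_match("-152.6", "-152.60"))  # True
print(coords_match("60.467", "60.47"))    # False
```
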
Inconsistency in capitalization of verbatim fields in many Gold Parsed lichens. Example: NY01075763_lg.csv. In the label and OCR text the county is in all caps, ST. FRANCOIS, but in NY01075763_lg.csv it is in title case: St. Francois. The state, MISSOURI, is in all caps in both the .txt and the .csv file. The scoring program is case-sensitive, so any difference in case between the gold .csv and the program-generated .csv will be marked wrong.
Alex will change the metrics to be case-insensitive. --Dpaul 17:28, 26 February 2013 (EST)
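
A case-insensitive field comparison can be as small as the following sketch. The function name `fields_match` is hypothetical, and the real scoring program may handle whitespace and other details differently.

```python
def fields_match(gold: str, parsed: str) -> bool:
    """Compare two field values, ignoring case and surrounding whitespace."""
    return gold.strip().casefold() == parsed.strip().casefold()

print(fields_match("ST. FRANCOIS", "St. Francois"))  # True
print(fields_match("MISSOURI", "Mississippi"))       # False
```

`str.casefold()` is used instead of `str.lower()` because it is intended for caseless matching and handles more non-ASCII characters.
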
Gold Parsed NY01075759_lg.csv: verbatimEventDate is 1998-04-19, should be 19 April 1998.
/home/aocr/datasets/lichens/gold/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/home/aocr/datasets/lichens/silver/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075759_lg.csv fixed --Dpaul 17:35, 27 February 2013 (EST)
Gold Parsed NY01075759_lg.csv: eventDate is 4/19/1998, should be 1998-04-19 according to Darwin Core (http://rs.tdwg.org/dwc/terms/#eventDate).
/home/aocr/datasets/lichens/gold/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/home/aocr/datasets/lichens/silver/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075759_lg.csv fixed --Dpaul 17:35, 27 February 2013 (EST)
/webroot/datasets/lichens/silver/parsed/NY01075759_lg.csv okay --Dpaul 17:35, 27 February 2013 (EST)
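
The eventDate correction above amounts to rewriting a month/day/year string as an ISO 8601 date. A minimal sketch follows, assuming the source value is always in US month/day/year order; the function name `us_date_to_iso` is hypothetical.

```python
from datetime import datetime

def us_date_to_iso(value: str) -> str:
    """Convert a month/day/year string such as '4/19/1998' to 'YYYY-MM-DD'."""
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()

print(us_date_to_iso("4/19/1998"))  # 1998-04-19
```

The verbatimEventDate field should not be converted this way, since it is meant to hold the date exactly as written on the label (here, 19 April 1998).
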
Gold Parsed NY01075770_lg.csv omits collector number, but should be 852.
/home/aocr/datasets/lichens/gold/parsed/human/NY01075770_lg.csv fixed --Dpaul 18:18, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075770_lg.csv fixed --Dpaul 17:38, 27 February 2013 (EST)
Gold OCR NY01075786_lg.txt has "(Ach.) Mil'll. Arg.", but on the image label it is "(Ach.) Müll. Arg." This error is carried to the Gold Parsed .csv file (which should be corrected if the .txt file is corrected).
/home/aocr/datasets/lichens/gold/outputs/human/NY01075786_lg.txt fixed --Dpaul 18:28, 26 February 2013 (EST)
/home/aocr/datasets/lichens/gold/parsed/human/NY01075786_lg.csv fixed --Dpaul 18:28, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075786_lg.csv fixed --Dpaul 17:52, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/ocr/NY01075786_lg.txt fixed --Dpaul 17:52, 27 February 2013 (EST)


Label image NY01075760_lg.jpg had a speck of dirt next to "F. Berger", introducing an apostrophe as "Kocourkova & 'F. Berger" in the Gold OCR. Gold Parsed NY01075760_lg.csv corrected "Kocourkova & 'F. Berger" back to "Kocourkova & F. Berger", omitting the apostrophe. This is probably a valid correction, but it does not belong in a verbatim field.
/home/aocr/webroot/datasets/lichens/gold/parsed/NY01075760_lg.csv changed Gold Parsed aocr:verbatimScientificName to include the apostrophe, so that the verbatim field is consistent with the Gold OCR. fixed --Dpaul 16:06, 27 February 2013 (EST)
