Difference between revisions of "Hackathon Challenge"

From iDigBio
 
:::;Data set 3 Entomology OCR output ABBYY text files for parsing: /home/aocr/datasets/ent/outputs/abbyy
  
=== [[Dataset Errata]] ===
 
 
*known / discovered errors in the .txt, .csv files as they are found.

'''Gold Parsing Errors'''

Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl)

This is open to debate, but I think elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but it is worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl)
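
A minimal sketch of the split suggested above, assuming Python and a hypothetical helper named elevation_in_meters: the label text stays in verbatimElevation, and a separate numeric value is derived for minimumElevationInMeters. The feet conversion is an assumption, added only because some labels in this set give altitudes in feet (for example "Alt.: 6000 ft"); it is not an agreed parsing rule.

```python
import re

FEET_TO_METERS = 0.3048

# A number, optionally followed by a unit. A missing unit is treated as meters.
_ELEVATION = re.compile(r"(\d+(?:\.\d+)?)\s*(ft|feet|m|meters|metres)?", re.IGNORECASE)


def elevation_in_meters(verbatim_elevation):
    """Return the elevation as a float in meters, or None if no number is found."""
    match = _ELEVATION.search(verbatim_elevation.replace(",", ""))
    if match is None:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "m").lower()
    if unit in ("ft", "feet"):
        value *= FEET_TO_METERS
    return round(value, 1)


record = {"verbatimElevation": "750 m"}
record["minimumElevationInMeters"] = elevation_in_meters(record["verbatimElevation"])
```

The verbatim field is never modified; only the derived numeric field is computed from it.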

Inconsistency in the Gold Parsed labels for Country. If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA. Some Gold parsed results leave it blank, some fill it in with "USA", or "United States", though neither of these is on the label. I think it is valid to fill it in, but it should be consistent. (Daryl)

Many Gold Parsed Tennessee lichen labels have country errors. Examples:

-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl)

-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl)

<br> Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:

-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format: Verbatim would be: Nov. 12, 1939, DarwinCore would be: 1939-11-12, Listed is: 1939-November-12.

-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963.

-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl)
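
The date examples above could be normalized along these lines. This is only a sketch in Python: the function name to_iso_date and the list of accepted formats are assumptions drawn from the label styles quoted on this page, and month names are assumed to be in English (the default C locale).

```python
from datetime import datetime

# Label date styles seen in the examples above, plus the malformed
# "1939-November-12" style found in some Gold Parsed files.
_FORMATS = (
    "%b. %d, %Y",   # Nov. 12, 1939
    "%B %d, %Y",    # August 1, 1957
    "%d %b %Y",     # 8 Aug 1954
    "%d %b. %Y",    # 3 Feb. 1963
    "%d %B %Y",     # 23 July 1955
    "%Y-%B-%d",     # 1939-November-12
    "%Y-%b-%d",     # 1954-Aug-8
)


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None for unrecognized text leaves the decision about partial dates (for example a month and year only) to the caller rather than guessing a day.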

Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma-separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.)
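
The doubled double quote is standard CSV escaping. As a small illustration using Python's csv module (the helper name csv_round_trip is hypothetical), a field containing a comma and a double quote is wrapped in quotes on output, each inner " is written as "", and a CSV-aware reader restores the original value:

```python
import csv
import io


def csv_round_trip(row):
    """Write one row as CSV text, then parse that text back into a list."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    encoded = buffer.getvalue()
    decoded = next(csv.reader(io.StringIO(encoded)))
    return encoded, decoded


# Coordinates as they appear on the label, including the comma.
label_value = "38°42'20\"N, 83°08'25'W"
encoded, decoded = csv_round_trip(["NY01075760_lg", label_value])
```

Problems should only arise if the file is loaded by splitting on commas instead of using a CSV parser.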

Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates.

Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period.

Gold Parsed NY01075761_lg.txt corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until that is done, the parsing should be verbatim (see below under Gold OCR Errors).

Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file.

Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present on the .txt file.

Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio"

Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality.

Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...?

Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA"

Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label.

Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it happens on some labels but not on others.
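
For the calculated decimalLatitude and decimalLongitude fields, the spacing inside the verbatim value need not matter. The following is a rough Python sketch, not an agreed method: the function name to_decimal_degrees and the accepted pattern (degrees, minutes, optional seconds, one hemisphere letter) are assumptions based on the coordinate styles quoted on this page.

```python
import re

# degrees ° minutes ' [seconds "] hemisphere, with optional spaces anywhere between
_DMS = re.compile(
    r"(\d+)\s*°\s*(\d+(?:\.\d+)?)\s*'\s*(?:(\d+(?:\.\d+)?)\s*\"\s*)?([NSEW])"
)


def to_decimal_degrees(verbatim):
    """Convert one verbatim latitude or longitude to signed decimal degrees."""
    match = _DMS.search(verbatim)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes) / 60 + float(seconds or 0) / 3600
    if hemisphere in "SW":
        value = -value  # south and west are negative
    return round(value, 6)
```

Because the pattern tolerates optional whitespace, 60° 33.579'N and 60°33.579'N give the same decimal value, so the verbatim field can keep the label's spacing untouched.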

Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series.

'''Gold Parsed CSV Files'''

There are more errors in the gold CSV files. (Qianjin)

NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998

NY01075760_lg no datasetName

NY01075765_lg verbatimEventDate (Feb. 1898), it should be verbatimEventDate ( Feb 1898.)

NY01075766_lg decimalLatitude (White Horse Beach, between Manomet Pt. and Rocky Pt., Plymouth area), it should be locality or habitat; no catalogNumber

NY01075767_lg verbatimEventDate format

NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)

NY01075768_lg country (canada), it should be (ca.)

NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)

NY01075770_lg habitat (Host determined by A. R. Grant)

NY01075771_lg verbatimCoordinates mixed with verbatimLocality

NY01075779_lg habitat concatenation

NY01075780_lg NEW YOUR BOTANICAL GARDEN

NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.

NY01075797_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075805_lg stateProvince (South Carolina) in the csv file; but it is (S.C.) in the text file.

NY01075812_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075816_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075817_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075818_lg no scientificName

NY01075819_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075820_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075821_lg scientificName (null)

NY01075821_lg no scientificName

NY01075822_lg no scientificName

NY01075823_lg identifiedBy

TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation

TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).

TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file.

TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.

TENN-L-0000015_lg verbatimInstitution (TENNESSEE (TENN))

TENN-L-0000016_lg verbatimInstitution (HERBARIUM OF THE UNIVERSITY OF TENNESSEE)

TENN-L-0000017_lg verbatimInstitution (University of Tennessee (TENN))

TENN-L-0000018_lg verbatimInstitution (University of Tennessee (TENN))

TENN-L-0000019_lg identifiedBy (Alt.Set.) in the csv file; verbatimEventDate (8 Aug 1954) is mixed with dateIdentified (8 Aug 1954)

TENN-L-0000021_lg verbatimInstitution ((TENN))

TENN-L-0000022_lg verbatimEventDate (23 July 1955) is mixed with dateIdentified (23 July 1955)

TENN-L-0000033_lg no catalogNumber in OCRed text file

TENN-L-0000036_lg verbatimEventDate (format)

TENN-L-0000045_lg recordNumber (null)

TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.

TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)

TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.

TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)

TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.

TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.

TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)

TENN-L-0000063_lg verbatimLocality contains scientific name

TENN-L-0000063_lg verbatimScientificName (Amherst)

TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)

TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp); verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)

TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)

TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40° N) is in text file.

TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.

TENN-L-0000077_lg identifiedBy (Date) in the csv file

TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.

TENN-L-0000083_lg no recordNumber in the csv file; dateIdentified (format)

TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)

TENN-L-0000084_lg scientificName (null)

TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file

TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.

WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.

WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.

WIS-L-0012026_lg no datasetName

WIS-L-0012038_lg no verbatimCoordinates

WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.

WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file

WIS-L-0012045_lg verbatimCoordinates concatenation

WIS-L-0012051_lg dateIdentified (format)

WIS-L-0012055_lg verbatimEventDate (format)

WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is (2003-July-19) in the csv file

WIS-L-0012056_lg dateIdentified (format)

WIS-L-0012057_lg no datasetName

WIS-L-0012064_lg verbatimCoordinates concatenation

WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file

WIS-L-0012074_lg county (null)

WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates (Qianjin)
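
Many of the entries above are cases where a value in the Gold Parsed csv file does not occur literally in the Gold OCR text file. A check of that kind could be automated roughly as follows. This is a sketch only: the function name fields_not_in_text and the choice of which fields to compare are assumptions, and the sample values combine examples from different labels purely for illustration.

```python
def fields_not_in_text(ocr_text, record, fields):
    """Return the names of fields whose non-empty value is absent from the OCR text."""
    missing = []
    for field in fields:
        value = (record.get(field) or "").strip()
        if value and value not in ocr_text:
            missing.append(field)
    return missing


# Illustrative values only, modeled on the recordedBy and stateProvince entries above.
ocr_text = "Collected by William R. Buck. S.C., Oconee Co."
record = {"recordedBy": "William Russell Buck", "stateProvince": "S.C."}
flagged = fields_not_in_text(ocr_text, record, ["recordedBy", "stateProvince"])
```

A plain substring test like this also flags deliberate normalizations (for example "Montana" for "Mont."), so its output is a list of candidates to review, not a list of confirmed errors.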

<br> '''Gold OCR Errors'''

NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.

WIS-L-0012026_lg.txt: Several errors. The "N" in Latitude is replaced with a "K". There is a question mark instead of an apostrophe in Longitude. "Sandra Looman" is replaced with "Sandra Lcoman". There are two dots after the date.

TENN-L-0000029_lg.txt adds a "1" to the scientificName ("Actinogyra muhlenbergii 1 (Ach.) Schol.").

NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (converted umlaut "ü" to "u"). We may want to do this, but if we do, it should be standardized and consistent across all the labels. Same for NY01075791_lg.txt, and several others in the series.
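
If diacritics are to be removed, one consistent way to do it across all labels is Unicode decomposition followed by dropping the combining marks. The sketch below uses Python's standard unicodedata module; the function name fold_diacritics is hypothetical, and whether folding should be applied at all is the open question raised above.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks after decomposition, e.g. 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying the same function to both the OCR text and the parsed fields before comparing them would also keep "Müll" and "Mull" from being reported as a mismatch.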

<br> '''Silver Parsed CSV Files'''

There were some errors in the Silver CSV dataset. (Steven C.)

NY01075760_lg character encoding in verbatimScientificName
typos in verbatimCoordinates

NY01075761_lg misspelling in verbatimScientificName

NY01075762_lg misspelling in habitat
misspelling in verbatimLocality

NY01075764_lg misspelling in units for verbatimElevation

NY01075765_lg character encoding in verbatimScientificName
removed extra period in verbatimEventDate

NY01075768_lg separated verbatimLocality data into two columns

NY01075769_lg misspelling in habitat

NY01075770_lg character encoding in verbatimScientificName
character encoding in habitat

NY01075773_lg misspelling in verbatimScientificName
misspelling in verbatimLocality

NY01075774_lg character encoding in verbatimScientificName

NY01075775_lg misspelling in country

NY01075776_lg character encoding in verbatimLocality

NY01075777_lg character encoding in country

NY01075779_lg character encoding in verbatimCoordinates

NY01075780_lg misspelling in verbatimInstitution
misspelling in verbatimLocality
removed coordinates in verbatimLocality

NY01075781_lg character encoding in verbatimElevation

NY01075782_lg separated verbatimLocality data into two columns
removed coordinates in verbatimLocality
character encoding in habitat

NY01075786_lg misspelling in verbatimScientificName

NY01075787_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality
misspelling in verbatimCoordinates
misspelling in habitat

NY01075788_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality
character encoding in verbatimCoordinates

NY01075789_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality
character encoding in verbatimCoordinates

NY01075790_lg misspelling in habitat
separated verbatimLocality data into three columns
removed coordinates in verbatimLocality

NY01075791_lg character encoding in verbatimScientificName

NY01075792_lg misspelling in verbatimLocality

NY01075794_lg misspelling in verbatimLocality

NY01075795_lg misspelling in verbatimLocality

NY01075802_lg character encoding in verbatimScientificName

NY01075803_lg created new identifiedBy column
created new verbatimScientificName column
moved verbatimScientificName data from third row to new column

NY01075805_lg created new verbatimScientificName column
moved verbatimScientificName data from third row to new column

NY01075806_lg character encoding in verbatimScientificName

NY01075813_lg misspelling in verbatimLocality

NY01075814_lg misspelling in county
misspelling in verbatimLocality
removed coordinates in verbatimLocality
misspelling in habitat

NY01075817_lg moved verbatimScientificName data to scientificName
entered verbatimScientificName

NY01075818_lg misspelling in habitat

NY01075819_lg misspelling in recordedBy

NY01075821_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality
added coordinates to verbatimCoordinates

NY01075822_lg removed coordinates in verbatimLocality

NY01075823_lg moved identifiedBy and dateIdentified data up one row
created new verbatimScientificName column
moved verbatimScientificName data from third row to new column

NY01075827_lg misspelling in county
misspelling in verbatimLocality

NY01075828_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality

NY01075829_lg misspelling in habitat

NY01075831_lg misspelling in verbatimLocality
removed coordinates in verbatimLocality

NY01075837_lg misspelling in county

TENN-L-0000001_lg character encoding in occurrenceRemarks
misspelling in habitat
character encoding in verbatimLocality

TENN-L-0000002_lg character encoding in verbatimScientificName
misspelling in habitat

TENN-L-0000004_lg misspelling in habitat
misspelling in verbatimInstitution

TENN-L-0000005_lg misspelling in datasetName
misspelling in occurrenceRemarks
character encoding in verbatimLocality

TENN-L-0000006_lg misspelling in verbatimElevation
edited verbatimEventDate

TENN-L-0000007_lg separated verbatimLocality into two columns
misspellings in both verbatimLocality columns

TENN-L-0000009_lg character encoding in habitat
character encoding in catalogNumber

TENN-L-0000010_lg separated verbatimLocality into two columns

TENN-L-0000012_lg character encoding in datasetName
character encoding in occurrenceRemarks

TENN-L-0000013_lg misspelling in occurrenceRemarks
misspelling in verbatimLocality

TENN-L-0000014_lg misspelling in datasetName
misspelling in fieldNotes
character encoding in verbatimLocality
separated recordedBy into two columns

TENN-L-0000022_lg character encoding in recordedBy

TENN-L-0000027_lg character encoding in verbatimScientificName

TENN-L-0000028_lg character encoding in verbatimScientificName

TENN-L-0000029_lg misspelling in recordedBy

TENN-L-0000032_lg character encoding in verbatimScientificName

TENN-L-0000033_lg separated datasetName into two columns
separated fieldNotes into two columns

TENN-L-0000041_lg character encoding in datasetName
misspelling in verbatimLocality

TENN-L-0000044_lg character encoding in datasetName

TENN-L-0000045_lg separated verbatimLocality into two columns

TENN-L-0000046_lg character encoding in datasetName

TENN-L-0000047_lg character encoding in datasetName

TENN-L-0000048_lg misspelling in verbatimLocality

TENN-L-0000049_lg separated verbatimLocality into two columns

TENN-L-0000051_lg character encoding in verbatimLocality

TENN-L-0000052_lg character encoding in verbatimScientificName
character encoding in datasetName
character encoding in habitat
character encoding in verbatimLocality
character encoding in recordedBy

TENN-L-0000053_lg character encoding in recordNumber

TENN-L-0000054_lg character encoding in datasetName

TENN-L-0000056_lg edited recordedBy

TENN-L-0000057_lg character encoding in verbatimLocality
misspelling in verbatimInstitution

TENN-L-0000058_lg separated datasetName into two columns
character encoding in verbatimInstitution

TENN-L-0000059_lg character encoding in stateProvince
character encoding in verbatimScientificName
character encoding in verbatimCoordinates
misspelling in recordedBy

TENN-L-0000061_lg edited verbatimLocality
misspelling in recordedBy

TENN-L-0000063_lg separated datasetName into two columns
character encoding in identificationRemarks

TENN-L-0000064_lg character encoding in verbatimScientificName

TENN-L-0000065_lg character encoding in verbatimScientificName

TENN-L-0000068_lg edited habitat
character encoding in verbatimInstitution

TENN-L-0000072_lg separated verbatimLocality into two columns
misspelling in country
edited verbatimScientificName
character encoding in verbatimInstitution

TENN-L-0000073_lg misspelling in verbatimLocality
character encoding in verbatimCoordinates
misspelling in recordedBy

TENN-L-0000074_lg character encoding in recordedBy
character encoding in verbatimScientificName
character encoding in verbatimLocality

TENN-L-0000075_lg character encoding in datasetName
misspelling in verbatimScientificName
separated verbatimLocality into two columns
character encoding in both verbatimLocality columns
character encoding in verbatimCoordinates

TENN-L-0000076_lg misspelling in datasetName
character encoding in verbatimScientificName
separated verbatimLocality into two columns
character encoding in both verbatimLocality columns
character encoding in recordedBy

TENN-L-0000077_lg character encoding in county
character encoding in verbatimLocality
character encoding in catalogNumber

TENN-L-0000079_lg character encoding in verbatimInstitution

TENN-L-0000080_lg character encoding in catalogNumber

TENN-L-0000083_lg character encoding in verbatimScientificName

TENN-L-0000084_lg character encoding in datasetName
character encoding in verbatimScientificName
character encoding in verbatimLocality

TENN-L-0000087_lg character encoding in recordNumber
character encoding in habitat
character encoding in verbatimLocality
character encoding in verbatimInstitution

TENN-L-0000089_lg misspelling in country
separated verbatimLocality into two columns
misspelling in verbatimLocality
misspelling in verbatimInstitution
misspelling in datasetName

TENN-L-0000090_lg character encoding in verbatimInstitution

TENN-L-0000091_lg character encoding in datasetName
character encoding in verbatimScientificName
character encoding in catalogNumber

TENN-L-0000093_lg edited verbatimLocality
character encoding in catalogNumber

TENN-L-0000095_lg character encoding in verbatimScientificName
edited country
character encoding in verbatimLocality

TENN-L-0000097_lg character encoding in verbatimScientificName

TENN-L-0000098_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimEventDate
character encoding in verbatimCoordinates

TENN-L-0000099_lg separated datasetName into two columns
character encoding in stateProvince
misspelling in verbatimScientificName
character encoding in verbatimLocality
character encoding in verbatimLatitude
character encoding in catalogNumber

WIS-L-0011726_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
misspelling in verbatimElevation
character encoding in recordedBy

WIS-L-0011727_lg character encoding in verbatimScientificName
separated verbatimLocality into two columns
misspelling in verbatimLocality
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0011728_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
character encoding in habitat

WIS-L-0011729_lg separated verbatimLocality into two columns
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0011730_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
misspelling in habitat

WIS-L-0011731_lg character encoding in verbatimScientificName
character encoding in identifiedBy
separated verbatimLocality into two columns
misspelling in associatedTaxa
misspelling in verbatimElevation

WIS-L-0011732_lg separated verbatimLocality into two columns
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0011733_lg character encoding in verbatimLocality
character encoding in habitat
character encoding in verbatimCoordinates

WIS-L-0011734_lg character encoding in verbatimScientificName
character encoding in verbatimCoordinates
character encoding in habitat
character encoding in recordNumber
separated verbatimLocality into two columns

WIS-L-0011736_lg character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0012025_lg character encoding in verbatimScientificName
separated verbatimLocality into two columns
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0012026_lg character encoding in verbatimScientificName

WIS-L-0012027_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates

WIS-L-0012028_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
character encoding in habitat
misspelling in verbatimElevation

WIS-L-0012029_lg character encoding in verbatimScientificName
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
character encoding in habitat

WIS-L-0012030_lg character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
character encoding in habitat

WIS-L-0012031_lg character encoding in verbatimScientificName
separated verbatimLocality into two columns
character encoding in verbatimLatitude
character encoding in verbatimLongitude
character encoding in verbatimCoordinates
character encoding in habitat

WIS-L-0012032_lg character encoding in verbatimScientificName
separated verbatimLocality into two columns
character encoding in verbatimLatitude
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000036_lg verbatimEventDate (format)
+
WIS-L-0012033_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
misspelling in verbatimElevation
  
TENN-L-0000036_lg verbatimEventDate (format)
+
WIS-L-0012034_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000045_lg recordNumber (null)
+
WIS-L-0012035_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
misspelling in verbatimLocality
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.
+
WIS-L-0012036_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)
+
WIS-L-0012037_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.
+
WIS-L-0012039_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLocality
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)
+
WIS-L-0012040_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.
+
WIS-L-0012041_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.
+
WIS-L-0012042_lg character encoding in datasetName
 +
character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)
+
WIS-L-0012043_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
 +
separated verbatimLocality into two columns
  
TENN-L-0000063_lg verbatimLocality contains scientific name
+
WIS-L-0012044_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000063_lg verbatimScientificName (Amherst)
+
WIS-L-0012045_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)
+
WIS-L-0012046_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp) verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)
+
WIS-L-0012047_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)
+
WIS-L-0012048_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40� N) is in text file.
+
WIS-L-0012049_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.
+
WIS-L-0012050_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000077_lg identifiedBy (Date) in the csv file
+
WIS-L-0012051_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.
+
WIS-L-0012052_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000083_lg no recordNumber in the csv file; DateIdentified (format)
+
WIS-L-0012053_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)
+
WIS-L-0012054_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000084_lg scientificName (null)
+
WIS-L-0012055_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file
+
WIS-L-0012056_lg separated verbatimLocality into two columns
 +
character encoding in habitat
  
TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.
+
WIS-L-0012057_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimCoordinates
  
WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.
+
WIS-L-0012058_lg separated verbatimLocality into two columns
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.
+
WIS-L-0012059_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
WIS-L-0012026_lg no datasetName
+
WIS-L-0012060_lg character encoding in verbatimScientificName
 +
character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
  
WIS-L-0012038_lg no verbatimCoordinates
+
WIS-L-0012061_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
removed coordinates in verbatimLocality
 +
character encoding in associatedTaxa
  
WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.
+
WIS-L-0012062_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
 +
character encoding in habitat
 +
character encoding in verbatimInstitution
  
WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file
+
WIS-L-0012063_lg character encoding in verbatimScientificName
 +
character encoding in verbatimLocality
 +
removed coordinates in verbatimLocality
 +
character encoding in associatedTaxa
  
WIS-L-0012045_lg verbatimCoordinates concatenation
+
WIS-L-0012064_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
WIS-L-0012051_lg dateIdentified (format)
+
WIS-L-0012065_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in habitat
  
WIS-L-0012055_lg verbatimEventDate (format)
+
WIS-L-0012067_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is  (2003-July-19) in the csv file
+
WIS-L-0012068_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
WIS-L-0012056_lg dateIdentified (format)
+
WIS-L-0012069_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
WIS-L-0012057_lg no datesetName
+
WIS-L-0012070_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
WIS-L-0012064_lg verbatimCoordinates concatenation
+
WIS-L-0012071_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
removed coordinates in verbatimLocality
 +
character encoding in associatedTaxa
  
WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file
+
WIS-L-0012073_lg character encoding in verbatimCoordinates
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
  
WIS-L-0012074_lg county (null)
+
WIS-L-0012074_lg character encoding in verbatimCoordinates
 +
character encoding in habitat
 +
misspelling in verbatimLocality
  
WIS-L-0012074_lg county (null)
+
WIS-L-0012075_lg character encoding in verbatimScientificName
 +
character encoding in verbatimCoordinates
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
  
WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates (Qianjin)
+
WIS-L-0012076_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimCoordinates
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimLongitude
  
 +
WIS-L-0012077_lg character encoding in verbatimLocality
 +
character encoding in habitat
  
'''Gold OCR Errors'''
+
WIS-L-0012078_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.
+
WIS-L-0012082_lg character encoding in verbatimScientificName
 +
character encoding in verbatimCoordinates
 +
character encoding in verbatimLatitude
 +
character encoding in verbatimEventDate
 +
character encoding in recordNumber
  
WIS-L-0012026_lg.txt::  Several errors:  Replaced the "N" in Latitude with a "K".  Question mark instead of apostrophe in Longitude.  Sandra Looman replace with Sandra Lcoman.  Two dots after the date.
+
WIS-L-0012084_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
TENN-L-0000029_lg.txt adds a "1" to the scientificName ("Actinogyra muhlenbergii 1 (Ach.) Schol.").
+
WIS-L-0012085_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
 +
character encoding in verbatimLongitude
 +
character encoding in verbatimCoordinates
  
NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (converted umlaut "ü" to "u".  We may want to do this, but if we do it should be standardized and consistent across all the labels.  Same for NY01075791_lg.txt, and several others in the series.
+
WIS-L-0012086_lg character encoding in verbatimScientificName
 +
separated verbatimLocality into two columns
  
 
== Parameters ==
 
== Parameters ==

Revision as of 23:03, 24 March 2013

The 2013 AOCR Challenge

The Challenge

One of the most significant areas of interest for improving the use of OCR output is parsing. Digitization, curation, and dissemination of specimen data from biodiversity museum collections can be sped up if OCR output can be parsed faster and more accurately and packaged into semantically meaningful units for insertion into a database.

The Specific Task

Given a set of images, either parse the existing OCR output or rerun OCR with the software of your choice and parse the new output, populating as many of the selected Darwin Core (and other) data elements as possible in a CSV file. These participant-generated CSV files will be compared against human hand-parsed gold and silver CSV files.
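
As a rough illustration of the comparison step, the sketch below checks a participant CSV against a gold CSV column by column. It assumes one header row and one data row per label file and uses exact string matching after trimming whitespace; the actual scoring method for the challenge is not specified here, and the helper names `read_single_record` and `field_matches` are hypothetical.

```python
import csv
import io


def read_single_record(csv_text):
    """Return the first data row of a header-plus-row CSV as a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return next(reader, {})


def field_matches(gold, parsed):
    """Map each gold column name to True if the parsed value matches exactly.

    Assumption: exact match after stripping surrounding whitespace.
    """
    return {
        column: (parsed.get(column) or "").strip() == (value or "").strip()
        for column, value in gold.items()
    }


# Made-up example rows using Darwin Core style column names.
gold_text = "catalogNumber,country,stateProvince\n01075761,USA,New York\n"
parsed_text = "catalogNumber,country,stateProvince\n0107576,USA,New York\n"

result = field_matches(read_single_record(gold_text),
                       read_single_record(parsed_text))
for column, ok in result.items():
    print(column, "match" if ok else "MISMATCH")
```

Exact matching is the strictest choice; a fuzzier comparison (for example, edit distance per field) may suit noisy OCR output better.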

Three Data Sets

There are three data sets, that is, three different sets of images of museum specimen labels. Participants, working alone or in groups, may work on one or more data sets as they choose. The sets have been ranked easy, medium, or hard as an estimate of how difficult it might be to get good parsed data from the OCR output of each data set.

Set 1 (easy) 
10,000 images of lichen and bryophyte packet labels from the Lichens, Bryophytes and Climate Change (LBCC) TCN. These are considered easy because the jpg images show the label only, and the data on the label is mostly typed or printed with little or no handwriting present.
Set 2 (medium) 
5,000 Botanical Research Institute of Texas (BRIT) Herbarium and 5,000 New York Botanical Garden Herbarium specimen sheets. These are full sheets; again, most have been pre-selected to focus on labels containing mostly printed or typed text and little handwriting. Some exceptions are included to make a more realistic (and more difficult) data set.
Set 3 (hard)
Several thousand images from the Essig Museum and the CalBug project. The gold set has not yet been created for these (in progress). Silver set creation needs to be discussed.

The Process

For each of the three image data sets, 200 images were selected (hand-picked) for creating a human hand-parsed standard for metrics. Three different files have been created for each of these selected images.

Perfect OCR text files 
Hand-transcribed from each image, these text files represent faithfully (exactly) what is in the image and are supposed to reflect what the output would look like if the OCR understood all the data in the image (including the handwriting).
Gold CSV files 
These Gold CSV files have Darwin Core element column headers, with the data parsed into the appropriate columns. The data that populates these Gold CSV files comes from the hand-transcribed gold text files.
Silver CSV files 
These Silver CSV files have the same Darwin Core element column headers, with the data parsed into the appropriate columns. Here, however, the data comes from the OCR output "as is": the same data from the same images, including any OCR errors, is captured in each silver CSV.

Accessing the Data Sets

  • An AOCR VM is set up for all participants.
    • host server name: aocr1.acis.ufl.edu
    • user name and password given to you at our first meeting and via email.
  • Software and configuration
    • services: ftp, ssh, mysql, apache
    • ocr software: tesseract, jocr (gocr), ocropus, imagemagick, zbar
    • MySQL username and password are the same as the login; the database is aocr.
    • Apache root directory is /home/aocr/webroot
  • Sample of what you will see there for Set 1 (LBCC TCN lichen bryophyte packet labels):
human hand-typed image data (no errors) in a text file == gold.txt 
sample: ~/datasets/lichens/gold/outputs
human parses data from gold.txt files into a gold csv file (Darwin Core fields) == gold.csv 
sample: ~/datasets/lichens/gold/parsed
OCR (of choice: ABBYY, Tesseract, GOCR/JOCR, OCRopus, Omnipage) run on these images, output to text files == silver.txt 
sample: ~/datasets/lichens/silver/outputs
human parses "dirty" OCR output from silver.txt into the same Darwin Core fields == silver.csv 
sample: ~/datasets/lichens/silver/parsed

Image Data Sets on the AOCR VM

Data set 1 Lichen Images
/home/aocr/datasets/lichens/inputs/raw
Data set 1 Lichen OCR output text files for parsing
/home/aocr/datasets/lichens/outputs/tesseract
/home/aocr/datasets/lichens/outputs/abby
/home/aocr/datasets/lichens/outputs/gocr
/home/aocr/datasets/lichens/outputs/ocrad
/home/aocr/datasets/lichens/outputs/ocropus
Data set 1 Lichen Authority Files
/home/aocr/datasets/lichens/authorityfiles
Data set 2 Herbarium Sheet Images
10000+ images in /home/aocr/datasets/herbs/inputs/raw
5000 are from NYBG in home/aocr/sgottschalk_images.tar.gz
Data set 2 Herbarium Sheet OCR output text files for parsing
/home/aocr/datasets/herbs/outputs/gocr
/home/aocr/datasets/herbs/outputs/ocrad
/home/aocr/datasets/herbs/outputs/ocropus
/home/aocr/datasets/herbs/outputs/tesseract
SAMPLE IMAGE parsed in the SAMPLE CSV next.
SAMPLE PARSED CSV FILE to show column headers and values
Data set 3 Entomology Images
/home/aocr/datasets/ent/inputs/raw
/home/aocr/oboyski
or see /home/aocr/oboyski_images.tar.gz
Data set 3 Entomology OCR output ABBYY text files for parsing
/home/aocr/datasets/ent/outputs/abbyy

Dataset Errata

  • known / discovered errors in the .txt, .csv files as they are found.

Gold Parsing Errors

Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl)
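
For illustration, a parser could derive decimalLatitude and decimalLongitude from the verbatim values along the following lines. This is only a sketch: it assumes the verbatim string looks like `38°42'20"N` or `60° 33.579'N` (degrees, optional minutes, optional seconds, then a hemisphere letter), and the function name `to_decimal` is hypothetical. Any other pattern returns `None` rather than a guess.

```python
import re

# Degrees, optional minutes ('), optional seconds ("), hemisphere letter.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*'\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*"\s*)?
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE,
)


def to_decimal(verbatim):
    """Convert a verbatim latitude or longitude to signed decimal degrees."""
    match = _DMS.match(verbatim)
    if match is None:
        return None  # unrecognized layout: leave the decimal field empty
    value = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    if match["hem"] in "SW":
        value = -value  # south and west are negative by convention
    return round(value, 6)


print(to_decimal('38°42\'20"N'))
print(to_decimal("60° 33.579'N"))
print(to_decimal('83°08\'25"W'))
```

Because the pattern is strict, a label with a typo such as `83°08'25'W` (apostrophe in place of the double quote) is rejected instead of being silently misread.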

This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather MinimumElevationInMeters, and MaximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl)
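
A minimal sketch of that idea, assuming the verbatim elevation carries an explicit unit of "m" or "ft" (as in "750 m" or "Alt.: 6000 ft"): the function below returns a plain number in meters that could populate minimumElevationInMeters, while verbatimElevation keeps the original text. The name `elevation_in_meters` is hypothetical, and values without a recognized unit are left as `None` rather than assumed to be meters.

```python
import re

# A number followed by a unit of meters ("m") or feet ("ft").
_ELEVATION = re.compile(r"(\d+(?:\.\d+)?)\s*(m|ft)\b", re.IGNORECASE)
FEET_TO_METERS = 0.3048


def elevation_in_meters(verbatim):
    """Return the elevation as a float in meters, or None if no unit is found."""
    match = _ELEVATION.search(verbatim)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2).lower() == "ft":
        value *= FEET_TO_METERS
    return round(value, 1)


print(elevation_in_meters("750 m"))
print(elevation_in_meters("Alt.: 6000 ft"))
```

Elevation ranges (for example "700-750 m") would need separate handling to fill both the minimum and maximum fields; this sketch only reads the first number that has a unit.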

Inconsistency in the Gold Parsed labels for Country. If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA. Some Gold parsed results leave it blank, some fill it in with "USA", or "United States", though neither of these are on the label. I think it is valid to fill it in, but it should be consistent. (Daryl)

Many Gold Parsed Tennessee lichen labels have country errors. Examples:

-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl)

-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl)


Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:

-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard Darwin Core. Verbatim would be "Nov. 12, 1939"; Darwin Core would be "1939-11-12"; the file lists "1939-November-12".

-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963

-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl)
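
To make the two target formats concrete, here is a small sketch that converts a verbatim label date into the ISO form (YYYY-MM-DD) used in the Darwin Core examples above. It assumes English month names and the two layouts seen in these examples (month-day-year and day-month-year); `to_iso_date` is a hypothetical helper, and `strptime` month matching depends on the process locale being English or "C".

```python
from datetime import datetime

# Layouts tried in order after periods and commas are stripped.
_FORMATS = ("%b %d %Y", "%d %b %Y", "%B %d %Y", "%d %B %Y")


def to_iso_date(verbatim):
    """Return YYYY-MM-DD for a verbatim date, or None if no layout fits."""
    cleaned = " ".join(verbatim.replace(".", " ").replace(",", " ").split())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for label_date in ("Nov. 12, 1939", "3 Feb. 1963", "8 Aug 1954"):
    print(label_date, "->", to_iso_date(label_date))
```

Partial dates such as "Feb. 1898" deliberately return `None` here, since filling in a missing day would be a guess.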

Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.)
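
The doubled double quote is standard CSV escaping for a quote inside a quoted field. The sketch below, which uses Python's `csv` module and a made-up two-column row, shows that a CSV-aware writer produces `""` in the raw file and a CSV-aware reader restores the single `"`. Problems would mainly arise if a file is loaded by splitting on commas instead of using a real CSV parser.

```python
import csv
import io


def roundtrip(fields):
    """Write one row as CSV text, then read it back with a CSV parser."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    raw_line = buffer.getvalue()
    row = next(csv.reader(io.StringIO(raw_line)))
    return raw_line, row


# Hypothetical row: a label identifier and its verbatimCoordinates value.
coords = '38°42\'20"N, 83°08\'25"W'
raw_line, row = roundtrip(["NY01075760_lg", coords])

print(raw_line.strip())   # the raw file text contains doubled quotes
print(row[1] == coords)   # the parsed value has the original single quotes
```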

Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates.

Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period.

Gold Parsed NY01075761_lg.csv corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until then, the parsing should be verbatim (see below under Gold OCR Errors).

Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file.

Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present on the .txt file.

Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio"

Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality.

Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...?

Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA"

Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label.

Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it occurs on some labels but not on others.

Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series.

Gold Parsed CSV Files

There are more errors in the gold CSV files. (Qianjin)

NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998

NY01075760_lg no datasetName

NY01075765_lg verbatimEventDate (Feb. 1898), it should be verbatimEventDate ( Feb 1898.)

NY01075766_lg decimalLatitude (White Horse Beach, between Manomet Pt. and Rocky Pt., Plymouth area), it should be locality or habitat; no catalogNumber

NY01075767_lg verbatimEventDate format

NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)

NY01075768_lg country (canada), it should be (ca.)

NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)

NY01075770_lg habitat (Host determined by A. R. Grant)

NY01075771_lg verbatimCoordinates mixed with verbatimLocality

NY01075779_lg habitat concatenation

NY01075780_lg NEW YOUR BOTANICAL GARDEN

NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.

NY01075797_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075805_lg stateProvince (South Carolina) in the csv file; but it is (S.C.) in the text file.

NY01075812_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075816_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075817_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075818_lg no scientificName

NY01075819_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075820_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

NY01075821_lg scientificName (null)

NY01075821_lg no scientificName

NY01075822_lg no scientificName

NY01075823_lg identifiedBy

TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation

TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).

TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file.

TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.

TENN-L-0000015_lg verbatimInstitution (TENNESSEE (TENN))

TENN-L-0000016_lg verbatimInstitution (HERBARIUM OF THE UNIVERSITY OF TENNESSEE)

TENN-L-0000017_lg verbatimInstitution (University of Tennessee (TENN))

TENN-L-0000018_lg verbatimInstitution (University of Tennessee (TENN))

TENN-L-0000019_lg identifiedBy (Alt.Set.) in the csv file; verbatimEventDate (8 Aug 1954) is mixed with dateIdentified (8 Aug 1954)

TENN-L-0000021_lg verbatimInstitution ((TENN))

TENN-L-0000022_lg verbatimEventDate (23 July 1955) is mixed with dateIdentified (23 July 1955)

TENN-L-0000033_lg no catalogNumber in OCRed text file

TENN-L-0000036_lg verbatimEventDate (format)

TENN-L-0000045_lg recordNumber (null)

TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.

TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)

TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.

TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)

TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.

TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.

TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)

TENN-L-0000063_lg verbatimLocality contains scientific name

TENN-L-0000063_lg verbatimScientificName (Amherst)

TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)

TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp) verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)

TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)

TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40° N) is in the text file.

TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.

TENN-L-0000077_lg identifiedBy (Date) in the csv file

TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.

TENN-L-0000083_lg no recordNumber in the csv file; DateIdentified (format)

TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)

TENN-L-0000084_lg scientificName (null)

TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file

TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.

WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.

WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.

WIS-L-0012026_lg no datasetName

WIS-L-0012038_lg no verbatimCoordinates

WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.

WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file

WIS-L-0012045_lg verbatimCoordinates concatenation

WIS-L-0012051_lg dateIdentified (format)

WIS-L-0012055_lg verbatimEventDate (format)

WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is (2003-July-19) in the csv file

WIS-L-0012056_lg dateIdentified (format)

WIS-L-0012057_lg no datasetName

WIS-L-0012064_lg verbatimCoordinates concatenation

WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file

WIS-L-0012074_lg county (null)

WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates (Qianjin)


Gold OCR Errors

NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.

WIS-L-0012026_lg.txt: Several errors: Replaced the "N" in Latitude with a "K". Question mark instead of apostrophe in Longitude. "Sandra Looman" replaced with "Sandra Lcoman". Two dots after the date.

TENN-L-0000029_lg.txt adds a "1" to the scientificName ("Actinogyra muhlenbergii 1 (Ach.) Schol.").

NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (converted umlaut "ü" to "u"). We may want to do this, but if we do, it should be standardized and consistent across all the labels. Same for NY01075791_lg.txt, and several others in the series.


Silver Parsed CSV Files

There were some errors in the Silver CSV dataset. (Steven C.)


NY01075760_lg character encoding in verbatimScientificName; typos in verbatimCoordinates

NY01075761_lg misspelling in verbatimScientificName

NY01075762_lg misspelling in habitat; misspelling in verbatimLocality

NY01075764_lg misspelling in units for verbatimElevation

NY01075765_lg character encoding in verbatimScientificName; removed extra period in verbatimEventDate

NY01075768_lg separated verbatimLocality data into two columns

NY01075769_lg misspelling in habitat

NY01075770_lg character encoding in verbatimScientificName; character encoding in habitat

NY01075773_lg misspelling in verbatimScientificName; misspelling in verbatimLocality

NY01075774_lg character encoding in verbatimScientificName

NY01075775_lg misspelling in country

NY01075776_lg character encoding in verbatimLocality

NY01075777_lg character encoding in country

NY01075779_lg character encoding in verbatimCoordinates

NY01075780_lg misspelling in verbatimInstitution misspelling in verbatimLocality removed coordinates in verbatimLocality

NY01075781_lg character encoding in verbatimElevation

NY01075782_lg separated verbatimLocality data into two columns removed coordinates in verbatimLocality character encoding in habitat

NY01075786_lg misspelling in verbatimScientificName

NY01075787_lg misspelling in verbatimLocality removed coordinates in verbatimLocality misspelling in verbatimCoordinates misspelling in habitat

NY01075788_lg misspelling in verbatimLocality removed coordinates in verbatimLocality character encoding in verbatimCoordinates

NY01075789_lg misspelling in verbatimLocality removed coordinates in verbatimLocality character encoding in verbatimCoordinates

NY01075790_lg misspelling in habitat separated verbatimLocality data into three columns removed coordinates in verbatimLocality

NY01075791_lg character encoding in verbatimScientificName

NY01075792_lg misspelling in verbatimLocality

NY01075794_lg misspelling in verbatimLocality

NY01075795_lg misspelling in verbatimLocality

NY01075802_lg character encoding in verbatimScientificName

NY01075803_lg created new identifiedBy column created new verbatimScientificName column moved verbatimScientificName data from third row to new column

NY01075805_lg created new verbatimScientificName column moved verbatimScientificName data from third row to new column

NY01075806_lg character encoding in verbatimScientificName

NY01075813_lg misspelling in verbatimLocality

NY01075814_lg misspelling in county misspelling in verbatimLocality removed coordinates in verbatimLocality misspelling in habitat

NY01075817_lg moved verbatimScientificName data to scientificName entered verbatimScientificName

NY01075818_lg misspelling in habitat

NY01075819_lg misspelling in recordedBy

NY01075821_lg misspelling in verbatimLocality removed coordinates in verbatimLocality added coordinates to verbatimCoordinates

NY01075822_lg removed coordinates in verbatimLocality

NY01075823_lg moved identifiedBy and dateIdentified data up one row created new verbatimScientificName column moved verbatimScientificName data from third row to new column

NY01075827_lg misspelling in county misspelling in verbatimLocality

NY01075828_lg misspelling in verbatimLocality removed coordinates in verbatimLocality

NY01075829_lg misspelling in habitat

NY01075831_lg misspelling in verbatimLocality removed coordinates in verbatimLocality

NY01075837_lg misspelling in county

TENN-L-0000001_lg character encoding in occurrenceRemarks misspelling in habitat character encoding in verbatimLocality

TENN-L-0000002_lg character encoding in verbatimScientificName misspelling in habitat

TENN-L-0000004_lg misspelling in habitat misspelling in verbatimInstitution

TENN-L-0000005_lg misspelling in datasetName misspelling in occurrenceRemarks character encoding in verbatimLocality

TENN-L-0000006_lg misspelling in verbatimElevation edited verbatimEventDate

TENN-L-0000007_lg separated verbatimLocality into two columns misspellings in both verbatimLocality columns

TENN-L-0000009_lg character encoding in habitat character encoding in catalogNumber

TENN-L-0000010_lg separated verbatimLocality into two columns

TENN-L-0000012_lg character encoding in datasetName character encoding in occurrenceRemarks

TENN-L-0000013_lg misspelling in occurrenceRemarks misspelling in verbatimLocality

TENN-L-0000014_lg misspelling in datasetName misspelling in fieldNotes character encoding in verbatimLocality separated recordedBy into two columns

TENN-L-0000022_lg character encoding in recordedBy

TENN-L-0000027_lg character encoding in verbatimScientificName

TENN-L-0000028_lg character encoding in verbatimScientificName

TENN-L-0000029_lg misspelling in recordedBy

TENN-L-0000032_lg character encoding in verbatimScientificName

TENN-L-0000033_lg separated dataSetName into two columns separated fieldNotes into two columns

TENN-L-0000041_lg character encoding in datasetName misspelling in verbatimLocality

TENN-L-0000044_lg character encoding in datasetName

TENN-L-0000045_lg separated verbatimLocality into two columns

TENN-L-0000046_lg character encoding in datasetName

TENN-L-0000047_lg character encoding in datasetName

TENN-L-0000048_lg misspelling in verbatimLocality

TENN-L-0000049_lg separated verbatimLocality into two columns

TENN-L-0000051_lg character encoding in verbatimLocality

TENN-L-0000052_lg character encoding in verbatimScientificName character encoding in datasetName character encoding in habitat character encoding in verbatimLocality character encoding in recordedBy

TENN-L-0000053_lg character encoding in recordNumber

TENN-L-0000054_lg character encoding in datasetName

TENN-L-0000056_lg edited recordedBy

TENN-L-0000057_lg character encoding in verbatimLocality misspelling in verbatimInstitution

TENN-L-0000058_lg separated dataSetName into two columns character encoding in verbatimInstitution

TENN-L-0000059_lg character encoding in stateProvince character encoding in verbatimScientificName character encoding in verbatimCoordinates misspelling in recordedBy

TENN-L-0000061_lg edited verbatimLocality misspelling in recordedBy

TENN-L-0000063_lg separated dataSetName into two columns character encoding in identificationRemarks

TENN-L-0000064_lg character encoding in verbatimScientificName

TENN-L-0000065_lg character encoding in verbatimScientificName

TENN-L-0000068_lg edited habitat character encoding in verbatimInstitution

TENN-L-0000072_lg separated verbatimLocality into two columns misspelling in country edited verbatimScientificName character encoding in verbatimInstitution

TENN-L-0000073_lg misspelling in verbatimLocality character encoding in verbatimCoordinates misspelling in recordedBy

TENN-L-0000074_lg character encoding in recordedBy character encoding in verbatimScientificName character encoding in verbatimLocality

TENN-L-0000075_lg character encoding in datasetName misspelling in verbatimScientificName separated verbatimLocality into two columns character encoding in both verbatimLocality columns character encoding in verbatimCoordinates

TENN-L-0000076_lg misspelling in datasetName character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in both verbatimLocality columns character encoding in recordedBy

TENN-L-0000077_lg character encoding in county character encoding in verbatimLocality character encoding in catalogNumber

TENN-L-0000079_lg character encoding in verbatimInstitution

TENN-L-0000080_lg character encoding in catalogNumber

TENN-L-0000083_lg character encoding in verbatimScientificName

TENN-L-0000084_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in verbatimLocality

TENN-L-0000087_lg character encoding in recordNumber character encoding in habitat character encoding in verbatimLocality character encoding in verbatimInstitution

TENN-L-0000089_lg misspelling in country separated verbatimLocality into two columns misspelling in verbatimLocality misspelling in verbatimInstitution misspelling in datasetName

TENN-L-0000090_lg character encoding in verbatimInstitution

TENN-L-0000091_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in catalogNumber

TENN-L-0000093_lg edited verbatimLocality character encoding in catalogNumber

TENN-L-0000095_lg character encoding in verbatimScientificName edited country character encoding in verbatimLocality

TENN-L-0000097_lg character encoding in verbatimScientificName

TENN-L-0000098_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimEventDate character encoding in verbatimCoordinates

TENN-L-0000099_lg separated dataSetName into two columns character encoding in stateProvince misspelling in verbatimScientificName character encoding in verbatimLocality character encoding in verbatimLatitude character encoding in catalogNumber

WIS-L-0011726_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in verbatimElevation character encoding in recordedBy

WIS-L-0011727_lg character encoding in verbatimScientificName separated verbatimLocality into two columns misspelling in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011728_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0011729_lg separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011730_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in habitat

WIS-L-0011731_lg character encoding in verbatimScientificName character encoding in identifiedBy separated verbatimLocality into two columns misspelling in associatedTaxa misspelling in verbatimElevation

WIS-L-0011732_lg separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011733_lg character encoding in verbatimLocality character encoding in habitat character encoding in verbatimCoordinates

WIS-L-0011734_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in habitat character encoding in recordNumber separated verbatimLocality into two columns

WIS-L-0011736_lg character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012025_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012026_lg character encoding in verbatimScientificName

WIS-L-0012027_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012028_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat misspelling in verbatimElevation

WIS-L-0012029_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012030_lg character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012031_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat


WIS-L-0012032_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012033_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in verbatimElevation

WIS-L-0012034_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012035_lg character encoding in verbatimScientificName separated verbatimLocality into two columns misspelling in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012036_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012037_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012039_lg character encoding in verbatimScientificName character encoding in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012040_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012041_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012042_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012043_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat separated verbatimLocality into two columns

WIS-L-0012044_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012045_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012046_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012047_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012048_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012049_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012050_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012051_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012052_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012053_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012054_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012055_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012056_lg separated verbatimLocality into two columns character encoding in habitat

WIS-L-0012057_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimCoordinates

WIS-L-0012058_lg separated verbatimLocality into two columns character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012059_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012060_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012061_lg character encoding in verbatimScientificName separated verbatimLocality into two columns removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012062_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat character encoding in verbatimInstitution

WIS-L-0012063_lg character encoding in verbatimScientificName character encoding in verbatimLocality removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012064_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012065_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in habitat

WIS-L-0012067_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012068_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012069_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012070_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012071_lg character encoding in verbatimScientificName separated verbatimLocality into two columns removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012073_lg character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012074_lg character encoding in verbatimCoordinates character encoding in habitat misspelling in verbatimLocality

WIS-L-0012075_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012076_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012077_lg character encoding in verbatimLocality character encoding in habitat

WIS-L-0012078_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012082_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimEventDate character encoding in recordNumber

WIS-L-0012084_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012085_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012086_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

Parameters

  • Parsers should produce, at a minimum, CSV output whose column headers are Darwin Core (http://rs.tdwg.org/dwc/terms/) elements, plus some extended element names.
    • List of target core data elements
    • The full set of valid categories is defined in a definition document in the parsing directory of the A-OCR virtual machine.
  • All of the information on the label needs to be classified so that it can be imported into a database and shared with others over the Internet. The input to the parsing process is OCR text.
  • For the hackathon there will be at least 600 examples of OCR text, in 3 groups of 200, that have been previously properly classified/parsed by humans.
    • This parsed text may be used for training some learning algorithms.
    • This set will also be used for evaluation of performance of parsing algorithms.
  • Overfitting is a potential problem, so at the time of the hackathon we may provide additional testing records for evaluation.
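
As a rough illustration of the expected output shape, the sketch below writes parsed records to CSV with Darwin Core element names as column headers. The FIELDS list is only a small assumed subset (the full set of valid categories is in the definition document), and the function name and sample record are made up for the example.

```python
import csv
import io

# Assumed subset of column headers; the real list is in the definition document.
FIELDS = [
    "catalogNumber",
    "scientificName",
    "country",
    "stateProvince",
    "county",
    "verbatimEventDate",
]


def write_parsed_csv(records, out):
    """Write one CSV row per parsed label, leaving missing fields empty."""
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Illustrative record only, not taken from any dataset.
buffer = io.StringIO()
write_parsed_csv([{"catalogNumber": "X-0001", "stateProvince": "Alaska"}], buffer)
print(buffer.getvalue())
```

Leaving unparsed fields empty, rather than omitting the column, keeps every output file aligned with the same header row for later comparison.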

Scope

  • There are several potential types of input to the parsing algorithms.
    • The most basic form of input is OCR text in UTF-8 format from multiple engines.
    • There may optionally be OCR with exact spatial information about the location of characters on the original image.
      • This will allow some algorithms to exploit spatial information to identify elements. This format is, however, not a main focus for this hackathon.
  • Some data dictionaries and authority files may be provided (or you may use those you have access to) to help clean up the OCR output before parsing.
    • Lichen authority files can be found in: ~/datasets/lichens/authorityfiles/
  • Those wishing to pursue other goals such as image segmentation, finding specific elements, or improving usability & user interfaces to the OCR output and parsing tools are encouraged to do so and report back to the group at the hackathon.

Metrics and Evaluation

  • CSV files generated by participants will be compared with CSV files created by humans.
  • The metrics evaluation code will be written in JavaScript or Python by Alex Thompson (iDigBio IT). Participants can then run the evaluation code as desired in successive attempts to improve their results.
    • a Presence-Absence matrix
    • Confusion Matrix
    • F-Score (weighs correct / incorrect answers)
    • Time needed to generate CSV output from running algorithms
      • (MG) suggested adding this metric if possible, at our 11 Jan 2013 virtual meeting.
  • Graphics may be created
    • For example, with an F-score for each dwc element entry, we can generate a graph / histogram across all participants.
Evaluation 
We will attempt to provide services that can validate the outcomes of hackathon deliverables. This hackathon is not structured as a competition, but we felt it would be beneficial for participants to have some baseline to evaluate the effectiveness of their methods.
OCR Text Evaluation 
Evaluation of OCR output will be based on a comparison to the Gold hand-typed outputs, using confusion-matrix-like criteria for evaluating word presence, word correctness, and the avoidance of non-text garbage regions. We will attempt to avoid penalizing attempts at text recognition in barcode and handwritten regions.
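
One simple way to approximate word presence and garbage avoidance is a word-level overlap between the OCR text and the gold text. This is only a sketch under that assumption, not the actual evaluation code; the function name word_scores and the sample strings are illustrative.

```python
from collections import Counter


def word_scores(ocr_text: str, gold_text: str):
    """Return (precision, recall) of OCR words against gold words."""
    ocr_words = Counter(ocr_text.split())
    gold_words = Counter(gold_text.split())
    # Multiset intersection: each gold word can be matched at most once.
    matched = sum((ocr_words & gold_words).values())
    ocr_total = sum(ocr_words.values())
    gold_total = sum(gold_words.values())
    # Low precision suggests garbage output; low recall suggests missed words.
    precision = matched / ocr_total if ocr_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    return precision, recall


# Illustrative strings: one misread word and one garbage token in the OCR text.
gold = "collected by Sandra Looman"
ocr = "collected by Sandra Lcoman ~~"
print(word_scores(ocr, gold))
```

This sketch ignores word order and treats a misspelled word as both a miss and a false hit; a real evaluation might also skip barcode and handwritten regions as described above.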
Parsed Field Evaluation 
Evaluation of the effectiveness of parsing will be calculated based on a confusion matrix. Rows are named with each of the possible element names for parts of a label. Columns are also these same names. Counts along the diagonal represent the number of items that were tagged correctly. For example, a county that is correctly labeled as a county will add one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides a count of correct classifications and counts of false positives and false negatives. We will calculate precision, recall, F-score, and potentially other metrics.
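
The confusion matrix described above can be sketched as counts keyed by (gold element, predicted element) pairs, from which per-element precision, recall, and F-score follow. The function names and sample pairs are illustrative, and the balanced F1 form of the F-score is an assumption; the actual evaluation code may weight things differently.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (gold_label, predicted_label) pairs; equal labels are the diagonal."""
    return Counter(pairs)


def precision_recall_f1(counts, label):
    """Compute precision, recall, and F1 for one element name."""
    tp = counts[(label, label)]
    # False positives: other elements wrongly tagged as this label (column).
    fp = sum(n for (gold, pred), n in counts.items()
             if pred == label and gold != label)
    # False negatives: this element tagged as something else (row).
    fn = sum(n for (gold, pred), n in counts.items()
             if gold == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative pairs: one county was mistakenly tagged as stateProvince.
pairs = [
    ("county", "county"),
    ("county", "county"),
    ("county", "stateProvince"),
    ("stateProvince", "stateProvince"),
]
counts = confusion_counts(pairs)
print(precision_recall_f1(counts, "county"))
print(precision_recall_f1(counts, "stateProvince"))
```

In this example the mistagged county lowers recall for county and lowers precision for stateProvince, which matches the row and column reading of the matrix.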
