Hackathon Challenge: Difference between revisions

(24 intermediate revisions by 5 users not shown)
Line 54: Line 54:
::::/home/aocr/datasets/herbs/outputs/ocropus
::::/home/aocr/datasets/herbs/outputs/tesseract
:::;'''[https://www.idigbio.org/wiki/images/e/e4/01498198.jpg SAMPLE IMAGE]''' parsed in the SAMPLE CSV next.
:::;'''[[Media:01498198.jpg|SAMPLE IMAGE]]''' parsed in the SAMPLE CSV next.
:::;'''[https://www.idigbio.org/wiki/images/6/62/SampleCSV.jpg SAMPLE PARSED CSV FILE]''' to show '''column headers and values'''
:::;'''[[Media:SampleCSV.jpg|SAMPLE PARSED CSV FILE]]''' to show '''column headers and values'''


:::;Data set 3 Entomology Images: /home/aocr/datasets/ent/inputs/raw
Line 62: Line 62:
:::;Data set 3 Entomology OCR output ABBYY text files for parsing: /home/aocr/datasets/ent/outputs/abbyy


=== [[Dataset Errata]] ===
 
*Known and newly discovered errors in the .txt and .csv files, listed as they are found.
'''Gold Parsing Errors'''
Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude.  This seems especially true for the New York labels. (Daryl)
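As a rough illustration of how the missing decimal values could be calculated from a verbatim coordinate, here is a minimal Python sketch. The function name dms_to_decimal and the regular expression are assumptions made for this example, not part of any hackathon tool, and the pattern only covers the degrees-minutes-seconds style with a trailing hemisphere letter.

```python
import re

# Hypothetical pattern: degrees, minutes, seconds, then a hemisphere letter.
# The seconds mark may be ", ' or a doubled quote, since the labels vary.
_DMS = re.compile(r"""(\d+)\s*°\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)\s*["']{1,2}\s*([NSEW])""")

def dms_to_decimal(text):
    """Return decimal degrees for the first DMS value in text, or None."""
    m = _DMS.search(text)
    if m is None:
        return None
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
    # Southern and western hemispheres are negative in decimal notation.
    if hemi in ("S", "W"):
        value = -value
    return round(value, 6)
```

A parser could call this once on verbatimLatitude and once on verbatimLongitude and write the results to decimalLatitude and decimalLongitude, leaving the verbatim fields untouched.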
This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters.  It should therefore be expressed as "750" rather than "750 m".  verbatimElevation, of course, should retain the "m" if it was present on the label.  (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.)  Not sure if this is something to change on the labels, but it is worth being aware of.  I think parsing programs should generate the Darwin Core fields. (Daryl)
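A small sketch of the numeric-elevation idea, in Python. The function name elevation_in_meters is hypothetical, and the sketch assumes the value is either a bare number or a number followed by a meter unit; anything else is left unparsed rather than guessed.

```python
import re

# Assumed input shapes: "750", "750 m", "750 m.", "750 meters".
_ELEV = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(?:m|meters|metres)?\.?\s*", re.IGNORECASE)

def elevation_in_meters(verbatim):
    """Return the elevation as a float in meters, or None if not recognized."""
    m = _ELEV.fullmatch(verbatim)
    if m is None:
        return None  # other units (e.g. feet) are deliberately not converted here
    return float(m.group(1))
```

The numeric result would go into minimumElevationInMeters, while verbatimElevation keeps the label text such as "750 m".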
Inconsistency in the Gold Parsed labels for Country.  If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA.  Some Gold Parsed results leave it blank, while others fill it in with "USA" or "United States", though neither of these is on the label.  I think it is valid to fill it in, but it should be consistent. (Daryl)
Many Gold Parse Tennessee lichen labels have country errors.  Examples:
-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label it is "U.S.A." (with periods).  Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl)
-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA".  Again, maybe this is OK, but it should be consistent.  (Daryl)
Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified.  Examples:
-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format.  Verbatim would be "Nov. 12, 1939"; DarwinCore would be 1939-11-12; the value listed is 1939-November-12.
-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as "3 Feb. 1963".
-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl)
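The date forms quoted in the examples above can be normalized to the DarwinCore (ISO 8601) form along these lines. This is only a sketch: the function name to_iso_date and the list of formats are assumptions drawn from the few examples here, and strptime month names depend on an English locale.

```python
from datetime import datetime

# Formats seen in the examples above; a real parser would need more.
_FORMATS = (
    "%b. %d, %Y",   # Nov. 12, 1939
    "%d %b. %Y",    # 3 Feb. 1963
    "%d %b %Y",     # 8 Aug 1954
    "%Y-%B-%d",     # 1939-November-12
    "%Y-%b-%d",     # 1954-Aug-8
)

def to_iso_date(verbatim):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = verbatim.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None for unrecognized input keeps the parser from inventing a date; the verbatim string can still be stored unchanged.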
Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates:  38°42'20"N, 83°08'25'W is rendered as 38°42'20""N  83°08'25""W.  (Note also that each double quote is replaced with two double quotes.  This may be necessary to preserve the quote-delimited, comma-separated fields, but it could cause problems when uploading to a database.  It is not presented here as an error, but we should be aware of the possible implications.)
Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates.
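On the doubled double quotes noted for NY01075760_lg.csv: doubling is the usual CSV escape for a quote inside a quoted field, so a CSV-aware reader should restore the original text. The Python sketch below, using the standard csv module, shows the round trip; the helper name roundtrip is hypothetical, and whether a given database loader undoes the escaping is an assumption to check separately.

```python
import csv
import io

def roundtrip(value):
    """Write value as one CSV field, then read it back.

    Returns (raw_line, parsed_value) so the escaping can be inspected.
    """
    buf = io.StringIO()
    csv.writer(buf).writerow(["NY01075760", value])
    raw_line = buf.getvalue()
    parsed = next(csv.reader(io.StringIO(raw_line)))[1]
    return raw_line, parsed

# Label text as quoted above, comma and quotes left verbatim.
label = "38°42'20\"N, 83°08'25'W"
raw_line, parsed = roundtrip(label)
```

Here raw_line holds the field wrapped in quotes with the inner quote doubled, while parsed equals the original label text, comma included. A loader that splits on commas without honoring the quotes would not recover it.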
Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field.  Example:  verbatimCoordinates in NY01075782_lg.csv includes the period at the end.  NY01075780_lg.csv does not include the period.
Gold Parsed NY01075761_lg.txt corrects a Gold OCR error by adding the 1 to the end of 0107576.  The field should be corrected in the Gold OCR, but until that is done, the parsing should be verbatim (see below under Gold OCR Errors).
'''Gold OCR Errors'''
NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.


== Parameters ==