Hackathon Challenge: Difference between revisions



NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (umlaut "ü" converted to "u"). We may want to do this, but if we do, it should be standardized and consistent across all the labels. Same for NY01075791_lg.txt, and several others in the series.
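If the conversion is kept, one way to make it consistent is to apply the same folding rule to every label rather than editing files by hand. The following is a minimal sketch, not the method used to produce the existing files; the function name fold_diacritics is hypothetical, and it assumes plain Unicode decomposition is the desired rule.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Replace accented letters with their base letters (e.g. "ü" -> "u")."""
    # NFKD splits "ü" into "u" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("Müll"))  # prints: Mull
```

Letters that have no Unicode decomposition (for example "ø") pass through unchanged, so they would need an explicit mapping if they should be converted too.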
===========================================================
There are more errors in gold csv files. (Qianjin)
NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998
NY01075760_lg no datasetName
NY01075765_lg verbatimEventDate (Feb. 1898), it should be verbatimEventDate ( Feb 1898.)
NY01075766_lg decimalLatitude (White Horse Beach, between Manomet Pt. and Rocky Pt., Plymouth area), it should be locality or habitat; no catalogNumber
NY01075767_lg verbatimEventDate format
NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)
NY01075768_lg country (canada), it should be (ca.)
NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)
NY01075770_lg habitat (Host determined by A. R. Grant)
NY01075771_lg verbatimCoordinates mixed with verbatimLocality
NY01075779_lg habitat concatenation
NY01075780_lg NEW YOUR BOTANICAL GARDEN
NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.
NY01075797_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075805_lg stateProvince (South Carolina) in the csv file; but it is (S.C.) in the text file.
NY01075812_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075816_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075817_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075818_lg no scientificName
NY01075819_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075820_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075821_lg scientificName (null)
NY01075821_lg no scientificName
NY01075822_lg no scientificName
NY01075823_lg identifiedBy
TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation
TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).
TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file.
TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file. 
TENN-L-0000015_lg verbatimInstitution (TENNESSEE (TENN))
TENN-L-0000016_lg verbatimInstitution (HERBARIUM OF THE UNIVERSITY OF TENNESSEE)
TENN-L-0000017_lg verbatimInstitution (University of Tennessee (TENN))
TENN-L-0000018_lg verbatimInstitution (University of Tennessee (TENN))
TENN-L-0000019_lg identifiedBy (Alt.Set.) in the csv file; verbatimEventDate (8 Aug 1954) is mixed with dateIdentified (8 Aug 1954)
TENN-L-0000021_lg verbatimInstitution ((TENN))
TENN-L-0000022_lg verbatimEventDate (23 July 1955) is mixed with dateIdentified (23 July 1955)
TENN-L-0000033_lg no catalogNumber in OCRed text file
TENN-L-0000036_lg verbatimEventDate (format)
TENN-L-0000045_lg recordNumber (null)
TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.
TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)
TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.
TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)
TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.
TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.
TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)
TENN-L-0000063_lg verbatimLocality contains scientific name
TENN-L-0000063_lg verbatimScientificName (Amherst)
TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)
TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp); verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)
TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)
TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40° N) is in text file.
TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.
TENN-L-0000077_lg identifiedBy (Date) in the csv file
TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.
TENN-L-0000083_lg no recordNumber in the csv file; dateIdentified (format)
TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)
TENN-L-0000084_lg scientificName (null)
TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file
TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.
WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.
WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.
WIS-L-0012026_lg no datasetName
WIS-L-0012038_lg no verbatimCoordinates
WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.
WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file
WIS-L-0012045_lg verbatimCoordinates concatenation
WIS-L-0012051_lg dateIdentified (format)
WIS-L-0012055_lg verbatimEventDate (format)
WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is (2003-July-19) in the csv file
WIS-L-0012056_lg dateIdentified (format)
WIS-L-0012057_lg no datasetName
WIS-L-0012064_lg verbatimCoordinates concatenation
WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file
WIS-L-0012074_lg county (null)
WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates
====================================================
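Several of the problem types listed above (missing fields, literal "null" values, and the same value appearing under two fields) could be flagged automatically before manual review. The following is a rough sketch under stated assumptions: the function name check_row, the choice of required fields, and the sample record EXAMPLE-0001 are all hypothetical and are not taken from the gold files.

```python
import csv
import io

def check_row(row, required=("catalogNumber", "scientificName", "datasetName")):
    """Return a list of human-readable problems found in one gold record."""
    problems = []
    # Required fields that are absent or empty.
    for field in required:
        if not (row.get(field) or "").strip():
            problems.append(f"no {field}")
    # Fields holding the literal string "null".
    for field, value in row.items():
        if (value or "").strip().lower() == "null":
            problems.append(f"{field} (null)")
    # The same non-empty value appearing under two different fields.
    seen = {}
    for field, value in row.items():
        key = (value or "").strip()
        if not key or key.lower() == "null":
            continue
        if key in seen:
            problems.append(f"{seen[key]} ({key}) is mixed with {field} ({key})")
        else:
            seen[key] = field
    return problems

# Hypothetical record, invented only to exercise the three checks.
sample = io.StringIO(
    "catalogNumber,scientificName,datasetName,verbatimEventDate,dateIdentified\n"
    "EXAMPLE-0001,null,,8 Aug 1954,8 Aug 1954\n"
)
for row in csv.DictReader(sample):
    for problem in check_row(row):
        print(row["catalogNumber"], problem)
```

A repeated value is only a hint: a collector who also identified the specimen will legitimately appear in both recordedBy and identifiedBy, so flagged rows still need to be compared against the label image.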

