Dataset Errata: Difference between revisions

New Errors 2/27/13


=== Gold Parsing Errors ===

=== Lichen Gold ===

==== verbatimLatitude and verbatimLongitude ====

'''Many of the Lichen Gold labels''' have '''verbatimLatitude''' and '''verbatimLongitude''', but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. ('''Daryl''') <br> ('''Bryan:''' I think the decimal values were a "bonus", but I could be wrong. If we choose to do this later, it might be easier to pre-fill as many fields as we can using your algorithm.) <br> ('''Ed''': Verbatim fields contain verbatim results. No lichen labels have DwC-compliant decimal coordinates. Likewise, no labels have DwC-compliant event dates, thus '''you probably want to use only verbatim fields for stats'''.) Check this with Alex --[[User:Dpaul|Dpaul]] 16:33, 12 June 2013 (EDT)
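
As an illustration of what pre-filling decimalLatitude and decimalLongitude from a verbatim value could involve, here is a minimal sketch. It is not the algorithm Bryan refers to; the function name <code>dms_to_decimal</code> and the pattern are assumptions made for this example, and the pattern only handles the degrees, minutes, seconds and hemisphere layout seen in examples such as 38°42'20"N.

```python
import re

# Hypothetical pattern: degrees, minutes, seconds, then a hemisphere letter.
DMS_PATTERN = re.compile(
    r"""(?P<deg>\d{1,3})°\s*(?P<min>\d{1,2})'\s*(?P<sec>\d{1,2}(?:\.\d+)?)"\s*(?P<hem>[NSEW])"""
)


def dms_to_decimal(text):
    """Return decimal degrees for the first DMS coordinate in text, or None."""
    match = DMS_PATTERN.search(text)
    if match is None:
        return None  # leave the derived field empty rather than guess
    degrees = int(match.group("deg"))
    minutes = int(match.group("min"))
    seconds = float(match.group("sec"))
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    if match.group("hem") in "SW":
        value = -value
    return round(value, 6)
```

A value that does not match the pattern returns <code>None</code>, so the verbatim field stays the only source of truth for that record.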


==== elevation verbatimElevation ====

This is open to debate, but I think elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. ('''Daryl''') <br> ('''Bryan:''' It is odd not to have "elevation". I agree with the use of verbatimElevation. If "elevation" is filled, it is numeric.) <br> ('''Deb''': What are the ramifications then? Does the lichen set need to be fixed in this regard? Or should we just ignore derived columns and expect letters like m or mi or ft in the verbatimElevation field?) --[[User:Dpaul|Dpaul]] 16:40, 12 June 2013 (EDT)
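
To make the distinction between verbatimElevation and a numeric field in meters concrete, here is a small sketch. The function name <code>elevation_in_meters</code> and the set of accepted units are assumptions for this example; it handles only "m" and "ft" and returns nothing for other units (such as "mi") instead of guessing.

```python
import re

# Assumed unit table for this sketch; other units are deliberately unsupported.
FEET_TO_METERS = 0.3048
UNIT_FACTORS = {"m": 1.0, "ft": FEET_TO_METERS}

# A number (optionally with thousands separators) followed by a known unit.
ELEVATION_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(m|ft)\b", re.IGNORECASE)


def elevation_in_meters(verbatim_elevation):
    """Return a numeric elevation in meters, or None if it cannot be parsed."""
    match = ELEVATION_PATTERN.search(verbatim_elevation)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    return round(number * UNIT_FACTORS[unit], 1)
```

The verbatim string is never modified, so "750 m" can stay in verbatimElevation while 750.0 goes into a numeric field.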
 
==== Gold Parsed Country errors ====

Inconsistency in the Gold Parsed labels for Country. If a US state is listed as the state, the label doesn't always give the name of the country, though it is obviously the USA. Some Gold Parsed results leave it blank; some fill it in with "USA" or "United States", though neither of these is on the label. I think it is valid to fill it in, but it should be consistent. ('''Daryl''') <br> ('''Bryan''': I think for Gold the field should not be filled in if it is not on the label.)

Many Gold Parsed Tennessee lichen labels have country errors. Examples:

('''Deb''': If there are many '''country''' errors on these, where the error is one of putting the country in by deducing it because the state is in the USA, removing the value will take some time to fix, as every single record will need to be verified. --[[User:Dpaul|Dpaul]] 16:40, 12 June 2013 (EDT))

-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. ('''Daryl''') <br> ('''Bryan''': Agreed. Should be fixed to match the label.) <br> ('''Ed''': Fixed. Note that TENN-L-0000035_lg.txt has "U. S. A. " with spaces, so that format was conserved.)

-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. ('''Daryl''') <br> ('''Bryan''': Agreed. Should be fixed to match the OCR label.) <br> ('''Ed''': Fixed; the country field had the county value.)

==== dateIdentified or verbatimEventDate errors ====

Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:

-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format. Verbatim would be: Nov. 12, 1939. DarwinCore would be: 1939-11-12. Listed is: 1939-November-12. <br> ('''Bryan''': I think Excel may have imposed its own format and changed the records. If the column type is set to "text", Excel will not transform it to a new format.) <br> ('''Ed''': The verbatimEventDate (not dateIdentified) is in the correct format, but if you open the file in Excel it will convert the display to match the program's defaults.)

-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963. <br> ('''Bryan''': Agreed. Should be fixed to match the label.) <br> ('''Ed''': That is the collection date, not the dateIdentified. The format is correct in verbatimDate.)

-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). ('''Daryl''') <br> ('''Bryan''': Agreed. Should be fixed to match the label.)
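
The difference between a verbatim date and its DarwinCore (ISO 8601) form can be checked programmatically. The sketch below is only an illustration: the function name <code>verbatim_date_to_iso</code> and the list of candidate formats are assumptions that cover the label styles quoted above, and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Assumed label styles: "Nov. 12, 1939", "3 Feb. 1963", "8 Aug 1954", "19 April 1998".
CANDIDATE_FORMATS = ("%b. %d, %Y", "%d %b. %Y", "%d %b %Y", "%d %B %Y")


def verbatim_date_to_iso(verbatim_date):
    """Return YYYY-MM-DD for a recognised verbatim date, or None otherwise."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(verbatim_date.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.date().isoformat()
    return None  # partial dates such as "July 1979" are left alone
```

Because the verbatim string is only read, never rewritten, this kind of check avoids the silent reformatting that a spreadsheet program can apply when a date column is not typed as text.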
==== comma, space, period, apostrophe in verbatimCoordinates errors ====

Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma-separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.) <br> ('''Bryan''': Agreed. Should be fixed to match the label, except that the doubled double quote is, I think, needed for CSV readers to identify fields. I am not sure.)

Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates. <br> ('''Bryan:''' Agreed. Should be fixed to match the label as best as possible. If it is not clear, follow the OCR file.)

Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period. (Bryan: It could go either way, but I think for consistency throughout we should keep the period at the end of anything. Whenever there is a sentence with a period, as in "One mile east of Dodge City.", we would not think of removing the period. In Gold we should treat it as verbatim. If we do Platinum, it could be removed.)
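
On the doubled double quote noted for NY01075760_lg.csv: in quote-delimited CSV, a double quote inside a quoted field is conventionally written as two double quotes, and a CSV reader turns it back into one. The sketch below uses Python's standard <code>csv</code> module to show the round trip with the label value quoted above; whether the Gold files were produced this way is an assumption.

```python
import csv
import io

# The verbatim value as it appears on the label (comma and quotes included).
verbatim = "38°42'20\"N, 83°08'25'W"

# Write one row; the writer quotes the field and doubles the embedded quote.
buffer = io.StringIO()
csv.writer(buffer).writerow(["NY01075760", verbatim])
raw_line = buffer.getvalue()

# Read the same line back; the reader restores the single double quote.
row = next(csv.reader(io.StringIO(raw_line)))

print(raw_line.strip())
print(row[1] == verbatim)
```

If a database loader reads the file with a CSV-aware parser, the doubling should be invisible; problems are more likely when the file is split on commas by hand.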
Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series. (Bryan: The OCR messed up. Gold should fix OCR errors, so the umlaut should stay.)


'''Gold Parsed CSV Files''' There are more errors in the Gold CSV files. (Qianjin) '''(Bryan: I agree with Qianjin's edits except as noted below)'''


NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998  
NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)  


NY01075768_lg country (canada), it should be (ca.)


NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)  
NY01075771_lg verbatimCoordinates mixed with verbatimLocality  


NY01075779_lg habitat concatenation (Bryan: "on Protoblastenia rupestris" appears before the location and habitat section of the label. However, the habitat field says "dolomite rock along lake shore and adjacent Thuja forest; on Protoblastenia rupestris". There was a period after "forest". The period was removed and a ";" added. Then the "on Protoblastenia rupestris" from the earlier part of the label was concatenated.)


NY01075780_lg NEW YOUR BOTANICAL GARDEN (Bryan: The label said "GARDEN". The OCR said "CARDEN". Silver should be "CARDEN"; Gold should be "GARDEN".)


NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.  
NY01075822_lg no scientificName  


NY01075823_lg identifiedBy (Bryan: ?? I do not see the problem)


TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation  
TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).  


TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file. (Bryan: On the label the word is hyphenated and split across lines. So, if we ignore the newline, the word is "apria -cas" and not what Qianjin listed. So, following the rule for Gold of making perfect OCR, character by character, the csv should say "apria -cas".)


TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.  
NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (converted umlaut "ü" to "u"). We may want to do this, but if we do, it should be standardized and consistent across all the labels. Same for NY01075791_lg.txt, and several others in the series.
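
If umlauts are ever converted, one way to keep the treatment consistent is to apply the same normalization everywhere and to flag values that differ only by diacritics. The following is a rough sketch using Python's standard <code>unicodedata</code> module; the function names are hypothetical and this is not a procedure the project has adopted.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, so that "Müll" becomes "Mull"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_by_diacritics(a, b):
    """True when two values are unequal but match once diacritics are removed."""
    return a != b and strip_diacritics(a) == strip_diacritics(b)
```

A check like <code>differs_only_by_diacritics</code> separates the "Müll" versus "Mull" cases from genuine OCR substitutions such as "GARDEN" versus "CARDEN".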


<br> '''Silver Parsed CSV Files''' '''(Bryan: I do not get most of these. There should be OCR errors in Silver. We do need to stay true to the OCR output.)'''

"Silver Parsed CSV Files" There were some errors in the Silver CSV dataset. (Steven C.)
NY01075761_lg misspelling in verbatimScientificName  


NY01075762_lg misspelling in habitat, misspelling in verbatimLocality (Bryan: I think the problem is that habitat is concatenated with the substrate "Abies [7a[5amzfem—Betu[a papyrzfem forest over granite adjacent to waterfall, parasite on Peltigera scabrosa". The parasite part is from another part of the label. It should be on a new row.)


NY01075764_lg misspelling in units for verbatimElevation  