Dataset Errata

From iDigBio

Errors noted in various files

New Errors 2/27/13

Gold Parsing Errors


Lichen Gold


verbatimLatitude and verbatimLongitude


Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl)
(Bryan: I think the decimal values were a "bonus", but I could be wrong. If we choose to do this later, it might be easier to pre-fill as many fields as we can using your algorithm.)
(Ed: Verbatim fields contain verbatim results. No lichen labels have DwC-compliant decimal coordinates. Likewise, no labels have DwC-compliant event dates, so you probably want to use only the verbatim fields for stats.) Check this with Alex --Dpaul 16:33, 12 June 2013 (EDT)
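
If decimalLatitude and decimalLongitude are ever pre-filled from the verbatim fields, the conversion could look roughly like the sketch below. This is not the algorithm Bryan refers to; the function name verbatim_to_decimal and the regular expression are assumptions made for illustration, and only the simple degree/minute/second patterns quoted on this page are handled.

```python
import re

# Hypothetical pattern: degrees, optional minutes, optional seconds, hemisphere.
# Covers forms seen on this page such as 41°11'N, 38°42'20"N and 60° 33.579'N.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d+(?:\.\d+)?)\s*°
        \s*(?:(?P<min>\d+(?:\.\d+)?)\s*')?
        \s*(?:(?P<sec>\d+(?:\.\d+)?)\s*")?
        \s*(?P<hem>[NSEW])\.?\s*$""",
    re.VERBOSE,
)


def verbatim_to_decimal(text):
    """Return decimal degrees for one verbatim coordinate, or None if it does not parse."""
    m = _DMS.match(text)
    if m is None:
        return None
    value = float(m.group("deg"))
    value += float(m.group("min") or 0) / 60
    value += float(m.group("sec") or 0) / 3600
    if m.group("hem") in "SW":  # south and west are negative
        value = -value
    return round(value, 6)
```

Values that do not fit the pattern (for example "Lat. 40 N") return None, so they can be left blank or reviewed by hand rather than guessed.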

elevation verbatimElevation

This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl)
(Bryan: Odd to not have "elevation". I agree with the use of verbatimElevation. If "elevation" is filled, it is numeric.)
(Deb: What are the ramifications then? Does the lichen set need to be fixed in this regard? or just ignore derived columns and expect letters like m or mi or ft in verbatimElevation field?)--Dpaul 16:40, 12 June 2013 (EDT)
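
One way a parsing program could derive a numeric value in meters from verbatimElevation is sketched below. The function name elevation_in_meters and the unit table are assumptions for illustration; ranges, thousands separators, and units other than m, ft, and mi are not handled.

```python
import re

# Conversion factors to meters for the units mentioned in the discussion above.
METERS_PER_UNIT = {"m": 1.0, "ft": 0.3048, "mi": 1609.344}

# A number followed by one of the known unit abbreviations, e.g. "ca. 240 m".
_ELEV = re.compile(r"(\d+(?:\.\d+)?)\s*(m|ft|mi)\b", re.IGNORECASE)


def elevation_in_meters(verbatim):
    """Return the elevation in meters as a float, or None if no number+unit is found."""
    m = _ELEV.search(verbatim)
    if m is None:
        return None
    return round(float(m.group(1)) * METERS_PER_UNIT[m.group(2).lower()], 1)
```

A bare number with no unit returns None here on purpose, since the label does not say whether it is meters or feet.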

Gold Parsed Country errors

Inconsistency in the Gold Parsed labels for Country. If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA. Some Gold parsed results leave it blank, some fill it in with "USA", or "United States", though neither of these are on the label. I think it is valid to fill it in, but it should be consistent. (Daryl)
(Bryan: I think for Gold the field should not be filled in if it is not on the label.)
(Deb: Note, to be clear: the country is often in the DatasetName, and from that most (all?) of us put the country listed there in the dwc:country field. This seems fine and right to me.)
(Deb: A different issue is putting in the country when it's not listed anywhere on the label -- that's indeed inference -- and we ought to try and clean that up where it happened. IMHO. --Dpaul 21:00, 13 June 2013 (EDT)).


Many Gold Parsed Tennessee lichen labels have country errors.
Examples:

(Deb: if there are many country errors on these, where the error is one of putting country in by deducing it because the state is in the USA, ...removing the value will take some time to fix as every single record will need to be verified. --Dpaul 16:40, 12 June 2013 (EDT))

-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods).
Same with Gold Parsed TENN-L-0000035_lg.csv and others.(Daryl)  
(Bryan: Agreed. Should be fixed to match the label.)
(Ed: fixed, note that TENN-L-0000035_lg.txt has "U. S. A. " with spaces, thus conserved format)

-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl)
(Bryan: Agreed. Should be fixed to match the OCR label.)
(Ed: Fixed, country had county value)


Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:

dateIdentified or verbatimEventDate errors

-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim, nor standard DarwinCore format: Verbatim would be: Nov. 12, 1939, DarwinCore would be: 1939-11-12, Listed is: 1939-November-12.
(Bryan: I think Excel may have imposed its own format and changed the records. If the column type is set to "text", Excel will not transform to a new format.)
(Ed: The verbatimEventDate (not dateIdentified) is in the correct format, but if you open the file in Excel it will convert the display to match the program's defaults.)
Deb - verified verbatimEventDate="Nov. 12, 1939" in csv --Dpaul 21:21, 13 June 2013 (EDT)
-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963 
(Bryan: Agreed. Should be fixed to match the label.)
(Ed: That is the collection date, not the dateIdentified. Format is correct in verbatimDate)
Deb: this is a confusing label! The line begins with "Det.:Barbara Moore", and this is followed by "Date: 3 Feb 1963".
(A person who knows this collector's labels and habits would know whether the date represents the collected or the identified date. An outsider may very well conclude eventDate (date collected) is missing.) --Dpaul 21:21, 13 June 2013 (EDT)


-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl) 
(Bryan: Agreed. Should be fixed to match the label.)
Deb: in csv, dwc:verbatimEventDate=8 Aug 1954
dwc:eventDate=1954-08-08
dwc:dateIdentified=1954-Aug-8
It's this last field that's funky, and I've changed it to be 1954-08-08 --Dpaul 21:32, 13 June 2013 (EDT)
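
Values such as 1939-November-12 and 1954-Aug-8 can be caught without opening the file in Excel. The following is a minimal sketch, assuming the derived date columns should be strict YYYY-MM-DD; the function names are made up for illustration, and month-name parsing depends on an English locale.

```python
from datetime import datetime


def is_iso_date(value):
    """True only for a zero-padded YYYY-MM-DD calendar date."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime accepts "1954-8-8", so compare against the padded form.
    return parsed.strftime("%Y-%m-%d") == value


def normalize_date(value):
    """Convert year-month-day variants with month names to YYYY-MM-DD, else None."""
    for fmt in ("%Y-%m-%d", "%Y-%b-%d", "%Y-%B-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

This only applies to eventDate and dateIdentified; the verbatim fields should stay exactly as they appear on the label.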

comma, space, period, apostrophe in verbatimCoordinates errors

Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.) 
(Bryan: Agreed. Should be fixed to match the label, except that I think the doubled double quote is needed for CSV readers to identify fields. I am not sure.)

Fixed --Dpaul 21:49, 13 June 2013 (EDT) The csv row now reads:
NY01075760_lg.csv:Richard C. Harris,52766,"38°42'20""N 83°08'25""W",21 May 2006,2006-05-21,,Scioto,Ohio,U.S.A.,Polycoccum minutulum Kocourková & 'F. Berger,"Shawnee State Forest, along Pond Lick Run N of Forest Road 1, 0.35 mi SW of SR 125",sandstone boulders along stream and adjacent oak woods.,on Trapeliopsis placodioides Coppins & James,ca. 190 m,,,"38°42'20""N",83°08'25'W,1075760,New York Botanical Garden,,Polycoccum minutulum Kocourková & F. Berger,,,,
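
On the question of the doubled double quote: in standard CSV quoting, a field that contains a comma or a double quote is wrapped in double quotes, and any double quote inside it is written twice. A CSV-aware reader turns it back into a single quote, so the verbatim value survives a round trip. A short sketch with Python's csv module, using the label value quoted above (the three-column row is a shortened example, not the full record):

```python
import csv
import io

# Verbatim coordinates as given on the label, comma and quote marks included.
coords = "38°42'20\"N, 83°08'25'W"
original = ["Richard C. Harris", "52766", coords]

# Write one row; the writer quotes the field and doubles the embedded quote.
buf = io.StringIO()
csv.writer(buf).writerow(original)
line = buf.getvalue()

# Read the same text back; the reader restores the single double quote.
row = next(csv.reader(io.StringIO(line)))
```

Problems are only likely when a file is split on commas by hand or loaded by a tool that does not follow this quoting rule.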


Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates. 
(Bryan: Agreed. Should be fixed to match the label as best as possible. If it is not clear follow the OCR file.)

Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end; NY01075780_lg.csv does not. (Bryan: It could go either way, but I think for consistency throughout we should keep the period at the end of anything. Whenever there are sentences with a period we keep them, as in "One mile east of Dodge City."; we would not think of removing the period. In gold we should treat it as verbatim. If we do platinum it could be removed.)

Gold Parsed NY01075761_lg.txt corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until done so, the parsing should be verbatim (see below under Gold OCR Errors). (Bryan: hmmm. Gold should be as if the OCR engine read the label with no mistakes. Silver would leave it as 0107576, but in Gold we should put what was on the label. That would include the "1".)

Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file. (Bryan: Agreed. Should be fixed to match the label.)

Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present on the .txt file. (Bryan: Agreed. Should be fixed to match the Label.)

Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio" (Bryan: Agreed. Should be fixed to match the label.)

Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality.(Bryan: Agreed. Should be fixed to match the Label.)

Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...? (Bryan: This is complex, but in this case I think it should be "Town of Peru". The bald "Peru" could be read as an error, a misplacement of "Country". Likely the label author was worried about the same thing and included "Town of". In both cases include what is on the label.)

Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA" (Bryan: Agreed. Should be assigned to "country".)

Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label. (Bryan: Agreed. Should be fixed to match the label.)

Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it happens on some labels and not on others. (Bryan: Agreed. Should be fixed to match the label. I think if the OCR had been perfect the space would not be in the OCR file, so it is a tough call.)

Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series. (Bryan: The OCR messed up. Gold should fix OCR errors, so the umlaut should stay.)


Gold Parsed CSV Files

Lichen NY

There are more errors in gold csv files. (Qianjin)
(Bryan: I agree with Qianjin's edits except as noted below)

NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998

From Deb: path /home/aocr/webroot/datasets/lichens/gold/parsed the csv file, opened in Notepad++ has 19 April 1998, encoding is ANSI --Dpaul 21:20, 30 June 2013 (EDT)
here's what it looks like when opened in Notepad++
dwc:recordedBy, dwc:recordNumber, dwc:verbatimCoordinates, dwc:verbatimEventDate, dwc:eventDate, dwc:municipality, dwc:county, dwc:stateProvince, dwc:country, aocr:verbatimScientificName, dwc:verbatimLocality, dwc:habitat, dwc:substrate, dwc:verbatimElevation, dwc:identifiedBy, dwc:dateIdentified, dwc:verbatimLatitude, dwc:verbatimLongitude, dwc:catalogNumber, aocr:verbatimInstitution, dwc:datasetName, dwc:scientificName, dwc:decimalLatitude, dwc:decimalLongitude, dwc:fieldNotes, dwc:sex
Richard C. Harris,42164,"41°11'N, 74°08'W",19 April 1998,1998-04-19,,Rockland,NEW YORK,U.S.A.,,"Harriman State Park, along Woodtown Road West near dam at S end of Lake Sebago along Seven Lakes Drive",mixed hardwood-hemlock forest with granitic erratics.,on Trapelia placodioides Coppins & P. James,ca. 240 m,,,41°11'N,74°08'W,01075759,New York Botanical Garden,Lichens of New York State,Polycoccum minutulum Kocourkova & F. Berger,,,,
here's what it looks like when I open it at the command line with text editor (vi)
dwc:recordedBy, dwc:recordNumber, dwc:verbatimCoordinates, dwc:verbatimEventDate, dwc:eventDate, dwc:municipality, dwc:county, dwc:stateProvince, dwc:country, aocr:verbatimScientificName, dwc:verbatimLocality, dwc:habitat, dwc:substrate, dwc:verbatimElevation, dwc:identifiedBy, dwc:dateIdentified, dwc:verbatimLatitude, dwc:verbatimLongitude, dwc:catalogNumber, aocr:verbatimInstitution, dwc:datasetName, dwc:scientificName, dwc:decimalLatitude, dwc:decimalLongitude, dwc:fieldNotes, dwc:sex
Richard C. Harris,42164,"41°11'N, 74°08'W",1998-04-19,4/19/1998,,Rockland,NEW RK,U.S.A.,,"Harriman State Park, along Woodtown Road West near dam at S end of Lake Sebago along Seven Lakes Drive",mixed hardwood-hemlock forest with granitic erratics.,on Trapelia placodioides Coppins & P. James,ca. 240 m,,,41°11'N,74°0W,01075759,New York Botanical Garden,Lichens of New York State,Polycoccum minutulum Kocourkova & F. Berger,,,,


NY01075760_lg no datasetName

FIXED in /home/aocr/webroot/datasets/lichens/gold/parsed --Dpaul 23:18, 30 June 2013 (EDT)

NY01075765_lg verbatimEventDate (Feb. 1898), it should be verbatimEventDate ( Feb 1898.)

FIXED in /home/aocr/webroot/datasets/lichens/gold/parsed --Dpaul 23:18, 30 June 2013 (EDT)

NY01075766_lg decimalLatitude (White Horse Beach, between Manomet Pt. and Rocky Pt., Plymouth area), it should be locality or habitat; no catalogNumber

FIXED in /home/aocr/webroot/datasets/lichens/gold/parsed --Dpaul 23:18, 30 June 2013 (EDT)
NOTE since there's no host field, the shell of Balanus balanoides is in habitat which really is not right --Dpaul 23:18, 30 June 2013 (EDT)

NY01075767_lg verbatimEventDate format

FIXED in /home/aocr/webroot/datasets/lichens/gold/parsed --Dpaul 23:18, 30 June 2013 (EDT)

NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)

See error above, Label reads July 1979 I fixed the verbatimEventDate to match.--Dpaul 23:18, 30 June 2013 (EDT)

NY01075768_lg country (canada), it should be (ca.) 

Fixed verbatim locality to include the ca. (iow, about) --Dpaul 23:18, 30 June 2013 (EDT)

NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)

fixed habitat to read: on Acmaea digitalis Eschsch. Host determined by A. R. Grant.
removed A. R. Grant. from identifiedBy (she or he identified the host, not the lichen) --Dpaul 23:18, 30 June 2013 (EDT)

NY01075770_lg habitat (Host determined by A. R. Grant)

see above --Dpaul 23:18, 30 June 2013 (EDT)

NY01075771_lg verbatimCoordinates mixed with verbatimLocality

verbatimLocality -includes all information in the locality -- looks correct to me.--Dpaul 23:18, 30 June 2013 (EDT)
verbatim coordinates on this record probably ought to be blank since the collector essentially provided 3 different sets of lat / lon values.--Dpaul 23:18, 30 June 2013 (EDT)
I did not change this record.--Dpaul 23:18, 30 June 2013 (EDT)

NY01075779_lg habitat concatenation (Bryan: "on Protoblastenia rupestris" appears before the location and habitat section of the label. However, habitat says "dolomite rock along lake shore and adjacent Thuja forest; on Protoblastenia rupestris". There was a period after "forest". The period was removed and a ";" added. Then the "on Protoblastenia rupestris" from the earlier part of the label was concatenated.)

changed dwc:habitat to:
on Protoblastenia rupestris dolomite rock along lake shore and adjacent Thuja forest.

--Dpaul 23:19, 30 June 2013 (EDT)

NY01075780_lg NEW YORK BOTANICAL GARDEN (Bryan: the label said "GARDEN". The OCR said "CARDEN". Silver should be "CARDEN"; Gold should be "GARDEN".)

the OCR I see says "Garden" in /home/aocr/webroot/datasets/lichens/silver/ocr --Dpaul 23:19, 30 June 2013 (EDT)
fixed the Gold (changed from Carden to Garden), left the silver alone.--Dpaul 23:19, 30 June 2013 (EDT)

NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.

fixed in gold csv (path is /home/aocr/webroot/datasets/lichens/gold/parsed) --Dpaul 23:19, 30 June 2013 (EDT)

NY01075797_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed in gold csv (path is /home/aocr/webroot/datasets/lichens/gold/parsed) --Dpaul 23:19, 30 June 2013 (EDT)

NY01075805_lg stateProvince (South Carolina) in the csv file; but it is (S.C.) in the text file.

changed to S.C. in the csv --Dpaul 23:19, 30 June 2013 (EDT)

NY01075812_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075816_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075817_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075818_lg no scientificName

not null in the csv record I see in /home/aocr/webroot/datasets/lichens/gold/parsed
but the umlaut was missing from the txt file and the csv -- so I fixed that. --Dpaul 23:19, 30 June 2013 (EDT)

NY01075819_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075820_lg recordedBy( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075821_lg scientificName (null)

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075821_lg no scientificName

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075822_lg no scientificName

fixed --Dpaul 23:19, 30 June 2013 (EDT)

NY01075823_lg identifiedBy (Bryan: ?? I do not see the problem)

From Deb. We (the herb set) did put determinations on a separate line, as done in this file. We did not do it the same way, however. Need to discuss.--Dpaul 23:19, 30 June 2013 (EDT)
I did put in umlauts for the u's in the name Müll. (they are in image, not in csv or txt,but they should be).--Dpaul 23:19, 30 June 2013 (EDT)

TODO: fix all instances of Mull to Müll in txt and csv gold lichen at paths /home/aocr/webroot/datasets/lichens/gold/ocr and /home/aocr/webroot/datasets/lichens/gold/parsed --Dpaul 23:19, 30 June 2013 (EDT)

FIXED in ocr txt files and parsed csv files. --Dpaul 15:21, 1 July 2013 (EDT)
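
For reference, a bulk correction like the one in the TODO above could be sketched as follows. This is not the procedure that was actually used; the function name restore_umlaut is hypothetical. It only touches the whole word "Mull", so words such as "Mullein" are left alone, but a genuine "Mull" (for example a place name) would also be changed, so the replacement count should be reviewed per file.

```python
import re

# Whole-word match only, so longer words that start with "Mull" are not altered.
_MULL = re.compile(r"\bMull\b")


def restore_umlaut(text):
    """Return (corrected_text, number_of_replacements)."""
    return _MULL.subn("Müll", text)
```

When applying this to the txt and csv files, the files need to be read and written with the same encoding, otherwise the "ü" itself can end up garbled.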

Lichen TENN

TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation

NOT FIXED. I see no problem with this; the elevation is embedded in the locality. The relevant value is in the dwc:verbatimElevation (1500 m) field too. --Dpaul 15:47, 1 July 2013 (EDT)

TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).

verbatimLocality now contains ROCKY MOUNTAIN NATIONAL PARK, Exposure W, NW. Longs Peak, top of Trough and
habitat contains "Crystalline rocks. Exposure W, NW." --Dpaul 15:47, 1 July 2013 (EDT)

TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file. (Bryan: on the label the word is hyphenated and split across lines. So, if we ignore the "New Line", the word is "apria -cas" and not what Qianjin listed. So, following the rule for gold of making perfect OCR, character by character, the csv should say "apria -cas".)

FIXED (txt file is good as is; changed csv file to have a hyphen in the locality text string, reading ...'ad rupes apri-cas') --Dpaul 15:47, 1 July 2013 (EDT)

TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.

Not what I see in gold csv. I see recordedBy:H. Kashiwadani and Y. Endo and identifiedBy:H. Kashiwadani which is correct.
no changes made. gold txt file is correct too.--Dpaul 16:08, 1 July 2013 (EDT)

TENN-L-0000015_lg verbatimInstitution (TENNESSEE (TENN))

From Deb: verbatimInstitution looks fine in the csv=HERBARIUM OF THE UNIVERSITY OF TENNESSEE -- that's what's on the actual label and in the gold txt file. --Dpaul 16:08, 1 July 2013 (EDT)
the person transcribing the image, put the name listed on the barcode in the gold txt file and it reads University of Tennessee (TENN)
NOT FIXED - I did not adjust these, they seem interpreted correctly to me.--Dpaul 16:08, 1 July 2013 (EDT)

TENN-L-0000016_lg verbatimInstitution (HERBARIUM OF THE UNIVERSITY OF TENNESSEE)

No error here - interpretation correct afaikt in gold csv and gold txt. not changed. --Dpaul 16:08, 1 July 2013 (EDT)

TENN-L-0000017_lg verbatimInstitution (University of Tennessee (TENN))

FIXED gold csv, label verbatimInstitution reads: HERBARIUM OF THE UNIVERSITY OF TENNESSEE now gold csv reads the same, as does the gold txt file --Dpaul 16:08, 1 July 2013 (EDT)

TENN-L-0000018_lg verbatimInstitution (University of Tennessee (TENN))

FIXED gold csv, label verbatimInstitution reads: HERBARIUM OF THE UNIVERSITY OF TENNESSEE now gold csv reads the same, as does the gold txt file --Dpaul 16:08, 1 July 2013 (EDT)

TENN-L-0000019_lg identifiedBy (Alt.Set.) in the csv file; verbatimEventDate (8 Aug 1954) is mixed with dateIdentified (8 Aug 1954)

FIXED gold csv. On the original label, identifiedBy and dateIdentified are not provided. --Dpaul 16:22, 1 July 2013 (EDT)

TENN-L-0000021_lg verbatimInstitution ((TENN))

FIXED changed to:HERBARIUM OF THE UNIVERSITY OF TENNESSEE --Dpaul 16:22, 1 July 2013 (EDT)

TENN-L-0000022_lg verbatimEventDate (23 July 1955) is mixed with dateIdentified (23 July 1955)

FIXED, removed date from dateIdentified field as label appears to imply date collected and no way to be sure the identified date is the same. --Dpaul 16:22, 1 July 2013 (EDT)

TENN-L-0000033_lg no catalogNumber in OCRed text file

FIXED, put 67 in the recordNumber field; catalogNumber is present (TENN-L-0000033) in the gold csv file I am looking at in the /home/aocr/webroot/datasets/lichens/gold/parsed csv files. --Dpaul 16:22, 1 July 2013 (EDT)

TENN-L-0000036_lg verbatimEventDate (format)

FIXED to read: Nov. 3, 1956 --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000036_lg eventDate (format)

FIXED to read 1956-11-03 --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000045_lg recordNumber (null)

TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.

FIXED, changed dwc:stateProvince to literal label = Mont. --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)

FIXED, removed near from habitat --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.

FIXED, changed dwc:stateProvince to literal label = Mont. --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)

FIXED, put Alt.: About 3500 ft. in dwc:verbatimElevation --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.

GET ED TO CHECK. --Dpaul 16:49, 1 July 2013 (EDT)
(I think) this lichen has 3 identifications. One original in 1969, two by J P Dey (one in 1973-74 and another in 1980)... and so there ought to be 3 lines (not 2) which is what I did. --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.

From Deb: note that ! means agreement with determination present. But in the nature of "exact" and "verbatim" transcription and parsing, I'm not sure of the correct tactic. I'd think we only want to parse the name and not the ! --Dpaul 16:49, 1 July 2013 (EDT)
YOUR INPUT? --Dpaul 16:49, 1 July 2013 (EDT)

TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)

TENN-L-0000063_lg verbatimLocality contains scientific name

TENN-L-0000063_lg verbatimScientificName (Amherst)

TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)

TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp); verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)

TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)

TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40° N) is in text file.

TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.

TENN-L-0000077_lg identifiedBy (Date) in the csv file

TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.

TENN-L-0000083_lg no recordNumber in the csv file; DateIdentified (format)

TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)

TENN-L-0000084_lg scientificName (null)

TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file

TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.

Lichen WIS

WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.

WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.

WIS-L-0012026_lg no datasetName

WIS-L-0012038_lg no verbatimCoordinates

WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.

WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file

WIS-L-0012045_lg verbatimCoordinates concatenation

WIS-L-0012051_lg dateIdentified (format)

WIS-L-0012055_lg verbatimEventDate (format)

WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is (2003-July-19) in the csv file

WIS-L-0012056_lg dateIdentified (format)

WIS-L-0012057_lg no datasetName

WIS-L-0012064_lg verbatimCoordinates concatenation

WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file

WIS-L-0012074_lg county (null)

WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates (Qianjin)


Gold OCR Errors

NY01075761_lg.txt has catalogNumber as 0107576, omitting the 1 at the end.

WIS-L-0012026_lg.txt: Several errors: the "N" in Latitude is replaced with a "K"; a question mark appears instead of the apostrophe in Longitude; "Sandra Looman" is replaced with "Sandra Lcoman"; there are two dots after the date.

TENN-L-0000029_lg.txt adds a "1" to the scientificName ("Actinogyra muhlenbergii 1 (Ach.) Schol.").

NY01075791_lg.txt converted "Müll" on the original label NY01075791_lg.jpg to "Mull" (converted umlaut "ü" to "u"). We may want to do this, but if we do it should be standardized and consistent across all the labels. Same for NY01075792_lg.txt, and several others in the series.


Silver Parsed CSV Files (Bryan: I do not get most of these. There should be OCR errors in silver. We do need to stay true to the OCR output.) 

There were some errors in the Silver CSV dataset. (Steven C.)


NY01075760_lg character encoding in verbatimScientificName; typos in verbatimCoordinates

NY01075761_lg misspelling in verbatimScientificName

NY01075762_lg misspelling in habitat; misspelling in verbatimLocality (Bryan: I think the problem is that habitat is concatenated with the substrate: "Abies [7a[5amzfem—Betu[a papyrzfem forest over granite adjacent to waterfall, parasite on Peltigera scabrosa". The parasite part is from another part of the label. It should be on a new row.)

NY01075764_lg misspelling in units for verbatimElevation

NY01075765_lg character encoding in verbatimScientificName; removed extra period in verbatimEventDate

NY01075768_lg separated verbatimLocality data into two columns

NY01075769_lg misspelling in habitat

NY01075770_lg character encoding in verbatimScientificName; character encoding in habitat

NY01075773_lg misspelling in verbatimScientificName; misspelling in verbatimLocality

NY01075774_lg character encoding in verbatimScientificName

NY01075775_lg misspelling in country

NY01075776_lg character encoding in verbatimLocality

NY01075777_lg character encoding in country

NY01075779_lg character encoding in verbatimCoordinates

NY01075780_lg misspelling in verbatimInstitution; misspelling in verbatimLocality; removed coordinates in verbatimLocality

NY01075781_lg character encoding in verbatimElevation

NY01075782_lg separated verbatimLocality data into two columns; removed coordinates in verbatimLocality; character encoding in habitat

NY01075786_lg misspelling in verbatimScientificName

NY01075787_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality; misspelling in verbatimCoordinates; misspelling in habitat

NY01075788_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality; character encoding in verbatimCoordinates

NY01075789_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality; character encoding in verbatimCoordinates

NY01075790_lg misspelling in habitat; separated verbatimLocality data into three columns; removed coordinates in verbatimLocality

NY01075791_lg character encoding in verbatimScientificName

NY01075792_lg misspelling in verbatimLocality

NY01075794_lg misspelling in verbatimLocality

NY01075795_lg misspelling in verbatimLocality

NY01075802_lg character encoding in verbatimScientificName

NY01075803_lg created new identifiedBy column; created new verbatimScientificName column; moved verbatimScientificName data from third row to new column

NY01075805_lg created new verbatimScientificName column; moved verbatimScientificName data from third row to new column

NY01075806_lg character encoding in verbatimScientificName

NY01075813_lg misspelling in verbatimLocality

NY01075814_lg misspelling in county; misspelling in verbatimLocality; removed coordinates in verbatimLocality; misspelling in habitat

NY01075817_lg moved verbatimScientificName data to scientificName; entered verbatimScientificName

NY01075818_lg misspelling in habitat

NY01075819_lg misspelling in recordedBy

NY01075821_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality; added coordinates to verbatimCoordinates

NY01075822_lg removed coordinates in verbatimLocality

NY01075823_lg moved identifiedBy and dateIdentified data up one row; created new verbatimScientificName column; moved verbatimScientificName data from third row to new column

NY01075827_lg misspelling in county; misspelling in verbatimLocality

NY01075828_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality

NY01075829_lg misspelling in habitat

NY01075831_lg misspelling in verbatimLocality; removed coordinates in verbatimLocality

NY01075837_lg misspelling in county

TENN-L-0000001_lg character encoding in occurrenceRemarks misspelling in habitat character encoding in verbatimLocality

TENN-L-0000002_lg character encoding in verbatimScientificName misspelling in habitat

TENN-L-0000004_lg misspelling in habitat misspelling in verbatimInstitution

TENN-L-0000005_lg misspelling in datasetName misspelling in occurrenceRemarks character encoding in verbatimLocality

TENN-L-0000006_lg misspelling in verbatimElevation edited verbatimEventDate

TENN-L-0000007_lg separated verbatimLocality into two columns misspellings in both verbatimLocality columns

TENN-L-0000009_lg character encoding in habitat character encoding in catalogNumber

TENN-L-0000010_lg separated verbatimLocality into two columns

TENN-L-0000012_lg character encoding in datasetName character encoding in occurrenceRemarks

TENN-L-0000013_lg misspelling in occurrenceRemarks misspelling in verbatimLocality

TENN-L-0000014_lg misspelling in datasetName misspelling in fieldNotes character encoding in verbatimLocality separated recordedBy into two columns

TENN-L-0000022_lg character encoding in recordedBy

TENN-L-0000027_lg character encoding in verbatimScientificName

TENN-L-0000028_lg character encoding in verbatimScientificName

TENN-L-0000029_lg misspelling in recordedBy

TENN-L-0000032_lg character encoding in verbatimScientificName

TENN-L-0000033_lg separated dataSetName into two columns separated fieldNotes into two columns

TENN-L-0000041_lg character encoding in datasetName misspelling in verbatimLocality

TENN-L-0000044_lg character encoding in datasetName

TENN-L-0000045_lg separated verbatimLocality into two columns

TENN-L-0000046_lg character encoding in datasetName

TENN-L-0000047_lg character encoding in datasetName

TENN-L-0000048_lg misspelling in verbatimLocality

TENN-L-0000049_lg separated verbatimLocality into two columns

TENN-L-0000051_lg character encoding in verbatimLocality

TENN-L-0000052_lg character encoding in verbatimScientificName character encoding in datasetName character encoding in habitat character encoding in verbatimLocality character encoding in recordedBy

TENN-L-0000053_lg character encoding in recordNumber

TENN-L-0000054_lg character encoding in datasetName

TENN-L-0000056_lg edited recordedBy

TENN-L-0000057_lg character encoding in verbatimLocality misspelling in verbatimInstitution

TENN-L-0000058_lg separated datasetName into two columns character encoding in verbatimInstitution

TENN-L-0000059_lg character encoding in stateProvince character encoding in verbatimScientificName character encoding in verbatimCoordinates misspelling in recordedBy

TENN-L-0000061_lg edited verbatimLocality misspelling in recordedBy

TENN-L-0000063_lg separated datasetName into two columns character encoding in identificationRemarks

TENN-L-0000064_lg character encoding in verbatimScientificName

TENN-L-0000065_lg character encoding in verbatimScientificName

TENN-L-0000068_lg edited habitat character encoding in verbatimInstitution

TENN-L-0000072_lg separated verbatimLocality into two columns misspelling in country edited verbatimScientificName character encoding in verbatimInstitution

TENN-L-0000073_lg misspelling in verbatimLocality character encoding in verbatimCoordinates misspelling in recordedBy

TENN-L-0000074_lg character encoding in recordedBy character encoding in verbatimScientificName character encoding in verbatimLocality

TENN-L-0000075_lg character encoding in datasetName misspelling in verbatimScientificName separated verbatimLocality into two columns character encoding in both verbatimLocality columns character encoding in verbatimCoordinates

TENN-L-0000076_lg misspelling in datasetName character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in both verbatimLocality columns character encoding in recordedBy

TENN-L-0000077_lg character encoding in county character encoding in verbatimLocality character encoding in catalogNumber

TENN-L-0000079_lg character encoding in verbatimInstitution

TENN-L-0000080_lg character encoding in catalogNumber

TENN-L-0000083_lg character encoding in verbatimScientificName

TENN-L-0000084_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in verbatimLocality

TENN-L-0000087_lg character encoding in recordNumber character encoding in habitat character encoding in verbatimLocality character encoding in verbatimInstitution

TENN-L-0000089_lg misspelling in country separated verbatimLocality into two columns misspelling in verbatimLocality misspelling in verbatimInstitution misspelling in datasetName

TENN-L-0000090_lg character encoding in verbatimInstitution

TENN-L-0000091_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in catalogNumber

TENN-L-0000093_lg edited verbatimLocality character encoding in catalogNumber

TENN-L-0000095_lg character encoding in verbatimScientificName edited country character encoding in verbatimLocality

TENN-L-0000097_lg character encoding in verbatimScientificName

TENN-L-0000098_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimEventDate character encoding in verbatimCoordinates

TENN-L-0000099_lg separated datasetName into two columns character encoding in stateProvince misspelling in verbatimScientificName character encoding in verbatimLocality character encoding in verbatimLatitude character encoding in catalogNumber

WIS-L-0011726_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in verbatimElevation character encoding in recordedBy

WIS-L-0011727_lg character encoding in verbatimScientificName separated verbatimLocality into two columns misspelling in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011728_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0011729_lg separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011730_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in habitat

WIS-L-0011731_lg character encoding in verbatimScientificName character encoding in identifiedBy separated verbatimLocality into two columns misspelling in associatedTaxa misspelling in verbatimElevation

WIS-L-0011732_lg separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0011733_lg character encoding in verbatimLocality character encoding in habitat character encoding in verbatimCoordinates

WIS-L-0011734_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in habitat character encoding in recordNumber separated verbatimLocality into two columns

WIS-L-0011736_lg character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012025_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012026_lg character encoding in verbatimScientificName

WIS-L-0012027_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012028_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat misspelling in verbatimElevation

WIS-L-0012029_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012030_lg character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012031_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat


WIS-L-0012032_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012033_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates misspelling in verbatimElevation

WIS-L-0012034_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012035_lg character encoding in verbatimScientificName separated verbatimLocality into two columns misspelling in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012036_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012037_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012039_lg character encoding in verbatimScientificName character encoding in verbatimLocality character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012040_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012041_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012042_lg character encoding in datasetName character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012043_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat separated verbatimLocality into two columns

WIS-L-0012044_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012045_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012046_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012047_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012048_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012049_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012050_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012051_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012052_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012053_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012054_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012055_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012056_lg separated verbatimLocality into two columns character encoding in habitat

WIS-L-0012057_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimCoordinates

WIS-L-0012058_lg separated verbatimLocality into two columns character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012059_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012060_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat

WIS-L-0012061_lg character encoding in verbatimScientificName separated verbatimLocality into two columns removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012062_lg character encoding in verbatimScientificName character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates character encoding in habitat character encoding in verbatimInstitution

WIS-L-0012063_lg character encoding in verbatimScientificName character encoding in verbatimLocality removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012064_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012065_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in habitat

WIS-L-0012067_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLatitude character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012068_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012069_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012070_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012071_lg character encoding in verbatimScientificName separated verbatimLocality into two columns removed coordinates in verbatimLocality character encoding in associatedTaxa

WIS-L-0012073_lg character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012074_lg character encoding in verbatimCoordinates character encoding in habitat misspelling in verbatimLocality

WIS-L-0012075_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012076_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimLongitude

WIS-L-0012077_lg character encoding in verbatimLocality character encoding in habitat

WIS-L-0012078_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012082_lg character encoding in verbatimScientificName character encoding in verbatimCoordinates character encoding in verbatimLatitude character encoding in verbatimEventDate character encoding in recordNumber

WIS-L-0012084_lg character encoding in verbatimScientificName separated verbatimLocality into two columns

WIS-L-0012085_lg character encoding in verbatimScientificName separated verbatimLocality into two columns character encoding in verbatimLongitude character encoding in verbatimCoordinates

WIS-L-0012086_lg character encoding in verbatimScientificName separated verbatimLocality into two columns


End New Errors


Errors noted below are fixed


D. Lafferty Label NY01075759_lg.txt has the authority (part of verbatimScientificName) as "Kocourková & F. Berger". Gold Parsed NY01075759_lg.csv has "Kocourkova & F. Berger", without the accent on the "a". (Or should we convert foreign characters to English characters?)
(Bryan: All "special" characters should be preserved by using UTF-8.)
(Ed: Accented "á" fixed)
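
A mismatch like this one can be flagged automatically. The following is a minimal Python sketch, not part of the hackathon tooling; the function names are hypothetical. It decomposes both strings (Unicode NFD), drops the combining marks, and reports pairs that differ only in their accents.

```python
import unicodedata

def strip_diacritics(text):
    """Return text with combining marks removed after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def differs_only_by_diacritics(a, b):
    """True when a and b are unequal but match once diacritics are dropped."""
    return a != b and strip_diacritics(a) == strip_diacritics(b)

label = "Kocourkov\u00e1 & F. Berger"   # as in the label text, with the accented a
parsed = "Kocourkova & F. Berger"       # as in the earlier Gold Parsed file
print(differs_only_by_diacritics(label, parsed))
```

A check of this kind only reports the difference; the verbatim value with the accent remains the one to keep.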

Gold label NY01075763_lg.txt has Pyrenidium actinellurn, should be Pyrenidium actinellum. Gold Parsed copies the error verbatim (as it should) and needs to be corrected if the .txt file is corrected.
/home/aocr/datasets/lichens/gold/outputs/human/NY01075763_lg.txt fixed --Dpaul 17:28, 26 February 2013 (EST)
/home/aocr/datasets/lichens/gold/parsed/human/NY01075763_lg.csv fixed --Dpaul 17:28, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/ocr/NY01075763_lg.txt fixed --Dpaul 16:33, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075763_lg.csv fixed --Dpaul 16:33, 27 February 2013 (EST)
datasets/lichens/gold/ocr/WIS-L-0012040_lg.txt: Longitude recorded as L49 (capitalized for clarity) instead of 149
/webroot/datasets/lichens/gold/ocr/WIS-L-0012040_lg.txt fixed --Dpaul 16:39, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/WIS-L-0012040_lg.csv fixed --Dpaul 16:39, 27 February 2013 (EST)

Unicode Reserved character (single quote)

The following files use the Unicode C1 control character 'PRIVATE USE TWO' (U+0092) as a single quote mark:

  • NY_00617142.txt
  • NY_01334334.txt
/webroot/datasets/herb/gold/ocr/NY_00617142.txt fixed --Dpaul 16:59, 27 February 2013 (EST)
/webroot/datasets/herb/gold/ocr/NY_01334334.txt fixed --Dpaul 16:59, 27 February 2013 (EST)
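
Byte 0x92 is the right single quotation mark in Windows-1252, so one plausible origin of U+0092 (an assumption; this page does not say how the files were produced) is Windows-1252 text that was decoded as Latin-1. The Python sketch below shows that effect and one possible repair, which reinterprets C1 control characters as Windows-1252 bytes. The function name is hypothetical, and the fix actually applied to the files above is not recorded here.

```python
def repair_c1_controls(text):
    """Reinterpret C1 control characters (U+0080-U+009F) as Windows-1252 bytes."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # this byte has no Windows-1252 mapping; leave it unchanged
        out.append(ch)
    return "".join(out)

# The same byte yields different characters under the two decodings.
raw = b"Smith\x92s"
as_latin1 = raw.decode("latin-1")   # contains U+0092
as_cp1252 = raw.decode("cp1252")    # contains U+2019
print(repair_c1_controls(as_latin1) == as_cp1252)
```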

Right single Quote

The following files contain the Unicode character U+2019, Right Single Quotation Mark:

  • datasets/lichens/gold/ocr/NY01075760_lg.txt
  • datasets/lichens/gold/ocr/NY01075761_lg.txt
  • datasets/lichens/gold/ocr/NY01075761_lg.txt
  • datasets/lichens/gold/ocr/NY01075762_lg.txt
  • datasets/lichens/gold/ocr/NY01075764_lg.txt
  • datasets/lichens/gold/ocr/NY01075768_lg.txt
  • datasets/lichens/gold/ocr/NY01075768_lg.txt
  • datasets/lichens/gold/ocr/NY01075770_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075771_lg.txt
  • datasets/lichens/gold/ocr/NY01075776_lg.txt
  • datasets/lichens/gold/ocr/NY01075777_lg.txt
  • datasets/lichens/gold/ocr/NY01075779_lg.txt
  • datasets/lichens/gold/ocr/NY01075779_lg.txt
  • datasets/lichens/gold/ocr/NY01075781_lg.txt
  • datasets/lichens/gold/ocr/NY01075785_lg.txt
  • datasets/lichens/gold/ocr/NY01075785_lg.txt
  • datasets/lichens/gold/ocr/NY01075786_lg.txt
  • datasets/lichens/gold/ocr/NY01075786_lg.txt
  • datasets/lichens/gold/ocr/NY01075787_lg.txt
  • datasets/lichens/gold/ocr/NY01075787_lg.txt
  • datasets/lichens/gold/ocr/NY01075788_lg.txt
  • datasets/lichens/gold/ocr/NY01075788_lg.txt
  • datasets/lichens/gold/ocr/NY01075789_lg.txt
  • datasets/lichens/gold/ocr/NY01075789_lg.txt
  • datasets/lichens/gold/ocr/NY01075797_lg.txt
  • datasets/lichens/gold/ocr/NY01075798_lg.txt
  • datasets/lichens/gold/ocr/NY01075812_lg.txt
  • datasets/lichens/gold/ocr/NY01075817_lg.txt
  • datasets/lichens/gold/ocr/NY01075818_lg.txt
  • datasets/lichens/gold/ocr/NY01075819_lg.txt
  • datasets/lichens/gold/ocr/NY01075820_lg.txt
  • datasets/lichens/gold/ocr/NY01075821_lg.txt
  • datasets/lichens/gold/ocr/NY01075821_lg.txt
  • datasets/lichens/gold/ocr/NY01075822_lg.txt
  • datasets/lichens/gold/ocr/NY01075828_lg.txt
  • datasets/lichens/gold/ocr/NY01075829_lg.txt
  • datasets/lichens/gold/ocr/NY01075830_lg.txt
  • datasets/lichens/gold/ocr/NY01075831_lg.txt
  • datasets/lichens/gold/ocr/TENN-L-0000059_lg.txt
  • datasets/lichens/gold/ocr/TENN-L-0000073_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011728_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011730_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0011736_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012033_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012035_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012039_lg.txt
  • datasets/lichens/gold/ocr/WIS-L-0012082_lg.txt
/webroot/datasets/lichens/gold/ocr above files in this directory all fixed --Dpaul 17:22, 27 February 2013 (EST)

Right Double Quote

The following files contain the Unicode character U+201D, Right Double Quotation Mark:

  • datasets/lichens/gold/ocr/WIS-L-0012053_lg.txt
    • fixed --Dpaul 15:51, 27 February 2013 (EST)
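
Lists like the ones in the three quote sections above can be produced by scanning the OCR text for the flagged code points. The following Python sketch is illustrative only; `FLAGGED`, `describe`, and `scan_text` are hypothetical names, and how the original lists were generated is not stated on this page.

```python
import unicodedata

# Code points flagged in the sections above.
FLAGGED = ("\u0092", "\u2019", "\u201d")

def describe(ch):
    """Label such as 'U+2019 RIGHT SINGLE QUOTATION MARK'."""
    # Control characters have no character name, so fall back to '<control>'.
    name = unicodedata.name(ch, "<control>")
    return f"U+{ord(ch):04X} {name}"

def scan_text(text):
    """Yield (line_number, description) once per flagged character found."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for ch in line:
            if ch in FLAGGED:
                yield line_no, describe(ch)

sample = "Smith\u0092s Creek\nO\u2019Brien"
for line_no, label in scan_text(sample):
    print(line_no, label)
```

To scan a file, read it with `open(path, encoding="utf-8")` and pass the contents to `scan_text`.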

Parse file errors

Inconsistency in Gold Parsed decimalLatitude and decimalLongitude in many labels. They are omitted entirely from the NYBG lichens and Tennessee lichens. Gold Parsed WIS-L-0011728_lg.csv has decimalLatitude and decimalLongitude rounded to 3 decimal digits (e.g. 60.467). WIS-L-0011729_lg.csv has decimalLatitude rounded to 2 decimal digits (60.15) and decimalLongitude rounded to 1 decimal digit (-152.6). This is typical of the variations found throughout the files. It's possible that trailing zeros were just stripped off, but this inconsistency makes it impossible to match all the labels with a parsing program.
Alex will change the metrics to avoid counting off for stripped trailing zeroes. --Dpaul 15:36, 27 February 2013 (EST)
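
One way to ignore stripped trailing zeros is to compare the values numerically rather than as strings. This is a minimal Python sketch under that assumption, not the actual metrics code; `same_decimal` is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation

def same_decimal(gold, candidate):
    """Compare two coordinate strings numerically, so 60.15 equals 60.150."""
    try:
        return Decimal(gold.strip()) == Decimal(candidate.strip())
    except InvalidOperation:
        # Not numeric: fall back to exact string comparison.
        return gold == candidate

print(same_decimal("60.15", "60.150"))
print(same_decimal("60.467", "60.47"))
```

This treats values that differ only in trailing zeros as equal, but values rounded to different precision (60.467 versus 60.47) still count as different.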
Inconsistency in capitalization of verbatim fields in many Gold Parsed lichens. Example: NY01075763_lg.csv. In the label and OCR text the county is in all caps as ST. FRANCOIS, but in NY01075763_lg.csv it is in title case: St. Francois. The state MISSOURI is in all caps in both the .txt and the .csv file. The scoring program is case sensitive, so any difference between the gold .csv and the program-generated .csv will be marked wrong.
Alex will change the metrics to be case-insensitive. --Dpaul 17:28, 26 February 2013 (EST)
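
A case-insensitive comparison can be sketched in Python as follows. This is an illustration only, not the metrics code, and `same_ignoring_case` is a hypothetical name. `str.casefold()` is used instead of `lower()` because it also handles non-English letters.

```python
def same_ignoring_case(gold, candidate):
    """True when the two field values match apart from letter case."""
    return gold.casefold() == candidate.casefold()

print(same_ignoring_case("ST. FRANCOIS", "St. Francois"))
```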
Gold Parsed NY01075759_lg.csv: verbatimEventDate is 1998-04-19, should be 19 April 1998.
/home/aocr/datasets/lichens/gold/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/home/aocr/datasets/lichens/silver/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075759_lg.csv fixed --Dpaul 17:35, 27 February 2013 (EST)
Gold Parsed NY01075759_lg.csv: eventDate is 4/19/1998, should be 1998-04-19 according to Darwin Core (http://rs.tdwg.org/dwc/terms/#eventDate).
/home/aocr/datasets/lichens/gold/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/home/aocr/datasets/lichens/silver/parsed/human/NY01075759_lg.csv fixed --Dpaul 18:06, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075759_lg.csv fixed --Dpaul 17:35, 27 February 2013 (EST)
/webroot/datasets/lichens/silver/parsed/NY01075759_lg.csv okay --Dpaul 17:35, 27 February 2013 (EST)
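
The eventDate conversion can be sketched in Python as below, assuming the input is always month/day/year as in the 4/19/1998 example. The function name `to_iso_event_date` is hypothetical. verbatimEventDate is left untouched, since it should carry the label text as written.

```python
from datetime import datetime

def to_iso_event_date(us_date):
    """Convert a month/day/year string such as '4/19/1998' to ISO 8601."""
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()

print(to_iso_event_date("4/19/1998"))
```

Input that does not match the assumed format raises `ValueError`, which is safer than silently guessing the order of day and month.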
Gold Parsed NY01075770_lg.csv omits collector number, but should be 852.
/home/aocr/datasets/lichens/gold/parsed/human/NY01075770_lg.csv fixed --Dpaul 18:18, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075770_lg.csv fixed --Dpaul 17:38, 27 February 2013 (EST)
Gold OCR NY01075786_lg.txt has "(Ach.) Mil'll. Arg.", but on the image label it is "(Ach.) Müll. Arg." This error is carried to the Gold Parsed .csv file (which should be corrected if the .txt file is corrected).
/home/aocr/datasets/lichens/gold/outputs/human/NY01075786_lg.txt fixed --Dpaul 18:28, 26 February 2013 (EST)
/home/aocr/datasets/lichens/gold/parsed/human/NY01075786_lg.csv fixed --Dpaul 18:28, 26 February 2013 (EST)
/webroot/datasets/lichens/gold/parsed/NY01075786_lg.csv fixed --Dpaul 17:52, 27 February 2013 (EST)
/webroot/datasets/lichens/gold/ocr/NY01075786_lg.txt fixed --Dpaul 17:52, 27 February 2013 (EST)


Label image NY01075760_lg.jpg had a speck of dirt next to "F. Berger", introducing an apostrophe as "Kocourkova & 'F. Berger" in the Gold OCR. Gold Parsed NY01075760_lg.csv corrected "Kocourkova & 'F. Berger" back to "Kocourkova & F. Berger", omitting the apostrophe. Probably a valid correction, but not in a verbatim field.
/home/aocr/webroot/datasets/lichens/gold/parsed/NY01075760_lg.csv changed gold parsed aocr:verbatimScientificName to include the apostrophe to be consistent for verbatim field. fixed --Dpaul 16:06, 27 February 2013 (EST)
