Data Quality Toolkit 2024: Difference between revisions

(9 intermediate revisions by 2 users not shown)
Line 9: Line 9:
This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook], GBIF's [https://data-blog.gbif.org/post/issues-and-flags/ data quality flags], and iDigBio's [https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags data quality flags].
This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook], GBIF's [https://data-blog.gbif.org/post/issues-and-flags/ data quality flags], and iDigBio's [https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags data quality flags].


If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[OpenRefine Data Quality Toolkit|OpenRefine]], [[Specify Data Quality Toolkit|Specify]], [[Symbiota Data Quality Toolkit|Symbiota]], [[TaxonWorks Data Quality Toolkit|TaxonWorks]].
If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[Specify Data Quality Toolkit|Specify]], [https://biokic.github.io/symbiota-docs/editor/quality/ Symbiota], [[TaxonWorks Data Quality Toolkit|TaxonWorks]]. Additional command-line tools can be found in Bob Mesibov's [https://www.datafix.com.au/darwin-core-checker/ Darwin Core Checker tool].


== Catalog Numbers and Other Identifiers==
== Catalog Numbers and Other Identifiers==
Line 17: Line 17:


'''Solutions:'''
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Duplicate Catalog Numbers|Arctos]]
* [https://handbook.arctosdb.org/documentation/catalog.html#catalog-number Arctos]
* [[Excel Data Quality Toolkit#Duplicate Catalog Numbers|Excel]]
* [[Excel Data Quality Toolkit#Duplicate Catalog Numbers|Excel]]
* [[OpenRefine Data Quality Toolkit#Duplicate Catalog Numbers|OpenRefine]]
* [[OpenRefine Data Quality Toolkit#Duplicate Catalog Numbers|OpenRefine]]
* [[Specify Data Quality Toolkit#Duplicate Catalog Numbers|Specify]]
* [[Specify Data Quality Toolkit#Duplicate Catalog Numbers|Specify]]
* [[Symbiota Data Quality Toolkit#Duplicate Catalog Numbers|Symbiota]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#duplicate-catalog-numbers Symbiota]
* [[TaxonWorks Data Quality Toolkit#Duplicate Catalog Numbers|TaxonWorks]]
* [[TaxonWorks Data Quality Toolkit#Duplicate Catalog Numbers|TaxonWorks]]


Line 32: Line 32:
* [[Arctos Data Quality Toolkit#Date Hasn't Happened Yet|Arctos]]
* [[Arctos Data Quality Toolkit#Date Hasn't Happened Yet|Arctos]]
* [[Excel Data Quality Toolkit#Date Hasn't Happened Yet|Excel]]
* [[Excel Data Quality Toolkit#Date Hasn't Happened Yet|Excel]]
* [[OpenRefine Data Quality Toolkit#DDate Hasn't Happened Yet|OpenRefine]]
* [[OpenRefine Data Quality Toolkit#Date Hasn't Happened Yet|OpenRefine]]
* [[Specify Data Quality Toolkit#Date Hasn't Happened Yet|Specify]]
* [[Specify Data Quality Toolkit#Date Hasn't Happened Yet|Specify]]
* [[Symbiota Data Quality Toolkit#Date Hasn't Happened Yet|Symbiota]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#date-hasnt-happened-yet Symbiota]
* [[TaxonWorks Data Quality Toolkit#Date Hasn't Happened Yet|TaxonWorks]]
* [[TaxonWorks Data Quality Toolkit#Date Hasn't Happened Yet|TaxonWorks]]


Line 46: Line 46:
* [[OpenRefine Data Quality Toolkit#Date is Suspiciously Old|OpenRefine]]
* [[OpenRefine Data Quality Toolkit#Date is Suspiciously Old|OpenRefine]]
* [[Specify Data Quality Toolkit#Date is Suspiciously Old|Specify]]
* [[Specify Data Quality Toolkit#Date is Suspiciously Old|Specify]]
* [[Symbiota Data Quality Toolkit#Date is Suspiciously Old|Symbiota]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#date-is-suspiciously-old Symbiota]
* [[TaxonWorks Data Quality Toolkit#Date is Suspiciously Old|TaxonWorks]]
* [[TaxonWorks Data Quality Toolkit#Date is Suspiciously Old|TaxonWorks]]


Line 53: Line 53:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Identified Date Earlier than Collected Date|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Identified Date Earlier than Collected Date|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Identified Date Earlier than Collected Date|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Identified Date Earlier than Collected Date|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#identified-date-earlier-than-collected-date Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Identified Date Earlier than Collected Date|TaxonWorks]]


=== Year, Month, and Day Values Do Not Match Date ===
=== Year, Month, and Day Values Do Not Match Date ===
Line 64: Line 64:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#year-month-and-day-values-do-not-match-date Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|TaxonWorks]]


== Geography ==
== Geography ==
Line 77: Line 77:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Coordinates are Zero|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Coordinates are Zero|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Coordinates are Zero|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Coordinates are Zero|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#coordinates-are-zero Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Coordinates are Zero|TaxonWorks]]


=== Coordinates Do Not Fall Within Named Geographic Unit ===
=== Coordinates Do Not Fall Within Named Geographic Unit ===
Line 88: Line 88:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#coordinates-do-not-fall-within-named-geographic-unit Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|TaxonWorks]]


=== Georeference Metadata with no Associated Georeference ===
=== Georeference Metadata with no Associated Georeference ===
Line 99: Line 99:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Georeference Metadata with no Associated Georeference|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#georeference-metadata-with-no-associated-georeference Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Georeference Metadata with no Associated Georeference|TaxonWorks]]


=== Elevation is Unlikely ===
=== Elevation is Unlikely ===
Line 110: Line 110:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Elevation is Unlikely|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Elevation is Unlikely|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Elevation is Unlikely|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Elevation is Unlikely|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#elevation-is-unlikely Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Elevation is Unlikely|TaxonWorks]]


=== Improperly Negated Latitudes/Longitudes ===
=== Improperly Negated Latitudes/Longitudes ===
Line 121: Line 121:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#improperly-negated-latitudeslongitudes Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|TaxonWorks]]


=== Invalid Coordinates ===
=== Invalid Coordinates ===
'''Problem:''' Coordinates fall outside accepted ranges or formats: for example, decimalLatitude values outside -90 to 90 or decimalLongitude values outside -180 to 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
'''Problem:''' Coordinates fall outside accepted ranges or formats: for example, decimalLatitude values outside -90 to 90 or decimalLongitude values outside -180 to 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Invalid Coordinates|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Invalid Coordinates|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Invalid Coordinates|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Invalid Coordinates|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#invalid-coordinates Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Invalid Coordinates|TaxonWorks]]
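
For readers working outside the tools listed above, the range check described under Invalid Coordinates can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from any of the linked toolkits: the function name <code>coordinate_issues</code> is hypothetical, and it only covers decimalLatitude and decimalLongitude, not the parsing of verbatimCoordinates formats.

```python
def coordinate_issues(decimal_latitude, decimal_longitude):
    """Return a list of problems found in one pair of decimal-degree values.

    Hypothetical helper for illustration; accepts numbers or numeric strings.
    """
    issues = []
    # Each entry: Darwin Core term, raw value, absolute limit in degrees.
    checks = (
        ("decimalLatitude", decimal_latitude, 90.0),
        ("decimalLongitude", decimal_longitude, 180.0),
    )
    for term, raw, limit in checks:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # Blank cells, None, or text such as "45° 30' N" end up here.
            issues.append(f"{term} is not a number: {raw!r}")
            continue
        if not -limit <= value <= limit:
            issues.append(f"{term} out of range: {value}")
    return issues


if __name__ == "__main__":
    print(coordinate_issues("45.5", "-122.6"))
    print(coordinate_issues(95, 10))
    print(coordinate_issues("abc", 200))
```

An empty list means both values passed; any other result lists the terms that need review. Values exactly at the limits (-90, 90, -180, 180) are treated as valid in this sketch.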


=== Lower Geography Values are Provided, but No Higher Geography ===
=== Lower Geography Values are Provided, but No Higher Geography ===
Line 144: Line 143:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#lower-geography-values-are-provided-but-no-higher-geography Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|TaxonWorks]]


=== Minimum and Maximum Elevation Values Mismatched ===
=== Minimum and Maximum Elevation Values Mismatched ===
Line 155: Line 154:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#minimum-and-maximum-elevation-values-mismatched Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|TaxonWorks]]


=== Mismatched Country and CountryCode Values ===
=== Mismatched Country and CountryCode Values ===
Line 166: Line 165:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Mismatched Country and CountryCode Values|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Mismatched Country and CountryCode Values|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Mismatched Country and CountryCode Values|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Mismatched Country and CountryCode Values|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#mismatched-country-and-countrycode-values Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Mismatched Country and CountryCode Values|TaxonWorks]]


=== Mismatched Geographic Terms ===
=== Mismatched Geographic Terms ===
Line 177: Line 176:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Mismatched Geographic Terms|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Mismatched Geographic Terms|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Mismatched Geographic Terms|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Mismatched Geographic Terms|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#mismatched-geographic-terms Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Mismatched Geographic Terms|TaxonWorks]]


=== Missing Geodetic Datum ===
=== Missing Geodetic Datum ===
Line 188: Line 187:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Missing Geodetic Datum|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Missing Geodetic Datum|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Missing Geodetic Datum|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Missing Geodetic Datum|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#missing-geodetic-datum Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Missing Geodetic Datum|TaxonWorks]]


=== Missing Latitudes/Longitudes ===
=== Missing Latitudes/Longitudes ===
Line 199: Line 198:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Missing Latitudes/Longitudes|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Missing Latitudes/Longitudes|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Missing Latitudes/Longitudes|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Missing Latitudes/Longitudes|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#missing-latitudeslongitudes Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Missing Latitudes/Longitudes|TaxonWorks]]


=== Misspelled Geographic Unit Names ===
=== Misspelled Geographic Unit Names ===
Line 210: Line 209:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Misspelled Geographic Unit Names|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Misspelled Geographic Unit Names|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Misspelled Geographic Unit Names|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Misspelled Geographic Unit Names|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#misspelled-geographic-unit-names Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Misspelled Geographic Unit Names|TaxonWorks]]


== Taxonomy ==
== Taxonomy ==
Line 223: Line 222:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#misspelled-or-invalid-taxonomic-names Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|TaxonWorks]]


=== Unknown Higher Taxonomy ===
=== Unknown Higher Taxonomy ===
Line 234: Line 233:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Unknown Higher Taxonomy|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Unknown Higher Taxonomy|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Unknown Higher Taxonomy|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Unknown Higher Taxonomy|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#unknown-higher-taxonomy Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Unknown Higher Taxonomy|TaxonWorks]]


== Other Issues ==
== Other Issues ==
Line 247: Line 246:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Incorrect Character Encodings|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Incorrect Character Encodings|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Incorrect Character Encodings|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Incorrect Character Encodings|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#incorrect-character-encodings Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Incorrect Character Encodings|TaxonWorks]]


=== Incorrect Line Endings ===
=== Incorrect Line Endings ===
Line 258: Line 257:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Incorrect Line Endings|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Incorrect Line Endings|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Incorrect Line Endings|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Incorrect Line Endings|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#incorrect-line-endings Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Incorrect Line Endings|TaxonWorks]]


=== Invalid Individual Count ===
=== Invalid Individual Count ===
Line 269: Line 268:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Invalid Individual Count|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Invalid Individual Count|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Invalid Individual Count|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Invalid Individual Count|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#invalid-individual-count Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Invalid Individual Count|TaxonWorks]]


=== Non-standardized BasisOfRecord Values ===
=== Non-standardized BasisOfRecord Values ===
Line 284: Line 283:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Non-standardized BasisOfRecord Values|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Non-standardized BasisOfRecord Values|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Non-standardized BasisOfRecord Values|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Non-standardized BasisOfRecord Values|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#non-standardized-basisofrecord-values Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Non-standardized BasisOfRecord Values|TaxonWorks]]