Data Quality Toolkit 2024: Difference between revisions

(17 intermediate revisions by 2 users not shown)
Line 7: Line 7:
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and links to resources for identifying and fixing the issues are provided.
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and links to resources for identifying and fixing the issues are provided.


This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook].
This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook], GBIF's [https://data-blog.gbif.org/post/issues-and-flags/ data quality flags], and iDigBio's [https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags data quality flags].


If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[OpenRefine Data Quality Toolkit|OpenRefine]], [[Specify Data Quality Toolkit|Specify]], [[Symbiota Data Quality Toolkit|Symbiota]], [[TaxonWorks Data Quality Toolkit|TaxonWorks]].
If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[Specify Data Quality Toolkit|Specify]], [https://biokic.github.io/symbiota-docs/editor/quality/ Symbiota], [[TaxonWorks Data Quality Toolkit|TaxonWorks]]. Additional command line tools can be found in Bob Mesibov's [https://www.datafix.com.au/darwin-core-checker/ Darwin Core Checker tool].


== Catalog Numbers and Other Identifiers==
== Catalog Numbers and Other Identifiers==
Line 17: Line 17:


'''Solutions:'''
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Duplicate Catalog Numbers|Arctos]]
* [https://handbook.arctosdb.org/documentation/catalog.html#catalog-number Arctos]
* Excel
* [[Excel Data Quality Toolkit#Duplicate Catalog Numbers|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Duplicate Catalog Numbers|OpenRefine]]
* [[Specify Data Quality Toolkit#Duplicate Catalog Numbers|Specify]]
* [[Specify Data Quality Toolkit#Duplicate Catalog Numbers|Specify]]
* [[Symbiota Data Quality Toolkit#Duplicate Catalog Numbers|Symbiota]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#duplicate-catalog-numbers Symbiota]
* [[TaxonWorks Data Quality Toolkit#Duplicate Catalog Numbers|TaxonWorks]]
* [[TaxonWorks Data Quality Toolkit#Duplicate Catalog Numbers|TaxonWorks]]


== Dates ==
== Dates ==
=== Date Hasn't Happened Yet ===
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is in the future.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Date Hasn't Happened Yet|Arctos]]
* [[Excel Data Quality Toolkit#Date Hasn't Happened Yet|Excel]]
* [[OpenRefine Data Quality Toolkit#Date Hasn't Happened Yet|OpenRefine]]
* [[Specify Data Quality Toolkit#Date Hasn't Happened Yet|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#date-hasnt-happened-yet Symbiota]
* [[TaxonWorks Data Quality Toolkit#Date Hasn't Happened Yet|TaxonWorks]]
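Outside of any particular CMS, a future-date check can be sketched in a few lines of Python. This is only an illustration, assuming dates are stored as ISO 8601 strings (YYYY-MM-DD); the function name <code>is_future_date</code> is hypothetical and not part of any tool listed above.

```python
from datetime import date


def is_future_date(value, today=None):
    """Return True if an ISO 8601 date string (YYYY-MM-DD) is later than today."""
    # Allow the reference date to be passed in so the check is repeatable.
    today = today or date.today()
    return date.fromisoformat(value) > today
```

The same comparison applies to dateIdentified, eventDate, and georeferencedDate values, provided each holds a single full date rather than a range or partial date.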
=== Date is Suspiciously Old ===
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is outside the expected historical date range. The expected range depends on the institution, but most collections are unlikely to hold specimens dated before 1600.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Date is Suspiciously Old|Arctos]]
* [[Excel Data Quality Toolkit#Date is Suspiciously Old|Excel]]
* [[OpenRefine Data Quality Toolkit#Date is Suspiciously Old|OpenRefine]]
* [[Specify Data Quality Toolkit#Date is Suspiciously Old|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#date-is-suspiciously-old Symbiota]
* [[TaxonWorks Data Quality Toolkit#Date is Suspiciously Old|TaxonWorks]]


=== Identified Date Earlier than Collected Date ===
=== Identified Date Earlier than Collected Date ===
Line 30: Line 53:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Identified Date Earlier than Collected Date|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Identified Date Earlier than Collected Date|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Identified Date Earlier than Collected Date|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Identified Date Earlier than Collected Date|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#identified-date-earlier-than-collected-date Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Identified Date Earlier than Collected Date|TaxonWorks]]
 
=== Year, Month, and Day Values Do Not Match Date ===
'''Problem:''' The event [https://dwc.tdwg.org/terms/#dwc:year year], [https://dwc.tdwg.org/terms/#dwc:month month], and [https://dwc.tdwg.org/terms/#dwc:day day] values do not match the provided [https://dwc.tdwg.org/terms/#dwc:eventDate event date]. The event date is often the date of collection for preserved specimens.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Arctos]]
* [[Excel Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Excel]]
* [[OpenRefine Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|OpenRefine]]
* [[Specify Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#year-month-and-day-values-do-not-match-date Symbiota]
* [[TaxonWorks Data Quality Toolkit#Year, Month, and Day Values Do Not Match Date|TaxonWorks]]
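As a rough, tool-independent illustration, the comparison between the separate year, month, and day fields and the event date could look like the sketch below. It assumes eventDate holds a single full ISO 8601 date (YYYY-MM-DD), not a range or partial date; the function name <code>ymd_matches_event_date</code> is hypothetical.

```python
from datetime import date


def ymd_matches_event_date(year, month, day, event_date):
    """Compare year/month/day fields against an ISO 8601 eventDate (YYYY-MM-DD)."""
    parsed = date.fromisoformat(event_date)
    # int() lets the fields arrive as either numbers or numeric strings.
    return (parsed.year, parsed.month, parsed.day) == (int(year), int(month), int(day))
```

A swapped month and day (for example, month = 4 and day = 7 for an eventDate of 1998-07-04) is reported as a mismatch.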


== Geography ==
== Geography ==
=== Coordinates are Zero ===
'''Problem:''' The provided latitude and longitude values are 0.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Coordinates are Zero|Arctos]]
* [[Excel Data Quality Toolkit#Coordinates are Zero|Excel]]
* [[OpenRefine Data Quality Toolkit#Coordinates are Zero|OpenRefine]]
* [[Specify Data Quality Toolkit#Coordinates are Zero|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#coordinates-are-zero Symbiota]
* [[TaxonWorks Data Quality Toolkit#Coordinates are Zero|TaxonWorks]]
=== Coordinates Do Not Fall Within Named Geographic Unit ===
'''Problem:''' The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Arctos]]
* [[Excel Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Excel]]
* [[OpenRefine Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|OpenRefine]]
* [[Specify Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#coordinates-do-not-fall-within-named-geographic-unit Symbiota]
* [[TaxonWorks Data Quality Toolkit#Coordinates Do Not Fall Within Named Geographic Unit|TaxonWorks]]


=== Georeference Metadata with no Associated Georeference ===
=== Georeference Metadata with no Associated Georeference ===
Line 43: Line 99:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Georeference Metadata with no Associated Georeference|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Georeference Metadata with no Associated Georeference|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#georeference-metadata-with-no-associated-georeference Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Georeference Metadata with no Associated Georeference|TaxonWorks]]
 
=== Elevation is Unlikely ===
'''Problem:''' Elevation values are either too high (> 17000 m) or too low (< -11000 m) to occur on Earth.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Elevation is Unlikely|Arctos]]
* [[Excel Data Quality Toolkit#Elevation is Unlikely|Excel]]
* [[OpenRefine Data Quality Toolkit#Elevation is Unlikely|OpenRefine]]
* [[Specify Data Quality Toolkit#Elevation is Unlikely|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#elevation-is-unlikely Symbiota]
* [[TaxonWorks Data Quality Toolkit#Elevation is Unlikely|TaxonWorks]]
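The thresholds in the problem statement translate directly into a range test. The sketch below is a minimal illustration that assumes elevation is already a number in meters; the constant and function names are hypothetical.

```python
# Thresholds taken from the problem statement above, in meters.
MAX_ELEVATION_M = 17000
MIN_ELEVATION_M = -11000


def elevation_is_unlikely(elevation_m):
    """Return True if an elevation in meters falls outside the plausible range."""
    return elevation_m > MAX_ELEVATION_M or elevation_m < MIN_ELEVATION_M
```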


=== Improperly Negated Latitudes/Longitudes ===
=== Improperly Negated Latitudes/Longitudes ===
Line 54: Line 121:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#improperly-negated-latitudeslongitudes Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Improperly Negated Latitudes/Longitudes|TaxonWorks]]
 
=== Invalid Coordinates ===
'''Problem:''' Coordinates fall outside accepted ranges or formats. For example, decimalLatitude must be between -90 and 90, and decimalLongitude must be between -180 and 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Invalid Coordinates|Arctos]]
* [[Excel Data Quality Toolkit#Invalid Coordinates|Excel]]
* [[OpenRefine Data Quality Toolkit#Invalid Coordinates|OpenRefine]]
* [[Specify Data Quality Toolkit#Invalid Coordinates|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#invalid-coordinates Symbiota]
* [[TaxonWorks Data Quality Toolkit#Invalid Coordinates|TaxonWorks]]
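The range part of this check can be sketched as follows. This is a minimal example covering only decimalLatitude and decimalLongitude; it does not validate verbatimCoordinates formats, and the function name <code>coordinates_in_range</code> is hypothetical.

```python
def coordinates_in_range(latitude, longitude):
    """Return True if decimal-degree values fall within the valid ranges."""
    try:
        lat, lon = float(latitude), float(longitude)
    except (TypeError, ValueError):
        return False  # blank or non-numeric values are treated as invalid
    return -90 <= lat <= 90 and -180 <= lon <= 180
```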
 
=== Lower Geography Values are Provided, but No Higher Geography ===
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Arctos]]
* [[Excel Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Excel]]
* [[OpenRefine Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|OpenRefine]]
* [[Specify Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#lower-geography-values-are-provided-but-no-higher-geography Symbiota]
* [[TaxonWorks Data Quality Toolkit#Lower Geography Values are Provided, but No Higher Geography|TaxonWorks]]


=== Minimum and Maximum Elevation Values Mismatched ===
=== Minimum and Maximum Elevation Values Mismatched ===
Line 65: Line 154:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#minimum-and-maximum-elevation-values-mismatched Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Minimum and Maximum Elevation Values Mismatched|TaxonWorks]]
 
=== Mismatched Country and CountryCode Values ===
'''Problem:''' The provided values for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Mismatched Country and CountryCode Values|Arctos]]
* [[Excel Data Quality Toolkit#Mismatched Country and CountryCode Values|Excel]]
* [[OpenRefine Data Quality Toolkit#Mismatched Country and CountryCode Values|OpenRefine]]
* [[Specify Data Quality Toolkit#Mismatched Country and CountryCode Values|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#mismatched-country-and-countrycode-values Symbiota]
* [[TaxonWorks Data Quality Toolkit#Mismatched Country and CountryCode Values|TaxonWorks]]
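One way to picture this check is a lookup from country name to its ISO 3166-1 alpha-2 code. The sketch below uses a three-entry table purely for illustration; a real check would need a complete country list, and the names <code>COUNTRY_CODES</code> and <code>country_matches_code</code> are hypothetical.

```python
# Illustrative subset only; a real check would use a full ISO 3166-1 table.
COUNTRY_CODES = {"Canada": "CA", "Mexico": "MX", "United States": "US"}


def country_matches_code(country, country_code):
    """Return True/False for a match, or None if the country is not in the lookup."""
    expected = COUNTRY_CODES.get(country.strip())
    if expected is None:
        return None  # unknown country name; cannot decide either way
    return expected == country_code.strip().upper()
```

Returning None for unknown names keeps unrecognized or misspelled countries separate from true mismatches, so they can be reviewed on their own.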
 
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Mismatched Geographic Terms|Arctos]]
* [[Excel Data Quality Toolkit#Mismatched Geographic Terms|Excel]]
* [[OpenRefine Data Quality Toolkit#Mismatched Geographic Terms|OpenRefine]]
* [[Specify Data Quality Toolkit#Mismatched Geographic Terms|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#mismatched-geographic-terms Symbiota]
* [[TaxonWorks Data Quality Toolkit#Mismatched Geographic Terms|TaxonWorks]]
 
=== Missing Geodetic Datum ===
'''Problem:''' The geodetic datum is a key part of a properly georeferenced specimen record, but it is often left blank. Although the datum is commonly assumed to be ‘WGS84’, it should be entered explicitly and any such assumption noted.
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Missing Geodetic Datum|Arctos]]
* [[Excel Data Quality Toolkit#Missing Geodetic Datum|Excel]]
* [[OpenRefine Data Quality Toolkit#Missing Geodetic Datum|OpenRefine]]
* [[Specify Data Quality Toolkit#Missing Geodetic Datum|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#missing-geodetic-datum Symbiota]
* [[TaxonWorks Data Quality Toolkit#Missing Geodetic Datum|TaxonWorks]]


=== Missing Latitudes/Longitudes ===
=== Missing Latitudes/Longitudes ===
Line 76: Line 198:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Missing Latitudes/Longitudes|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Missing Latitudes/Longitudes|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Missing Latitudes/Longitudes|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Missing Latitudes/Longitudes|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#missing-latitudeslongitudes Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Missing Latitudes/Longitudes|TaxonWorks]]


=== Misspelled Geographic Unit Names ===
=== Misspelled Geographic Unit Names ===
Line 87: Line 209:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Misspelled Geographic Unit Names|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Misspelled Geographic Unit Names|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Misspelled Geographic Unit Names|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Misspelled Geographic Unit Names|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#misspelled-geographic-unit-names Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Misspelled Geographic Unit Names|TaxonWorks]]


== Taxonomy ==
== Taxonomy ==


=== Misspelled Taxonomic Names ===
=== Misspelled or Invalid Taxonomic Names ===
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#misspelled-or-invalid-taxonomic-names Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Misspelled or Invalid Taxonomic Names|TaxonWorks]]
 
=== Unknown Higher Taxonomy ===
'''Problem:''' Records identified to species may be missing higher taxonomic information (e.g., family or order).
 
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Unknown Higher Taxonomy|Arctos]]
* [[Excel Data Quality Toolkit#Unknown Higher Taxonomy|Excel]]
* [[OpenRefine Data Quality Toolkit#Unknown Higher Taxonomy|OpenRefine]]
* [[Specify Data Quality Toolkit#Unknown Higher Taxonomy|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#unknown-higher-taxonomy Symbiota]
* [[TaxonWorks Data Quality Toolkit#Unknown Higher Taxonomy|TaxonWorks]]


== Other Issues ==
== Other Issues ==
=== Incorrect Character Encodings ===
'''Problem:''' Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols may be rendered incorrectly (e.g., "Carl Linné" displayed as "Carl LinnÃ©"), affecting the readability and accuracy of the data.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Incorrect Character Encodings|Arctos]]
* [[Excel Data Quality Toolkit#Incorrect Character Encodings|Excel]]
* [[OpenRefine Data Quality Toolkit#Incorrect Character Encodings|OpenRefine]]
* [[Specify Data Quality Toolkit#Incorrect Character Encodings|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#incorrect-character-encodings Symbiota]
* [[TaxonWorks Data Quality Toolkit#Incorrect Character Encodings|TaxonWorks]]
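One common form of this damage is UTF-8 text that was decoded as Latin-1, which turns "Linné" into "LinnÃ©". The sketch below reverses that one specific case; it is not a general encoding repair, and the function name <code>repair_mojibake</code> is hypothetical.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them with the intended encoding.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave the value unchanged
```

Values that are already correct, such as plain ASCII text or a properly encoded "Linné", pass through unchanged. It is still safer to review the affected records before overwriting any data.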
=== Incorrect Line Endings ===
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Incorrect Line Endings|Arctos]]
* [[Excel Data Quality Toolkit#Incorrect Line Endings|Excel]]
* [[OpenRefine Data Quality Toolkit#Incorrect Line Endings|OpenRefine]]
* [[Specify Data Quality Toolkit#Incorrect Line Endings|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#incorrect-line-endings Symbiota]
* [[TaxonWorks Data Quality Toolkit#Incorrect Line Endings|TaxonWorks]]
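For text that has already been read into memory, line endings can be normalized with two string replacements. This is a minimal sketch that converts everything to LF; the function name <code>normalize_line_endings</code> is hypothetical.

```python
def normalize_line_endings(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```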
=== Invalid Individual Count ===
'''Problem:''' individualCount values may not be valid positive integers (e.g., they may be negative, fractional, or non-numeric).
'''Solutions:'''
* [[Arctos Data Quality Toolkit#Invalid Individual Count|Arctos]]
* [[Excel Data Quality Toolkit#Invalid Individual Count|Excel]]
* [[OpenRefine Data Quality Toolkit#Invalid Individual Count|OpenRefine]]
* [[Specify Data Quality Toolkit#Invalid Individual Count|Specify]]
* [https://biokic.github.io/symbiota-docs/editor/quality/#invalid-individual-count Symbiota]
* [[TaxonWorks Data Quality Toolkit#Invalid Individual Count|TaxonWorks]]
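A positive-integer test for individualCount might look like the sketch below. It assumes values may arrive as numbers or as text from a spreadsheet export; the function name <code>is_valid_individual_count</code> is hypothetical.

```python
def is_valid_individual_count(value):
    """Return True if the value represents a positive whole number."""
    text = str(value).strip()
    # isascii() excludes Unicode digit characters that int() cannot parse.
    return text.isascii() and text.isdigit() and int(text) > 0
```

This treats 0 as invalid, in line with the "positive integer" wording above; collections that record absences with a count of 0 would need to relax that condition.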


=== Non-standardized BasisOfRecord Values ===
=== Non-standardized BasisOfRecord Values ===
Line 117: Line 283:


'''Solutions:'''
'''Solutions:'''
* Arctos
* [[Arctos Data Quality Toolkit#Non-standardized BasisOfRecord Values|Arctos]]
* Excel
* [[Excel Data Quality Toolkit#Non-standardized BasisOfRecord Values|Excel]]
* OpenRefine
* [[OpenRefine Data Quality Toolkit#Non-standardized BasisOfRecord Values|OpenRefine]]
* Specify
* [[Specify Data Quality Toolkit#Non-standardized BasisOfRecord Values|Specify]]
* Symbiota
* [https://biokic.github.io/symbiota-docs/editor/quality/#non-standardized-basisofrecord-values Symbiota]
* TaxonWorks
* [[TaxonWorks Data Quality Toolkit#Non-standardized BasisOfRecord Values|TaxonWorks]]