Data Quality Toolkit 2024: Difference between revisions

Add data quality categories
Line 7: Line 7:
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and links to resources for identifying and fixing the issues are provided.
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and links to resources for identifying and fixing the issues are provided.


This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook].
This page was inspired by Bob Mesibov's [https://www.datafix.com.au/cookbook/ Data Cleaner's Cookbook], GBIF's [https://data-blog.gbif.org/post/issues-and-flags/ data quality flags], and iDigBio's [https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags data quality flags].


If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[OpenRefine Data Quality Toolkit|OpenRefine]], [[Specify Data Quality Toolkit|Specify]], [[Symbiota Data Quality Toolkit|Symbiota]], [[TaxonWorks Data Quality Toolkit|TaxonWorks]].
If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: [[Arctos Data Quality Toolkit|Arctos]], [[Excel Data Quality Toolkit|Excel]], [[OpenRefine Data Quality Toolkit|OpenRefine]], [[Specify Data Quality Toolkit|Specify]], [[Symbiota Data Quality Toolkit|Symbiota]], [[TaxonWorks Data Quality Toolkit|TaxonWorks]].
Line 119: Line 119:
=== Improperly Negated Latitudes/Longitudes ===
=== Improperly Negated Latitudes/Longitudes ===
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
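A tool-agnostic way to screen for this issue is to compare each coordinate's sign with the sign expected for the record's country. The sketch below is illustrative only: the lookup table, the function name, and the use of a <code>countryCode</code> key are assumptions, and the single entry encodes the U.S. example given above. Some territories cross the antimeridian, so hits are best treated as flags for review rather than automatic fixes.

```python
# Expected longitude sign per country code.
# Hypothetical, deliberately tiny table; extend it for real use.
EXPECTED_LONGITUDE_SIGN = {"US": -1}


def has_negation_problem(record):
    """Return True when decimalLongitude has the wrong sign for the country.

    `record` is assumed to be a dict keyed by Darwin Core term names.
    Records with an unknown country or an unparseable longitude are not flagged.
    """
    expected = EXPECTED_LONGITUDE_SIGN.get(record.get("countryCode"))
    if expected is None:
        return False
    try:
        longitude = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return False
    # A negative product means the actual sign is opposite to the expected sign.
    return longitude * expected < 0
```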
=== Invalid Coordinates ===
'''Problem:''' Coordinates fall outside accepted ranges or formats, such as a decimalLatitude outside -90 to 90 or a decimalLongitude outside -180 to 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
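The range part of this check can be sketched in a few lines. The function name is hypothetical, and the sketch covers only decimalLatitude and decimalLongitude; it does not parse verbatimCoordinates.

```python
import math


def coordinate_in_range(latitude, longitude):
    """Return True when both values parse as numbers within the valid ranges."""
    try:
        lat = float(latitude)
        lon = float(longitude)
    except (TypeError, ValueError):
        # Blank, missing, or non-numeric values are treated as invalid.
        return False
    if not (math.isfinite(lat) and math.isfinite(lon)):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```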


'''Solutions:'''
'''Solutions:'''
Line 163: Line 175:
=== Mismatched Geographic Terms ===
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
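Outside a CMS, this kind of mismatch can be caught by checking each lower term against a list of values known to exist under the higher term. The sketch below assumes a small hand-built lookup for Canada; the table and function name are hypothetical, and a real check would draw on a maintained gazetteer.

```python
# Hypothetical lookup of state/province names by country (illustrative only).
KNOWN_STATE_PROVINCES = {
    "Canada": {
        "Alberta", "British Columbia", "Manitoba", "New Brunswick",
        "Newfoundland and Labrador", "Northwest Territories", "Nova Scotia",
        "Nunavut", "Ontario", "Prince Edward Island", "Quebec",
        "Saskatchewan", "Yukon",
    },
}


def geography_mismatch(record):
    """Return True when stateProvince is not listed under the record's country.

    Records whose country is not in the lookup, or whose stateProvince is
    blank, are not flagged because nothing can be concluded about them.
    """
    known = KNOWN_STATE_PROVINCES.get(record.get("country"))
    state = (record.get("stateProvince") or "").strip()
    if known is None or not state:
        return False
    return state not in known
```

With this table, the example above (country = Canada, stateProvince = Sussex) is flagged, while stateProvince = Ontario is not.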
=== Missing Geodetic Datum ===
'''Problem:''' The geodetic datum is a key part of a properly georeferenced record, but it is often left blank. Although WGS84 is commonly assumed, the datum should be recorded explicitly, and an assumed value should be noted as such.
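A minimal sketch for finding affected records is shown below. It assumes records are dicts keyed by Darwin Core term names, and the function name is hypothetical. It only reports records for review; it does not fill in a datum.

```python
def records_missing_datum(records):
    """Return the indexes of records that have coordinates but no geodeticDatum."""
    flagged = []
    for index, record in enumerate(records):
        has_coordinates = (
            record.get("decimalLatitude") not in (None, "")
            and record.get("decimalLongitude") not in (None, "")
        )
        datum = (record.get("geodeticDatum") or "").strip()
        if has_coordinates and not datum:
            flagged.append(index)
    return flagged
```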


'''Solutions:'''
'''Solutions:'''
Line 196: Line 219:
== Taxonomy ==
== Taxonomy ==


=== Misspelled Taxonomic Names ===
=== Misspelled or Invalid Taxonomic Names ===
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
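When no taxonomic name service is available, likely misspellings can be screened with fuzzy matching against a trusted name list. The sketch below uses Python's standard <code>difflib</code> module; the reference list, function name, and cutoff value are assumptions chosen for illustration, and suggestions should be reviewed before any name is changed.

```python
import difflib

# Hypothetical reference list; in practice this would come from a
# taxonomic authority file.
REFERENCE_NAMES = ["Quercus alba", "Quercus rubra", "Acer rubrum"]


def suggest_name(name, reference_names, cutoff=0.85):
    """Return the name itself if it is in the list, else the closest match or None."""
    if name in reference_names:
        return name
    matches = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```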
=== Unknown Higher Taxonomy ===
'''Problem:''' Records may be missing higher taxonomic information (e.g., kingdom, phylum, class, order, family) for the identified taxon.


'''Solutions:'''
'''Solutions:'''
Line 208: Line 242:


== Other Issues ==
== Other Issues ==
=== Incorrect Character Encodings ===
'''Problem:''' Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpreted and garbled text. For instance, special characters such as accented letters or symbols (e.g., the é in Carl Linné) may be rendered incorrectly, affecting the readability and accuracy of the data.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
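One common form of this problem is UTF-8 text that was read as Latin-1, which turns a single accented letter into two unrelated characters. The sketch below reverses that specific mistake; it assumes this particular encoding mix-up and leaves the text unchanged when the round trip fails. The function name is hypothetical.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    If the text cannot be re-encoded as Latin-1, or the resulting bytes are
    not valid UTF-8, the input is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "Carl Linn\u00c3\u00a9" is the garbled form of "Carl Linn\u00e9" (Carl Linné).
repaired = repair_mojibake("Carl Linn\u00c3\u00a9")
```

Text that is already correct, such as a properly encoded "Carl Linné", fails the round trip and is returned as-is.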
=== Incorrect Line Endings ===
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
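Line endings can also be normalized before a file is loaded into any of these tools. The sketch below converts CR+LF and stray CR characters to LF; the function name is hypothetical.

```python
def normalize_line_endings(text):
    """Convert Windows (CR+LF) and bare CR line endings to Unix (LF)."""
    # Replace CR+LF first so the second replacement only sees stray CRs.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

When reading a file for this purpose, opening it with <code>newline=""</code> keeps Python from translating line endings on its own before the function sees them.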
=== Invalid Individual Count ===
'''Problem:''' individualCount values are not valid positive integers, for example negative numbers, decimals, or non-numeric text.
'''Solutions:'''
* Arctos
* Excel
* OpenRefine
* Specify
* Symbiota
* TaxonWorks
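The check itself is simple to express. The sketch below follows the positive-integer rule stated above and treats every other value, including zero, decimals, and blanks, as invalid; the function name is hypothetical.

```python
def valid_individual_count(value):
    """Return True when the value is a positive integer written in ASCII digits."""
    text = str(value).strip()
    # isdigit() rejects signs, decimal points, and empty strings;
    # isascii() excludes non-ASCII digit characters that int() might not parse.
    return text.isascii() and text.isdigit() and int(text) > 0
```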


=== Non-standardized BasisOfRecord Values ===
=== Non-standardized BasisOfRecord Values ===