Data Quality Toolkit 2024

From iDigBio

Revision as of 15:10, 16 February 2024


Overview

This page aggregates common data quality issues, along with potential solutions to them, for collection management systems (CMS) and CMS-agnostic tools. Issues are grouped by data category, and each issue links to resources for identifying and fixing it.

This page was inspired by Bob Mesibov's Data Cleaner's Cookbook.

If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: Arctos, Excel, OpenRefine, Specify, Symbiota, TaxonWorks.

Catalog Numbers and Other Identifiers

Duplicate Catalog Numbers

Problem: The same catalog number is used multiple times within your dataset. (This duplication may or may not be intentional, depending on your collection's policies. It is generally best not to duplicate catalog numbers when possible.)

Solutions:
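
As a tool-agnostic illustration, a duplicate check can be sketched in a few lines of Python. This is a minimal sketch, assuming records have been exported as dictionaries keyed by Darwin Core term names; the function name and sample data are hypothetical.

```python
from collections import Counter

def find_duplicate_catalog_numbers(records):
    """Return catalogNumber values that appear more than once, with counts."""
    # Strip whitespace so "A-1 " and "A-1" count as the same value.
    counts = Counter((r.get("catalogNumber") or "").strip() for r in records)
    # Blank catalog numbers are a separate problem, so they are skipped here.
    return {number: n for number, n in counts.items() if number and n > 1}

# Hypothetical sample data.
records = [
    {"catalogNumber": "A-1"},
    {"catalogNumber": "A-2"},
    {"catalogNumber": "A-1 "},
    {"catalogNumber": ""},
    {"catalogNumber": ""},
]
duplicates = find_duplicate_catalog_numbers(records)
```

Whether each flagged value is a true error still depends on your collection's policies.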

Dates

Date Hasn't Happened Yet

Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is in the future.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
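
Outside of any particular CMS, the check amounts to comparing each date against today's date. The sketch below assumes dates are stored as full ISO 8601 strings (YYYY-MM-DD) and that the three fields listed are the ones of interest; partial or free-text dates are skipped rather than guessed at.

```python
from datetime import date

# Assumed set of date fields to check.
DATE_FIELDS = ("eventDate", "dateIdentified", "georeferencedDate")

def future_dates(record, today=None):
    """Return the names of date fields whose value is later than `today`."""
    today = today or date.today()
    flagged = []
    for field in DATE_FIELDS:
        value = record.get(field)
        if not value:
            continue
        try:
            parsed = date.fromisoformat(value)
        except ValueError:
            continue  # not a plain YYYY-MM-DD date; leave for manual review
        if parsed > today:
            flagged.append(field)
    return flagged
```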

Date is Suspiciously Old

Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is earlier than the expected historical date range. The expected range depends on the institution, but most collections are unlikely to hold specimens dated before 1600.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
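
A simple year threshold catches many of these cases, including typos such as 0190 for 1990. The following sketch assumes date strings begin with a four-digit year; the 1600 cutoff comes from the guideline above and should be adjusted to suit your collection.

```python
import re

EARLIEST_PLAUSIBLE_YEAR = 1600  # assumption; adjust per collection

def year_of(value):
    """Extract a leading four-digit year from a date string, if present."""
    match = re.match(r"(\d{4})(?!\d)", value.strip())
    return int(match.group(1)) if match else None

def is_suspiciously_old(value, earliest=EARLIEST_PLAUSIBLE_YEAR):
    """True when the date's year is earlier than the earliest plausible year."""
    year = year_of(value)
    return year is not None and year < earliest
```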

Identified Date Earlier than Collected Date

Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
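
The comparison itself is straightforward once both dates parse. This sketch assumes both fields hold full YYYY-MM-DD dates and returns None when either one is missing or cannot be parsed, so those records can be reviewed separately.

```python
from datetime import date

def identified_before_collected(record):
    """True when dateIdentified precedes eventDate; None if not comparable."""
    try:
        identified = date.fromisoformat(record["dateIdentified"])
        collected = date.fromisoformat(record["eventDate"])
    except (KeyError, ValueError, TypeError):
        return None
    return identified < collected
```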

Year, Month, and Day Values Do Not Match Date

Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
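
One way to express this check is to rebuild the date from its parts and compare it with eventDate. The sketch assumes eventDate is a single YYYY-MM-DD date (not a range) and that year, month, and day are stored as numbers or numeric strings.

```python
from datetime import date

def ymd_matches_event_date(record):
    """True when year/month/day agree with eventDate; None if not comparable."""
    try:
        event = date.fromisoformat(record["eventDate"])
        parts = (int(record["year"]), int(record["month"]), int(record["day"]))
    except (KeyError, ValueError, TypeError):
        return None
    return parts == (event.year, event.month, event.day)
```

A common cause of mismatches is a swapped month and day, which this comparison reports as False.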

Geography

Coordinates Do Not Fall Within Named Geographic Unit

Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
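
A full check requires boundary polygons for each geographic unit, but a bounding box can serve as a coarse first pass. The sketch below is only that: the box for Iceland uses rough, illustrative values, and a point inside a bounding box may still lie outside the actual border.

```python
# Rough, illustrative bounding boxes: (min_lat, max_lat, min_lon, max_lon).
BOUNDING_BOXES = {
    "Iceland": (63.0, 67.0, -25.0, -13.0),
}

def outside_bounding_box(record, boxes=BOUNDING_BOXES):
    """True when coordinates fall outside the country's box; None if unknown."""
    box = boxes.get(record.get("country"))
    if box is None:
        return None  # no box available for this country
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, ValueError, TypeError):
        return None  # coordinates missing or not numeric
    min_lat, max_lat, min_lon, max_lon = box
    return not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)
```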

Georeference Metadata with no Associated Georeference

Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum, are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate that a record was purposefully not georeferenced. However, the other metadata fields are rarely meaningful without associated coordinates (i.e., decimalLatitude, decimalLongitude, or verbatimCoordinates).

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
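
This rule can be sketched as a check for populated metadata fields on records with no coordinate fields. Following the note above, georeferencedBy and georeferenceRemarks are deliberately left out of the flagged set; that choice, like the function names, is an assumption of this sketch.

```python
# georeferencedBy and georeferenceRemarks are excluded on purpose (see above).
METADATA_FIELDS = (
    "coordinateUncertaintyInMeters",
    "georeferenceProtocol",
    "georeferenceSources",
    "geodeticDatum",
)
COORDINATE_FIELDS = ("decimalLatitude", "decimalLongitude", "verbatimCoordinates")

def is_blank(value):
    """Treat None and whitespace-only values as blank; 0 is not blank."""
    return value is None or str(value).strip() == ""

def orphaned_georeference_metadata(record):
    """List populated metadata fields on a record that has no coordinates."""
    if any(not is_blank(record.get(f)) for f in COORDINATE_FIELDS):
        return []
    return [f for f in METADATA_FIELDS if not is_blank(record.get(f))]
```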

Improperly Negated Latitudes/Longitudes

Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the hemisphere of the given country. For example, all longitudes in the contiguous U.S. should be negative.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
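
For countries that sit entirely within one hemisphere, the expected signs can be kept in a small lookup table. The table below is a hypothetical, three-country example; countries that span the equator, the prime meridian, or the 180° meridian cannot be handled this way.

```python
# Expected (latitude sign, longitude sign) per country; illustrative subset.
EXPECTED_SIGNS = {
    "Canada": (1, -1),
    "Australia": (-1, 1),
    "Chile": (-1, -1),
}

def wrong_sign_fields(record, expected=EXPECTED_SIGNS):
    """Return the coordinate fields whose sign contradicts the country."""
    signs = expected.get(record.get("country"))
    if signs is None:
        return []  # country not in the table; nothing to compare
    flagged = []
    for field, sign in zip(("decimalLatitude", "decimalLongitude"), signs):
        try:
            value = float(record[field])
        except (KeyError, ValueError, TypeError):
            continue  # missing or non-numeric coordinate
        if value * sign < 0:
            flagged.append(field)
    return flagged
```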

Lower Geography Values are Provided, but No Higher Geography

Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
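
In code, this is a test for any populated lower-geography field alongside an empty country field. The sketch assumes stateProvince, county, and municipality are the lower-level fields in use; extend the tuple to match your own data.

```python
# Assumed lower-geography fields; adjust to your dataset.
LOWER_GEOGRAPHY_FIELDS = ("stateProvince", "county", "municipality")

def missing_higher_geography(record):
    """True when a lower-geography value exists but country is blank."""
    has_lower = any(
        (record.get(f) or "").strip() for f in LOWER_GEOGRAPHY_FIELDS
    )
    has_country = bool((record.get("country") or "").strip())
    return has_lower and not has_country
```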

Minimum and Maximum Elevation Values Mismatched

Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
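
The underlying comparison is a single inequality. A minimal sketch, assuming both elevation fields hold numeric values or numeric strings:

```python
def elevation_range_inverted(record):
    """True when minimum elevation exceeds maximum; None if not comparable."""
    try:
        low = float(record["minimumElevationInMeters"])
        high = float(record["maximumElevationInMeters"])
    except (KeyError, ValueError, TypeError):
        return None
    return low > high
```

Inverted values often mean the two fields were swapped during data entry, so the original label is worth checking before correcting.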

Mismatched Country and CountryCode Values

Problem: The provided values for country and countryCode do not match.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
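
Checking the two fields against each other requires a lookup from country names to ISO 3166-1 alpha-2 codes. The three-entry table here is only an illustration; a real check would load a complete list and account for alternate country name spellings.

```python
# Illustrative subset of country names mapped to ISO 3166-1 alpha-2 codes.
COUNTRY_TO_CODE = {"canada": "CA", "mexico": "MX", "brazil": "BR"}

def country_code_mismatch(record, lookup=COUNTRY_TO_CODE):
    """True when countryCode disagrees with country; None if not comparable."""
    country = (record.get("country") or "").strip().lower()
    code = (record.get("countryCode") or "").strip().upper()
    expected = lookup.get(country)
    if not code or expected is None:
        return None  # blank code or country missing from the lookup
    return code != expected
```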

Mismatched Geographic Terms

Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
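
The Canada and Sussex example above can be reproduced with a lookup of valid state/province names per country. This sketch hard-codes only Canada's provinces and territories; in practice the hierarchy would come from a geographic authority file.

```python
# Valid stateProvince values per country; only Canada is filled in here.
STATE_PROVINCES = {
    "Canada": {
        "Alberta", "British Columbia", "Manitoba", "New Brunswick",
        "Newfoundland and Labrador", "Northwest Territories", "Nova Scotia",
        "Nunavut", "Ontario", "Prince Edward Island", "Quebec",
        "Saskatchewan", "Yukon",
    },
}

def state_province_not_in_country(record, hierarchy=STATE_PROVINCES):
    """True when stateProvince is not listed under country; None if unknown."""
    known = hierarchy.get(record.get("country"))
    state = (record.get("stateProvince") or "").strip()
    if known is None or not state:
        return None
    return state not in known
```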

Missing Latitudes/Longitudes

Problem: A record has a latitude value, but not a longitude value, or vice versa.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
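
Logically, the record is flagged when exactly one of the two coordinate fields is populated. A minimal sketch follows; note that it treats 0 as a real value, since 0 is a valid latitude or longitude.

```python
def is_blank(value):
    """Treat None and whitespace-only values as blank; 0 is not blank."""
    return value is None or str(value).strip() == ""

def has_unpaired_coordinate(record):
    """True when exactly one of latitude/longitude is present."""
    has_lat = not is_blank(record.get("decimalLatitude"))
    has_lon = not is_blank(record.get("decimalLongitude"))
    return has_lat != has_lon
```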

Misspelled Geographic Unit Names

Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
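
When a list of accepted names is available, fuzzy matching can suggest likely corrections. The sketch below uses `difflib.get_close_matches` from the Python standard library; the four-country list and the 0.8 similarity cutoff are illustrative assumptions, and any suggestion should be confirmed by a person before it is applied.

```python
from difflib import get_close_matches

# Illustrative reference list; a real one would be far longer.
KNOWN_COUNTRIES = ["Canada", "Mexico", "Brazil", "Argentina"]

def suggest_spelling(value, known=KNOWN_COUNTRIES, cutoff=0.8):
    """Suggest the closest known name for a value not found in the list."""
    if value in known:
        return None  # already an accepted spelling
    matches = get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```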

Taxonomy

Misspelled Taxonomic Names

Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
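
Even without a taxonomic database, likely misspellings can be surfaced by looking for pairs of names within the dataset that are nearly identical. The sketch below compares every pair of distinct names, so it suits small to medium name lists; the 0.85 similarity threshold is an assumption to tune.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_names(names, threshold=0.85):
    """Return pairs of distinct names whose similarity meets the threshold."""
    unique = sorted({n.strip() for n in names if n and n.strip()})
    pairs = []
    # Pairwise comparison: the cost grows with the square of the name count.
    for a, b in combinations(unique, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

# Hypothetical sample data.
names = ["Quercus alba", "Quercus alba", "Quercus abla", "Acer rubrum"]
suspects = near_duplicate_names(names)
```

A flagged pair is not necessarily an error, since closely related taxa can have legitimately similar names.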

Other Issues

Non-standardized BasisOfRecord Values

Problem: Values in the basisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so improves the discoverability and interoperability of your data.

The currently accepted values for basisOfRecord are: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.

Note that even differences in spacing, punctuation, or capitalization (e.g., Preserved Specimen) are discouraged.

Solutions:

  • Arctos
  • Excel
  • OpenRefine
  • Specify
  • Symbiota
  • TaxonWorks
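
Because the vocabulary is short, values can be validated by exact match, with a suggestion offered when a value differs only in spacing, punctuation, or capitalization. This is a minimal sketch using the accepted values listed above; the function name is hypothetical.

```python
ACCEPTED_VALUES = (
    "MaterialEntity", "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "Event", "HumanObservation", "MachineObservation",
    "Taxon", "Occurrence", "MaterialCitation",
)

def check_basis_of_record(value):
    """Return (is_valid, suggestion) for a basisOfRecord value."""
    if value in ACCEPTED_VALUES:
        return True, None
    # Ignore spacing, punctuation, and case when looking for a suggestion.
    key = "".join(ch for ch in value if ch.isalnum()).lower()
    for term in ACCEPTED_VALUES:
        if term.lower() == key:
            return False, term
    return False, None
```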