Excel Data Quality Toolkit

From iDigBio
Jump to navigation Jump to search


Overview

This toolkit contains Excel-specific resources for the Data Quality Toolkit 2024. The instructions on this page assume you are working with a Darwin Core-formatted file, for example, a spreadsheet of occurrence data exported from GBIF as a Darwin Core Archive.

Catalog Numbers and Other Identifiers

Duplicate Catalog Numbers

Problem: The same catalog number is used multiple times within your dataset. (This problem may or may not be intentional, depending on your collection's policies. It is generally best to not duplicate catalog numbers, when possible).

Solution: A simple way to identify duplicate catalog numbers is to use Conditional Formatting.

  1. Click the letter above the column that contains catalog numbers to highlight the entire column.
  2. From the Home toolbar, click "Conditional Formatting, then "Highlight Cells Rules", then "Duplicate Values...".
  3. Click OK on the pop-up window.
  4. Duplicate values for catalog number will now be highlighted in red. To view these records together, you can sort the spreadsheet by first clicking outside the top left corner of the spreadsheet (on the small triangle icon) to select the ENTIRE spreadsheet.
  5. Click the "Sort & Filter" button in the "Editing" section of the Home toolbar.
  6. Click Custom Filter.
  7. In the pop-up window, select the column containing your catalog numbers from dropdown menu under "Column". (If your data has header rows, make sure that they "My data has headers" box is checked.)
  8. In the second dropdown menu labeled Sort On, select Cell Color. Note: Depending on the size of your spreadsheet, it might take a moment for the window to load after you have selected this option.
  9. In the third dropdown menu that will appear after you select Cell Color, select the formatting color (likely red) that you want to appear at the top of your spreadsheet.
  10. Click "OK" and wait until the spreadsheet re-loads (this may take awhile). Once done, the rows with duplicate catalog numbers will show up at the top of your spreadsheet.

Dates

Date Hasn't Happened Yet

Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is in the future.

Solution: Check that dates are before today with this formula where A1 is the date you want to test.

=IF(A1>TODAY(),"ERROR","OK")

Sample file

Date is Suspiciously Old

Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is outside the expected historical date range. The expected date range depends on the institution, but it is unlikely that most collections have specimens with dates prior to 1600.

Solution:

  1. Click the triangle icon just above and to the left of the first cell in your spreadsheet. This should select all the cells in your spreadsheet.
  2. In the "Editing" section of the Home menu, click Sort & Filter, then Custom Sort.
  3. Make sure the "My data has headers" box is checked.
  4. In the Sort By row under "Column", select the column/field that contains your date values.
  5. Under "Order", select "Oldest to Newest" from the dropdown menu. If this option is not available, this means Excel does not recognize your date column as containing dates (which can actually be easier to work with). Select "Smallest to Largest" or "A to Z" instead.
  6. If you were able to sort from Oldest to Newest, the oldest dates should now be at the top. Review them for dates that seem impossibly old for your collection. If you were not able to sort the column as a date, check both the top and the bottom of the spreadsheet for too-old dates.

Identified Date Earlier than Collected Date

Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).

Solution: Check that collection dates are before identification dates with this formula where A1 is the identification date and B1 is the collection date.

=IF(A1<B1,"ERROR","OK")

Sample file

Year, Month, and Day Values Do Not Match Date

Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.

Solution: Avoid this problem by letting Excel calculate year, month, and day for you using these formulas where A1 is the event date.

=YEAR(A1)

=MONTH(A1)

=DAY(A1)

Sample file

Geography

Coordinates are Zero

Problem: The provided latitude and longitude values are 0.

Solution: Records with zeroed coordinate values can be identified using a filter.

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your latitude values (e.g. dwc:decimalLatitude) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, change the filter criterion "Choose One" to "Equals", enter "0" in the box to the right, and select "Apply Filter".
  5. If desired, steps 3-4 can be repeated on the column containing your longitude values (e.g. dwc:decimalLongitude).

Coordinates Do Not Fall Within Named Geographic Unit

Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.

Solution: Excel does not have an easy solution for this problem.

Georeference Metadata with no Associated Georeference

Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., decimalLatitude, [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or verbatimCoordinates).

Solution: Excel does not have an easy solution for this problem.

Elevation is Unlikely

Problem: Elevation values are either too high (>17000 m) or too low (-11000 m) to occur on Earth.

Solution: Records with unlikely elevation values can be identified using a filter.

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your elevation values (e.g. dwc:minimumElevationInMeters, dwc:maximumElevationInMeters) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, change the filter criterion "Choose One" to "Between", set "Less Than" = "-11000" and "Greater Than" = "17000", and select "Apply Filter".

Improperly Negated Latitudes/Longitudes

Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.

Solution: Excel does not have an easy solution for this problem.

Invalid Coordinates

Problem: Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.

Solution: Excel does not have an easy solution for this problem.

Lower Geography Values are Provided, but No Higher Geography

Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.

Solution: Check for missing higher levels of geography with this formula where A1 is the country, B1 is the stateProvince, and C1 is the county.

=IF(ISBLANK(C1),IF(ISBLANK(B1),IF(ISBLANK(A1),"no geography","OK"),IF(ISBLANK(A1),"missing country","OK")),IF(ISBLANK(B1),IF(ISBLANK(A1),"missing state and country","missing state"),IF(ISBLANK(A1),"no country","OK")))

Sample Results
country stateProvince county check
Texas Kerr no country
USA Kerr missing state
Texas missing country
USA OK
Kerr missing state and country
USA Texas Kerr OK

Sample file

Minimum and Maximum Elevation Values Mismatched

Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).

Solution: Check that minimum elevation is below maximum elevation with this formula where A1 is the minimum elevation and B1 is the maximum elevation.

=IF(A1>B1,"ERROR","OK")

Sample file

Mismatched Country and CountryCode Values

Problem: The provided value for country and countryCode do not match.

Solution: Excel does not have an easy solution for this problem.

Mismatched Geographic Terms

Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.

Solution: Excel does not have an easy solution for this problem.

Missing Geodetic Datum

Problem: Geodetic datum is a key piece of a properly georeferenced specimen, but is usually left blank. Although it is commonly assumed to be in ‘WGS84’, this should be added and noted as such.

Solution: Records with missing geodetic datum values can be identified using a filter.

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your geodetic datum values (e.g. dwc:geodeticDatum) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, deselect "Select All" and then only check the "Blanks" box.

Missing Latitudes/Longitudes

Problem: A record has a latitude value, but not a longitude value, or vice versa.

Solution: Missing latitude and longitude values can be identified using a filter when at least one of the two values is present in a dataset.

To find missing latitude values:

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your latitude values (e.g. dwc:decimalLatitude) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
  5. Navigate to the column containing your longitude values (e.g. dwc:decimalLongitude) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  6. At the bottom of the pop-up window, deselect the "Blanks" box (you may need to scroll down in the pop-up window to locate it). The remaining visible rows contain records with missing latitude values.

To find missing longitude values:

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your longitude values (e.g. dwc:decimalLongitude) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
  5. Navigate to the column containing your latitude values (e.g. dwc:decimalLatitude) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  6. At the bottom of the pop-up window, deselect the "Blanks" box (you may need to scroll down in the pop-up window to locate it). The remaining visible rows contain records with missing longitude values.

Misspelled Geographic Unit Names

Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.

Solution: Avoid this issue with a pick list as data are entered. For a tutorial see Create a drop-down list

If data is already entered, check for non-standard values with this formula Where A1 is the geographic unit value being tested and the list to test against is on the geographic unit tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.

WARNING! Note the $ in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!

=VLOOKUP(A1,'geographic unit'!$A$1:$A$11,1,FALSE)

This same process can be used to solve issues. Get a list of unique values in your geographic unit column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct scientific name that should be used for every term in column A even if the two are the same.

Use a LOOKUP like the one above to get the correct scientific name for every row in your file.

=VLOOKUP(A1,'unique geographic unit'!$A$1:$B$11,2,FALSE)

Note that the unique values to check are in column A of the unique geographic unit tab and the correct replacement values are in column B of the unique geographic unit tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.

Taxonomy

Misspelled or Invalid Taxonomic Names

Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.

Solution: Avoid this issue with a pick list as data are entered. For a tutorial see Create a drop-down list

If data is already entered, check for non-standard values with this formula Where A1 is the scientific name value being tested and the list to test against is on the taxa tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.

WARNING! Note the $ in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!

=VLOOKUP(A1,'taxa'!$A$1:$A$11,1,FALSE)

This same process can be used to solve issues. Get a list of unique values in your scientific name column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct scientific name that should be used for every term in column A even if the two are the same.

Use a LOOKUP like the one above to get the correct scientific name for every row in your file.

=VLOOKUP(A1,'unique taxa'!$A$1:$B$11,2,FALSE)

Note that the unique values to check are in column A of the unique taxa tab and the correct replacement values are in column B of the unique taxa tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.

Unknown Higher Taxonomy

Problem: Species may be missing higher taxonomic information.

Solution:

  1. Open your dataset in Excel and highlight Row 1.
  2. From the Data toolbar, click "Filter".
  3. Navigate to the column containing your higher taxonomy values (e.g. dwc:order) and select the arrow icon next to the column header indicating that this column can now be filtered on.
  4. At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
  5. Resolve missing higher taxonomy as you are able.

Other Issues

Incorrect Character Encodings

Problem: Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols may be rendered incorrectly, affecting the readability and accuracy of the data. (e.g., Carl Linné).

Solution:

  1. In the Editing section of the Home menu, click Find & Select, then Replace.
  2. Use this resource to search for mis-rendered special characters in your dataset. For example, you could enter "Ã" into the "Find what:" field to find all mis-rendered symbols in the rightmost table, click Find Next, and then replace it with the correct value in the "Expected" column of this table.
  3. To replace all instances of a particularly common mis-rendered symbol (e.g., "é", which is mis-rendered as "é"), enter the incorrect value (é, in this example) in the "Find what:" field and the correct value (é, in this example) in the "Replace with:" field. Then click Replace All.

Incorrect Line Endings

Problem: When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.

Solution: Excel does not have an easy solution for this problem.

Invalid Individual Count

Problem: individualCount values may not make sense as a positive integer.

Solution:

  1. Click the triangle icon just above and to the left of the first cell in your spreadsheet. This should select all the cells in your spreadsheet.
  2. In the "Editing" section of the Home menu, click Sort & Filter, then Custom Sort.
  3. Make sure the "My data has headers" box is checked.
  4. In the Sort By row under "Column", select the column/field that contains your individualCount values.
  5. Keep the value "Cell Values" under "Sort On", then under "Order", select the "Largest to Smallest" option. This will bring all the non-numeric values in this field to the top of your spreadsheet, and you can review and fix them as necessary.
  6. Sort again with the "Smallest to Largest" option under Order to find any erroneous negative values.

Non-standardized BasisOfRecord Values

Problem: Values in the BasisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.

The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.

Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.

Solution: Avoid this issue with a pick list as data are entered. For a tutorial see Create a drop-down list

If data is already entered, check for non-standard values with this formula Where A1 is the basisOfRecord value being tested and the list to test against is on the basis of record tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.

WARNING! Note the $ in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!

=VLOOKUP(A1,'basis of record'!$A$1:$A$11,1,FALSE)

Sample Results
A B
MaterialEntity MaterialEntity
PreservedSpecimen PreservedSpecimen
FossilSpecimen FossilSpecimen
LivingSpecimen LivingSpecimen
MaterialSample MaterialSample
Preserved Specimen #N/A
HumanObservation HumanObservation
MachineObservation MachineObservation
fossil #N/A
tissue #N/A
MaterialCitation


This same process can be used to solve issues. Get a list of unique values in your basisOfRecord column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct basisOfRecord term that should be used for every term in column A even if the two are the same.

Use a LOOKUP like the one above to get the correct basisOfRecord term for every row in your file.

=VLOOKUP(A1,'unique basis of record'!$A$1:$B$11,2,FALSE)

Note that the unique values to check are in column A of the unique basis of record tab and the correct replacement values are in column B of the unique basis of record tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.

Sample Results
A B
MaterialEntity MaterialEntity
PreservedSpecimen PreservedSpecimen
FossilSpecimen FossilSpecimen
LivingSpecimen LivingSpecimen
MaterialSample MaterialSample
Preserved Specimen PreservedSpecimen
HumanObservation HumanObservation
MachineObservation MachineObservation
fossil FossilSpecimen
tissue MaterialEntity
MaterialCitation MaterialCitation

Sample file