Example of trivial transformations on INHS fish dataset

From iDigBio


The Illinois Natural History Survey (INHS) fish collection graciously shared the 105,742 specimen records used in this transformation example. The records were extracted as a Comma-Separated Values (CSV) file from the INHS FileMaker Pro database, and every specimen record was provided with a Globally Unique IDentifier (GUID) in the field called "dwc:occurrenceID". The GUID technology chosen by the INHS collection managers was an HTTP-based URI of the form:


Once the data was received, the first step was to verify that every identifier (GUID) was unique. This check can be performed quickly with the unique filter of Excel or with the Unix 'uniq' command (on sorted input, since 'uniq' only compares adjacent lines). Every identifier in this dataset was unique.
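
For readers who prefer a script to Excel or 'uniq', the same check can be sketched in Python. This is a minimal illustration, not the procedure actually used on the INHS data: the function name duplicate_ids and the sample rows are hypothetical, and only the field name 'dwc:occurrenceID' comes from the text above.

```python
import csv
import io
from collections import Counter


def duplicate_ids(csv_file, id_field="dwc:occurrenceID"):
    """Return identifiers that are blank or occur more than once."""
    reader = csv.DictReader(csv_file)
    counts = Counter((row.get(id_field) or "").strip() for row in reader)
    return {value: n for value, n in counts.items() if n > 1 or value == ""}


# Hypothetical sample rows, not taken from the INHS export.
sample = io.StringIO(
    "dwc:occurrenceID,dwc:catalogNumber\n"
    "id-001,1\n"
    "id-002,2\n"
    "id-001,3\n"
)
print(duplicate_ids(sample))
```

An empty result means that every record has a non-blank, unique identifier.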

Mapping terms to standard terms

The next step was to go through each field of the dataset and gather information about its meaning so that it could be mapped to a standard term. This exchange resulted in the following transformations:

  1. Mapped 'latitude'/'longitude' fields to 'verbatimLatitude' and 'verbatimLongitude' because not all lat/long values were in decimal format. Ideally, all coordinates should be converted to decimal values and put into the 'dwc:decimalLatitude/dwc:decimalLongitude' fields.
  2. Removed the 'GIS_Latitude_IL'/'GIS_Longitude_IL' fields because those coordinates had not been confirmed to be accurate.
  3. Mapped 'specimen_remarks' to 'dwc:occurrenceRemarks' and 'remarks' to 'dwc:locationRemarks', since the former are comments specific to a species collected at a location and the latter are comments specific to the collecting location.
  4. Since there are no appropriate DwC terms for water-based locations, and MISC makes use of a hierarchical/ranked data model (not a flat data model), new terms ('inhs:location_Stream', 'inhs:location_RiverMile', and 'inhs:location_Basin') were created in INHS namespace. Streams represent the more specific location where the collection of the specimen took place, with river mile indicating the mileage along the stream, and basin being the larger water body.
  5. Concatenated all the water-based locations (separated by comma) into 'dwc:waterBody'.
  6. Created a new term 'inhs:locationTrs' to store Township Range Section (TRS) information. This term will also be recommended for inclusion in MISC.
  7. Concatenated information from 'dwc:day', 'dwc:month', and 'dwc:year' whenever possible into 'dwc:eventDate' using the ISO 8601 format (YYYY-MM-DD), including date ranges (YYYY-MM-DD/YYYY-MM-DD). An event date could not be generated for records with imprecise values such as 'Fall' or 'Spring'.
  8. Mapped the count of specimens that received special preparation into the MISC term 'idigbio:preparationCount'.
  9. Transformed and merged the acronyms used to indicate the various levels of species endangerment into the MISC term 'idigbio:endangeredStatus'. The acronym mappings used were: SE=State Endangered, FE=Federally Endangered, ST=State Threatened, FT=Federally Threatened, I=Introduced.
  10. All other information had trivial mappings into DwC terms, namely 'dwc:institutionCode', 'dwc:catalogNumber', 'dwc:family', 'dwc:scientificName', 'dwc:identifiedBy', 'dwc:locality', 'dwc:verbatimLocality', 'dwc:county', 'dwc:state', 'dwc:country', 'dwc:day', 'dwc:month', 'dwc:year', 'dwc:recordedBy', 'dwc:individualCount', 'dwc:preparation', 'dwc:fieldNumber', 'dwc:typeStatus'.
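
Transformations 5, 7, and 9 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the INHS dataset: the function names are hypothetical, the separators between status acronyms (commas, spaces, or slashes) and the "; " output separator are guesses, and date ranges are left out because the text does not say how range endpoints were stored in the source. Only the acronym meanings and the target formats come from the list above.

```python
import re
from datetime import date

# Acronym meanings as listed in transformation 9.
STATUS_LABELS = {
    "SE": "State Endangered",
    "FE": "Federally Endangered",
    "ST": "State Threatened",
    "FT": "Federally Threatened",
    "I": "Introduced",
}


def build_event_date(year, month, day):
    """Return YYYY-MM-DD, or None when the parts are not a valid date."""
    try:
        return date(int(year), int(month), int(day)).isoformat()
    except (TypeError, ValueError):
        # Covers missing parts and imprecise values such as 'Fall'.
        return None


def expand_status(codes):
    """Expand acronyms; the input separators are an assumption."""
    tokens = [t for t in re.split(r"[,\s/]+", codes.strip()) if t]
    return "; ".join(STATUS_LABELS.get(t.upper(), t) for t in tokens)


def join_water_body(*parts):
    """Join the non-empty water-based location values with commas."""
    return ", ".join(p.strip() for p in parts if p and p.strip())


print(build_event_date("1998", "7", "4"))
print(build_event_date("1998", "Fall", ""))
print(expand_status("SE, FT"))
print(join_water_body("Example Creek", "", "Example River basin"))
```

Returning None rather than a partial date keeps imprecise records out of 'dwc:eventDate' while leaving the original 'dwc:day', 'dwc:month', and 'dwc:year' values untouched.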

A second round of edits was made on this dataset to improve its searchability in the aggregate. Many of these field values might be considered boilerplate, but they help a data record stand out when searched by people without daily familiarity with this particular organismal group. For example:

  1. Added collectionCode: when only the institution code is given, this field sets the dataset apart from datasets of other organismal groups held by the same institution.
  2. Added values to higher level taxonomic ranks: kingdom, phylum, class, order.
  3. Parsed out the scientific name field into its individual parts: genus, specificEpithet, infraspecificEpithet, taxonRemarks.
  4. Further teased apart details of specimen identification, e.g., 'cf.' and 'aff.': identificationQualifier, identificationRank.
  5. Added vernacularName to accommodate identifications that were not taxonomic.
  6. Added more details to the locality: continent, countryCode.
  7. Performed some minor grooming to determine which lat/lon values were in degrees-minutes-seconds (DMS) format and which were in decimal (Dec) format: decimalLatitude and decimalLongitude.
  8. Where relevant, added the values of taxonomic ranks that are not provided by DwC (e.g., tribe, superfamily) to a higherClassification field.
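
Items 3, 4, and 7 above can be illustrated with a short Python sketch. It rests on several assumptions that the text does not confirm: that DMS values look like 40 06 30 N or 40°06'30"N, that scientific names carry no authorship, and that the qualifier is stored as the bare token ('cf.' or 'aff.'). The function names are hypothetical.

```python
import re

QUALIFIERS = {"cf.", "aff."}
NAME_KEYS = ("genus", "specificEpithet", "infraspecificEpithet")
DMS_PATTERN = re.compile(
    r"\s*(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*?([NSEW])\s*"
)


def to_decimal_degrees(text):
    """Return decimal degrees for a decimal or DMS string, else None."""
    try:
        return float(text)  # already in decimal format
    except ValueError:
        pass
    match = DMS_PATTERN.fullmatch(text.upper())
    if match is None:
        return None
    degrees, minutes = int(match.group(1)), int(match.group(2))
    seconds = float(match.group(3))
    if minutes >= 60 or seconds >= 60:
        return None
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative.
    return -value if match.group(4) in "SW" else value


def split_scientific_name(name):
    """Split a name without authorship into DwC-style parts."""
    parts = name.split()
    qualifier = next((p for p in parts if p.lower() in QUALIFIERS), "")
    words = [p for p in parts if p.lower() not in QUALIFIERS]
    result = dict.fromkeys(NAME_KEYS, "")
    result.update(zip(NAME_KEYS, words))
    result["identificationQualifier"] = qualifier
    return result


print(to_decimal_degrees("40 06 30 N"))
print(to_decimal_degrees("88 14 24 W"))
print(split_scientific_name("Etheostoma cf. spectabile"))
```

Values that match neither format return None, so they can stay in the verbatim fields for manual review instead of being converted incorrectly.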

The same approach to data enrichment was also applied to the INHS crustacean and mollusk datasets.

Final Product

At the end of the process above, we created a sip file that was submitted for ingestion. Check out what the contents of the file look like.

Go back to: Data Ingestion Guidance

Go back to: CYWG page