Data Ingestion Guidance

From iDigBio

Revision as of 10:19, 14 January 2014

Guidance To Data providers When First Considering iDigBio Data Ingestion

Audience: Data Providers, iDigBio data ingestion staff

This is the process description for

  • the iDigBio staff to follow to assure that data are successfully and efficiently moved from data provider to the portal, available for searching.
  • data providers to follow to assure that data are efficiently and accurately provided to the iDigBio staff.

Contact Info

If you need assistance, contact data@idigbio.org

Process Terminology

The process is divided into steps. Each step has a defined start and end; reaching the end of a step means the data have moved on to the next step.

IngestionProcess.gif
  • negotiating - in the process of determining provider's interest in data ingestion
    • begins with an email inviting providers (institutions, aggregators) to send their data to the iDigBio specimen data portal
    • open a Redmine ticket in category=Data Mobilizing
    • ends with data exported by provider, ready for inspection and ingestion
  • mobilizing - in the process of evaluating data being fit for ingestion
    • begins with provider exported data and cursory inspection
    • ends with data passing inspection and passing to ingesting state, Redmine ticket changes to category=Data
  • ingesting - in the process of ingesting provider's data
    • begins with Redmine ticket change to category=Data
    • ends with
      • data successfully ingested, ready for consumption
      • report sent back to data mobilizing staff
      • report sent to provider
      • Redmine ticket set to Status= Closed
  • evaluating - in the process of evaluating a failure to be ingested
    • begins with ingestion failure
      • if the failure is a data error, send the data back to the mobilizing state for corrections, or
      • if the failure is an ingestion error, make corrections on the ingestion side
    • ends with data re-submission to ingesting state
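The states and transitions above can be sketched as a small state machine. This is an illustration only, not the iDigBio implementation: the `DONE` state is a hypothetical name for "data ingested, reports sent, Redmine ticket closed", and the names `State`, `TRANSITIONS`, and `advance` are assumptions introduced here.

```python
from enum import Enum


class State(Enum):
    NEGOTIATING = "negotiating"
    MOBILIZING = "mobilizing"
    INGESTING = "ingesting"
    EVALUATING = "evaluating"
    DONE = "done"  # hypothetical terminal state: ticket closed


# Allowed transitions, read off the process description above.
TRANSITIONS = {
    State.NEGOTIATING: {State.MOBILIZING},
    State.MOBILIZING: {State.INGESTING},
    # Ingestion either succeeds or fails and is evaluated.
    State.INGESTING: {State.DONE, State.EVALUATING},
    # Data error: back to mobilizing. Ingestion error: fix and re-submit.
    State.EVALUATING: {State.MOBILIZING, State.INGESTING},
    State.DONE: set(),
}


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(State.EVALUATING, State.MOBILIZING)` models a data error being sent back for corrections, while `advance(State.NEGOTIATING, State.INGESTING)` raises because the data must pass inspection first.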

Data Requirements for Data Providers

Below is what we ask of the data so that they are easily searchable in the cyberinfrastructure we provide.

There are 3 kinds of data files to submit for ingestion:

  1. specimen data with metadata
  2. media data related to and attached by reference to specimen records with metadata
  3. media files - e.g., non-archival .jpgs (see acceptable format here: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations)

Packaging for specimen data

In order of preference:

  • DwC-A (Darwin Core Archives)
  • Symbiota feed
  • custom CSV or TXT (save the data in UTF-8 format to preserve diacritics in people and place names)
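For the custom CSV option, the UTF-8 requirement can be met explicitly at export time. The following is a minimal sketch in Python, assuming the records are already available as dictionaries; the function name `write_utf8_csv` and the sample field names are illustrative and are not part of any iDigBio tooling.

```python
import csv


def write_utf8_csv(path, fieldnames, rows):
    """Write rows (a list of dicts) to a CSV file encoded as UTF-8."""
    # encoding="utf-8" preserves diacritics in people and place names;
    # newline="" lets the csv module control line endings and quoting.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Because the csv module quotes any field that contains a line break, writing through it also avoids unescaped newline characters in the exported file.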

Sending data to us

  • an RSS feed for ready access and update is recommended
  • email the files to us

Specimen metadata

  • Each specimen record needs to have a unique (within the dataset) identifier in the dwc:occurrenceID field. When the ingestion software detects duplicate identifiers, the records are flagged as an error and are not ingested.
  • If using a custom CSV, use field names that are as close to Darwin Core as possible. Additionally, make use of the MISC field names (local iDigBio extensions to Darwin Core); the host association terms are an example of an extension found in MISC. Use XML-style field names that include the domain of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are not indexed and are not searchable.
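Both points can be checked before submission. The sketch below is a hedged illustration in Python, not the ingestion software itself: it reports duplicate dwc:occurrenceID values and lists column names that lack a schema prefix. The function names are hypothetical, and the default prefix list covers only the two prefixes named above (dwc: and ac:); any prefix used for MISC extension terms would need to be added.

```python
from collections import Counter


def duplicate_occurrence_ids(rows, id_field="dwc:occurrenceID"):
    """Return the identifier values that occur more than once."""
    counts = Counter((row.get(id_field) or "").strip() for row in rows)
    # Empty identifiers are a separate problem and are not reported here.
    return sorted(value for value, n in counts.items() if value and n > 1)


def unprefixed_fields(fieldnames, prefixes=("dwc:", "ac:")):
    """Return column names that do not start with a known schema prefix."""
    return [name for name in fieldnames if not name.startswith(prefixes)]
```

Rows can come from `csv.DictReader`, whose `fieldnames` attribute supplies the column names for the second check.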

Complete Attribution

In order for the data to be correctly attributed to the provider, the following are important to complete:

  • fill in the dwc:institutionCode field in the dataset using the correct code.
  • go to [GRBio.org] to get their coolID value to store in the dwc:collectionID and dwc:institutionID fields

Permission to ingest and licensing

  • you need to have permission to submit the data
  • fill in the DwC record-level fields for intellectual property and licensing, and be sure to fill in your official institution code (dwc:institutionCode) (see and update http://GRBio.org)

Data recommendations for optimal searchability

  • put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22
  • put elevation values in meters, as bare numbers without units (the fields minimumElevationInMeters and maximumElevationInMeters already assume the values are in meters, so there is no need to include the units with the data)
  • do not use unescaped newline characters
  • do not use '0' to represent a missing value, e.g., in lat or lon; leave the field empty instead
  • lat and lon coordinates need to be in decimal degrees, without N, S, E, W (use negative values for south and west)
  • parse genus, species, infraspecific epithet if already aggregated into a scientific name
  • include parsed higher taxonomy, at least kingdom and family if possible
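Several of these recommendations can be checked mechanically before the data are sent. The following Python sketch is one possible pre-submission check, under stated assumptions: the choice of fields (dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude, and the two elevation fields) and the function name `check_record` are illustrative, and the messages are not those produced by the iDigBio ingestion scripts.

```python
import math
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_record(record):
    """Return a list of problems found in one specimen record (a dict)."""
    problems = []

    event_date = record.get("dwc:eventDate") or ""
    if event_date:
        if not ISO_DATE.fullmatch(event_date):
            problems.append("dwc:eventDate is not in YYYY-MM-DD form")
        else:
            try:
                date.fromisoformat(event_date)
            except ValueError:
                problems.append("dwc:eventDate is not a real calendar date")

    limits = (("dwc:decimalLatitude", 90.0), ("dwc:decimalLongitude", 180.0))
    for field, limit in limits:
        raw = record.get(field) or ""
        if not raw:
            continue  # empty is the correct way to say "unknown"
        try:
            value = float(raw)  # rejects "29.65N", "29 39 06", etc.
        except ValueError:
            problems.append(f"{field} is not a decimal number")
            continue
        if not math.isfinite(value) or abs(value) > limit:
            problems.append(f"{field} is out of range")
        elif value == 0:
            problems.append(f"{field} is 0; leave it empty if unknown")

    for field in ("dwc:minimumElevationInMeters", "dwc:maximumElevationInMeters"):
        raw = record.get(field) or ""
        if raw:
            try:
                float(raw)
            except ValueError:
                problems.append(f"{field} should be a bare number in meters")

    return problems
```

A coordinate of exactly 0 can be legitimate, so the zero check is best treated as a prompt for review rather than a hard error.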

Packaging for images/media objects

  • each media record needs to have a GUID: a persistent globally unique identifier or at least a unique (within the dataset) identifier in the occurrenceID field.
    • if submitting media records with specimen data records, put the specimen occurrenceID in the associatedSpecimenReference field in the image metadata csv
  • use Audubon Core metadata, with one record to go with each media record. The more you can flesh out the details of the image, the more likely it will be to be highly retrievable.
  • just like the ownership of catalog records, the media records need to be provided freely and with permission, and each record needs to have at least Creative Commons permission = "CC BY"

Error Handling

When data are received from the provider during the mobilizing stage, they are evaluated for fitness. When the evaluation is successful, the ingestion process moves from mobilizing to ingesting, and the data are submitted to the ingestion scripts. If an error condition occurs, the ingesting staff evaluate whether it is a script error or a data error. If it is the latter, they send an email to the mobilizing staff, who may contact the provider for changes. When the errors have been addressed, the mobilizing staff re-submit the data to the ingesting staff.

Use Case for Media and Data

2 CSVs (scenario: one for image metadata, one for specimen data, plus uploaded images with > 1 image per specimen)

  • Put specimen occurrenceID in the field associatedSpecimenReference in the image metadata CSV (whose schema is in Audubon Core format)
  • Generate a meta.xml file by hand and package up the files in a DwC-A like format. (No eml.xml required).
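Before packaging, it can help to confirm that every image metadata row points at a specimen that actually exists in the specimen CSV. The sketch below shows one way to do this in Python, assuming both files have been read into lists of dictionaries (for example with `csv.DictReader`) and that the columns are named dwc:occurrenceID and ac:associatedSpecimenReference; the function name is hypothetical.

```python
from collections import defaultdict


def group_media_by_specimen(
    specimen_rows,
    media_rows,
    id_field="dwc:occurrenceID",
    ref_field="ac:associatedSpecimenReference",
):
    """Group media rows under the specimen they reference.

    Returns (grouped, orphans): grouped maps an occurrenceID to its
    media rows; orphans lists media rows whose reference matches no
    specimen record.
    """
    known_ids = {row[id_field] for row in specimen_rows}
    grouped = defaultdict(list)
    orphans = []
    for media in media_rows:
        reference = media.get(ref_field) or ""
        if reference in known_ids:
            grouped[reference].append(media)
        else:
            orphans.append(media)
    return dict(grouped), orphans
```

A specimen with more than one image simply appears with several entries in `grouped`; a non-empty `orphans` list indicates references to fix before submission.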

Sample Scenarios of Data Transformations to Prepare Data for Ingestion

Additional References

If you want to learn about acceptable Creative Commons licenses in iDigBio: