Revision as of 13:43, 13 January 2014

Guidance To Data Providers When First Considering iDigBio Data Ingestion

Audience: Data Providers, iDigBio data ingestion staff

This page describes the process iDigBio staff follow to ensure that data move successfully and efficiently from the data provider to the portal, where they become available for searching. It also details what staff look for from the provider in terms of input, metadata standards, and file formats (e.g., .csv, .jpg, DwC-A).

Contact Info

If you find yourself here and need assistance, contact data@idigbio.org

The sections below describe what we ask of the data so that they are fit for use (e.g., easily searchable) in the cyberinfrastructure we provide.

Process Terminology

Processing moves through the steps below. Each step has a defined start and end; reaching the end of a step means the data have moved on to the next step.

negotiating - in the process of evaluating the provider's interest in data ingestion
- begins with an email inviting providers (institutions, aggregators) to send their data to the iDigBio specimen data portal
- open a Redmine ticket in category=Data Mobilizing
- ends with data exported by provider, ready for inspection and ingestion
mobilizing - in the process of evaluating whether the data are fit for ingestion
- begins with exported data and cursory inspection
- ends with data passing inspection and moving to the 'ingesting' state; the Redmine ticket changes to category=Data
ingesting - in the process of ingesting provider's data
- begins with Redmine ticket change to category=Data
- ends with
  - data successfully being ingested
  - report sent back to data mobilizing staff
  - Redmine ticket set to Status=Closed
evaluating - in the process of evaluating a failure to be ingested
- begins with ingestion failure
  - evaluate ingestion failure, send back to mobilizing state for corrections or
  - evaluate ingestion failure, make corrections
- ends with re-submission to ingesting state
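
The four states above can be read as a small state machine. The following is a minimal sketch in Python, not iDigBio's actual tooling: the names ALLOWED_TRANSITIONS and advance are hypothetical, and "closed" stands in for the Redmine ticket being set to Status=Closed after a successful ingestion.

```python
# Hypothetical model of the processing states described above.
# "closed" is an assumption: it represents the Redmine ticket reaching
# Status=Closed once ingestion has succeeded.
ALLOWED_TRANSITIONS = {
    "negotiating": {"mobilizing"},
    "mobilizing": {"ingesting"},
    "ingesting": {"closed", "evaluating"},     # success or ingestion failure
    "evaluating": {"mobilizing", "ingesting"},  # send back, or fix and re-submit
}


def advance(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target
```

For example, advance("ingesting", "evaluating") models an ingestion failure, while advance("negotiating", "ingesting") raises an error because the mobilizing step would be skipped.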

Data Requirements

Three kinds of data are required for ingestion:

specimen data with metadata
media data related to and attached by reference to specimen records
media files - e.g., non-archival .jpg (see the acceptable formats at https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations)

Packaging for specimen data

IPT
DwC-A
CSV (save the data in UTF-8 format to preserve diacritics)
Symbiota feeds
an RSS feed is recommended for ready access and updates; otherwise, email the files to us
metadata:
- each specimen record needs to have a unique (within the dataset) identifier in the occurrenceID field.
- name the fields as close to Darwin Core as possible, in XML style, e.g., 'dwc:continent', and additionally use the MISC field names (local iDigBio extensions to Darwin Core)
you need to have permission to submit the data
data recommendations for optimal searchability in exported data:
- put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22
- put elevation in meters in the elevation fields, without the units (the fields minimumElevationInMeters and maximumElevationInMeters already assume the numeric values are in meters, so there is no need to include the units with the data)
- no unescaped newline characters
- do not use '0' to represent a missing value, e.g., in the latitude or longitude fields
- give latitude and longitude coordinates in decimal form, without N, S, E, or W
- parse genus, species, and infraspecific epithet into separate fields if they are already aggregated into a scientific name
- include parsed higher taxonomy
- save the data in UTF-8 format to preserve diacritics in people and place names
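
Several of the recommendations above can be checked mechanically before an export is sent. The sketch below is illustrative only and is not the check iDigBio runs at ingestion. It assumes each record has already been read into a dict of strings (for example with csv.DictReader on a file opened with encoding="utf-8"), and it assumes the Darwin Core field names eventDate, decimalLatitude, and decimalLongitude, which this page does not name explicitly. The helper names check_record and find_duplicate_ids are hypothetical.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
# Field names below are assumed Darwin Core terms, not taken from this page.
COORDINATE_LIMITS = (("decimalLatitude", 90.0), ("decimalLongitude", 180.0))
ELEVATION_FIELDS = ("minimumElevationInMeters", "maximumElevationInMeters")


def check_record(record):
    """Return a list of problems found in one specimen record (a dict of strings)."""
    problems = []
    if not (record.get("occurrenceID") or "").strip():
        problems.append("occurrenceID is missing")

    event_date = (record.get("eventDate") or "").strip()
    if event_date:
        try:
            if not ISO_DATE.match(event_date):
                raise ValueError(event_date)
            date.fromisoformat(event_date)  # rejects dates such as 2014-02-30
        except ValueError:
            problems.append("eventDate is not a valid YYYY-MM-DD date")

    coordinates = {}
    for field, limit in COORDINATE_LIMITS:
        value = (record.get(field) or "").strip()
        if not value:
            continue
        try:
            coordinates[field] = float(value)
        except ValueError:
            problems.append(f"{field} must be a plain decimal number (no N, S, E, W)")
            continue
        if abs(coordinates[field]) > limit:
            problems.append(f"{field} is out of range")
    if coordinates.get("decimalLatitude") == 0 and coordinates.get("decimalLongitude") == 0:
        problems.append("coordinates are 0, 0; check that 0 is not standing in for no value")

    for field in ELEVATION_FIELDS:
        value = (record.get(field) or "").strip()
        if value:
            try:
                float(value)
            except ValueError:
                problems.append(f"{field} must be a number without units")

    for field, value in record.items():
        if isinstance(value, str) and ("\n" in value or "\r" in value):
            problems.append(f"{field} contains a newline character")
    return problems


def find_duplicate_ids(records):
    """Return the occurrenceID values that appear more than once in the dataset."""
    seen, duplicates = set(), set()
    for record in records:
        occurrence_id = (record.get("occurrenceID") or "").strip()
        if not occurrence_id:
            continue
        if occurrence_id in seen:
            duplicates.add(occurrence_id)
        seen.add(occurrence_id)
    return sorted(duplicates)
```

Because the check runs on parsed records, it flags any newline inside a value rather than distinguishing escaped from unescaped ones; a provider may want to relax that rule if their CSV writer quotes such values correctly.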

Packaging for images/media objects

each media record needs to have a GUID (a persistent, globally unique identifier) or at least an identifier that is unique within the dataset, in the occurrenceID field.
- if submitting media records with specimen data records, put the specimen data record into the ??? field
we need an Audubon Core metadata file, with one record to go with each media record, and we can provide coaching to help you create that file. The more detail you provide about each image, the more retrievable it will be.
just like the catalog records, the media records need to be provided freely and with permission, and each record needs to have at least Creative Commons permission = "CC BY"

Use Case for Media and Data

Two CSV files (one for image metadata, one for specimen data), plus uploaded images with more than one image per specimen

Put the specimen occurrenceID in the associatedSpecimenReference field of the image metadata CSV (whose schema follows Audubon Core).
Generate a meta.xml file by hand and package up the files in a DwC-A-like format (no eml.xml is required).
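
Before packaging, it can help to confirm that every image metadata row points at a specimen that is actually present in the specimen CSV. The following is a minimal sketch under stated assumptions, not an iDigBio tool: the function name unmatched_media_rows is hypothetical, both files are assumed to have a header row, and the column names occurrenceID and associatedSpecimenReference are taken from the text above.

```python
import csv
import io


def unmatched_media_rows(specimen_csv, media_csv):
    """Return media rows whose associatedSpecimenReference matches no specimen.

    Both arguments are open text files (or any iterable of CSV lines)
    that start with a header row.
    """
    specimen_ids = {row["occurrenceID"] for row in csv.DictReader(specimen_csv)}
    return [
        row
        for row in csv.DictReader(media_csv)
        if row["associatedSpecimenReference"] not in specimen_ids
    ]


# Usage with in-memory data; the "identifier" column and all values are made up.
specimens = io.StringIO("occurrenceID,genus\nspec-1,Quercus\nspec-2,Acer\n")
media = io.StringIO(
    "identifier,associatedSpecimenReference\n"
    "img-1,spec-1\n"
    "img-2,spec-1\n"
    "img-3,spec-9\n"
)
orphans = unmatched_media_rows(specimens, media)
```

With real files, open them with open(path, newline="", encoding="utf-8") so that quoting and diacritics are handled consistently. Several image rows may share one specimen reference, which matches the case of more than one image per specimen.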

Sample Scenarios of Data Transformations to Prepare Data for Ingestion
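
One illustrative scenario, offered here as a sketch and not taken from the original page: a provider stores coordinates as degrees, minutes, and seconds with a hemisphere letter, while the recommendations above ask for decimal values without N, S, E, or W. The function name dms_to_decimal is hypothetical, and it assumes the hemisphere letter comes last in the text.

```python
import re


def dms_to_decimal(text):
    """Convert text such as 29° 39' 0" N to a signed decimal value.

    Accepts one to three numbers (degrees, then optional minutes and seconds)
    followed by a hemisphere letter; S and W give negative results.
    """
    cleaned = text.strip().upper()
    hemisphere = re.search(r"[NSEW]$", cleaned)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cleaned)]
    if hemisphere is None or not 1 <= len(numbers) <= 3:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    numbers += [0.0] * (3 - len(numbers))  # missing minutes/seconds count as 0
    degrees, minutes, seconds = numbers
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere.group(0) in ("S", "W"):
        value = -value
    return round(value, 6)
```

For example, dms_to_decimal("82° 19' 30\" W") gives a negative longitude of about -82.325. Values that cannot be parsed raise an error, so they can be reviewed by hand instead of being replaced with a placeholder such as 0.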

Additional References

If you want to learn about acceptable Creative Commons licenses in iDigBio:

https://www.idigbio.org/content/idigbio-intellectual-property-policy

Data Ingestion Guidance: Difference between revisions