Data Ingestion Workflow

DRAFT

Audience: iDigBio data ingestion staff

This is the process description for:

- the iDigBio staff to follow to ensure that data are moved successfully and efficiently from the data provider to the portal, where they are available for searching.
- data providers to follow to ensure that data are provided efficiently and accurately to the iDigBio staff.

Contact info

If you find yourself:

- in need of assistance, contact data@idigbio.org
- ready to discuss providing data to iDigBio, contact data@idigbio.org

Process terminology

Processing steps: each step has a start and an end, and reaching the end signifies that the data have moved to the next step.

negotiating - in the process of determining provider's interest in data ingestion
- begins with an email invitation to providers (institutions, aggregators) to send their data to the iDigBio specimen data portal
- open a Redmine ticket in category=Data Mobilizing
- ends with data exported by provider, ready for inspection and ingestion.
mobilizing - in the process of evaluating data being fit for ingestion
- begins with provider exported data and cursory inspection
- fill in this table with provider info: metadata.xml, unless there is a good eml.xml file available
- ends with data passing inspection and moving to the ingesting state; the Redmine ticket changes to category=Data

ingesting - in the process of ingesting provider's data
- begins with Redmine ticket change to category=Data
- ends with
  - data successfully ingested, ready for consumption
  - report sent back to data mobilizing staff
  - report sent to provider. Reference: Publishers Report
  - Redmine ticket set to Status=Closed

evaluating - in the process of evaluating a failure to be ingested
- begins with ingestion failure
  - evaluate the ingestion failure: if it is a data error, send the data back to the mobilizing state for corrections, or
  - if it is an ingestion error, make corrections
- ends with data re-submission to ingesting state
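
The four steps above can be read as a small state machine. The following is a minimal Python sketch of that reading, not a description of any iDigBio software. The State names, the DONE marker, and the advance function are hypothetical and are introduced only for illustration.

```python
from enum import Enum


class State(Enum):
    NEGOTIATING = "negotiating"
    MOBILIZING = "mobilizing"
    INGESTING = "ingesting"
    EVALUATING = "evaluating"
    DONE = "done"  # hypothetical marker for a closed Redmine ticket


# Allowed moves between steps, read off the step descriptions above.
TRANSITIONS = {
    State.NEGOTIATING: {State.MOBILIZING},
    State.MOBILIZING: {State.INGESTING},
    State.INGESTING: {State.DONE, State.EVALUATING},
    # A data error goes back to mobilizing; an ingestion error is
    # corrected and the data are re-submitted to ingesting.
    State.EVALUATING: {State.MOBILIZING, State.INGESTING},
    State.DONE: set(),
}


def advance(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, advance(State.EVALUATING, State.MOBILIZING) models a data error being sent back for corrections, while a direct move from negotiating to ingesting raises ValueError.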

Data requirements for data providers

Below is what we ask of the data so that they are easily searchable in the cyberinfrastructure we provide.

There are 3 kinds of data files to submit for ingestion:

specimen data with dataset metadata
media data related to and attached by reference to specimen records with metadata (use of dwc:associatedMedia is not viewed as sending media)
media files - e.g., non-archival .jpgs (see acceptable format here: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations)

Packaging for specimen data

In order of preference:

DwC-A (Darwin Core Archive) in an RSS feed produced by IPT
Custom DwC-A in an RSS feed produced by Symbiota
Custom CSV or TXT, saved in UTF-8 format to preserve diacritics in people and place names. This option is for sending only specimen data or only media data; DwC-A packaging is required when sending both specimen and media data.

Use Darwin Core field names: http://rs.tdwg.org/dwc/terms/

Access IPT: https://code.google.com/p/gbif-providertoolkit/

Sending data to iDigBio

An RSS feed for ready access and update is our preference
Email the files to us

Specimen metadata

Each specimen record should have a unique (within the dataset) identifier in the dwc:occurrenceID field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. Identifiers, if not GUIDs or specifically UUIDs, are typically what is called the DwC (Darwin Core) triplet:

<dwc:institutionCode>:<dwc:collectionCode>:<dwc:catalogNumber>

example with a prefix:

urn:catalog:TNHC:Herpetology:122
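
As an illustration, the triplet and the duplicate check described above can be sketched in Python. This is an example under stated assumptions, not the ingestion software itself: the function names dwc_triplet and duplicate_ids, the default urn:catalog prefix, and the representation of records as dictionaries keyed by prefixed field names are all hypothetical.

```python
from collections import Counter


def dwc_triplet(institution_code, collection_code, catalog_number,
                prefix="urn:catalog"):
    """Join the three codes with colons, optionally behind a prefix."""
    parts = [institution_code, collection_code, catalog_number]
    if prefix:
        parts.insert(0, prefix)
    return ":".join(str(part).strip() for part in parts)


def duplicate_ids(records, field="dwc:occurrenceID"):
    """Return the non-empty identifiers that occur more than once."""
    counts = Counter(str(record.get(field, "")).strip() for record in records)
    return sorted(value for value, n in counts.items() if value and n > 1)
```

Running duplicate_ids over a dataset before sending it gives the provider a chance to fix repeated identifiers before any records are rejected.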

If using a custom CSV, use field names that are as close to DwC terms as possible. Additionally, make use of the MISC field names (local iDigBio extensions to DwC); the host association terms are an example of an extension found in the MISC. Use XML-style field names that include the domain of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are not indexed and are not searchable.
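
A custom CSV with prefixed field names, saved as UTF-8, might be written as in the following sketch. It uses only the Python standard library. The function name write_specimen_csv and the particular selection of columns are assumptions made for illustration; a real export would use whichever DwC terms the dataset actually has.

```python
import csv

# Illustrative selection of Darwin Core terms, written with the dwc: prefix.
FIELDS = [
    "dwc:occurrenceID",
    "dwc:institutionCode",
    "dwc:collectionCode",
    "dwc:catalogNumber",
    "dwc:scientificName",
    "dwc:recordedBy",
]


def write_specimen_csv(path, records, fields=FIELDS):
    """Write records (dicts keyed by field name) to a UTF-8 CSV file.

    Keys missing from a record are written as empty cells. A key that is
    not in `fields` raises ValueError, so stray column names are caught
    before the file is sent.
    """
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
```

Opening the file with encoding="utf-8" is what preserves diacritics in people and place names, and newline="" lets the csv module quote any line breaks inside a value.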

Complete attribution and licensing

In order for each provider's data to be correctly attributed when found on the iDigBio portal, the following are important to complete:

Fill in your official institution code (dwc:institutionCode)
- check your entry in grbio.org and make sure it is current and complete: http://grbio.org/
  - here: Repositories: http://grbio.org/find-biorepositories
  - here: Institutional collections: http://grbio.org/find-institutional-collections
- make sure you have used the same institutionCode and collectionCode in GRBio and in your EML/IPT dialog
Go to http://GRBio.org to get their Cool URI value for your institution to store in the dwc:institutionID and dwc:collectionID fields (e.g., http://biocol.org/urn:lsid:biocol.org:col:15587)
Fill in the DwC record-level fields for intellectual property and licensing that apply to the whole dataset, e.g., dcterms:rights, dcterms:rightsHolder and dcterms:accessRights, or use the global EML-based dwc:intellectualRights field.

dcterms:rights: any actual rights statements (IP, or otherwise), and any licenses associated with the data sets (e.g., CC0). Any right or license will appear with each record it covers.

dcterms:rightsHolder: will be blank unless the publisher has entered content in this field on their own. Whether the publisher puts their institution name or an individual's name in this field is up to them. This tends to be a blank field.

dcterms:accessRights: is where the terms of use should be placed, such as a requirement to attribute the provider or to send the provider a final copy of a given product. It will be blank unless the provider has entered content at the source on their own.

dwc:intellectualRights example: institution-name data records may be used by individual researchers or research groups, but they may not be repackaged, resold, or redistributed in any form without the express written consent of a curatorial staff member of the institution-name. If any of these records are used in an analysis or report, the provenance of the original data must be acknowledged and the institution-name notified. The institution-name and its staff are not responsible for damages, injury or loss due to the use of these data.

Some further guidance on this subject: "...when you are completing the metadata in the IPT, under Additional Metadata, it is important to consider the licensing and rights that you may wish to publish the data under. There are a couple of interesting articles describing the reasoning behind the Creative Commons licenses, http://creativecommons.org/licenses/, at the following URLs:

It may also be useful to read the Creative Commons Wiki on using Creative Commons licenses on data. http://wiki.creativecommons.org/Data" (ref D. Bloom)

Permission to ingest

The provider needs to have permission to submit their data.

Data recommendations for optimal searchability

put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-digit year.
put elevation in meters in the elevation fields, without the units (the fields dwc:minimumElevationInMeters and dwc:maximumElevationInMeters already assume the numeric values are in meters, so there is no need to include the units with the data)
do not use unescaped newline characters
do not export '0' in fields to represent no value, e.g., lat or lon
make sure lat and lon coordinates are in decimal degrees, with no N, S, E, W
parse genus, species, and infraspecific epithet into separate fields if they are already aggregated into a scientific name
include parsed higher taxonomy, at least kingdom and family, and the intervening ranks if possible.
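
Several of the recommendations above can be checked mechanically before data are sent. The following Python sketch is one possible way to do that and is not the check iDigBio runs. It assumes each record is a dictionary of string values, that dates are in dwc:eventDate, and that coordinates are in dwc:decimalLatitude and dwc:decimalLongitude; the function names are hypothetical. It accepts only full YYYY-MM-DD dates, which is stricter than the recommendation requires.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
DECIMAL = re.compile(r"[+-]?\d+(\.\d+)?")

# Assumed field names and the largest absolute value each may hold.
COORDINATES = {"dwc:decimalLatitude": 90.0, "dwc:decimalLongitude": 180.0}
ELEVATIONS = ("dwc:minimumElevationInMeters", "dwc:maximumElevationInMeters")


def is_iso_date(value):
    """True for a real calendar date written exactly as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, value in record.items():
        if "\n" in value or "\r" in value:
            problems.append(f"{field}: contains a newline character")
    date = record.get("dwc:eventDate", "")
    if date and not is_iso_date(date):
        problems.append("dwc:eventDate: not in YYYY-MM-DD form")
    for field, limit in COORDINATES.items():
        value = record.get(field, "").strip()
        if not value:
            continue
        if not DECIMAL.fullmatch(value):
            problems.append(f"{field}: not a plain decimal number")
        elif float(value) == 0:
            problems.append(f"{field}: is 0, leave blank if the value is unknown")
        elif abs(float(value)) > limit:
            problems.append(f"{field}: out of range")
    for field in ELEVATIONS:
        value = record.get(field, "").strip()
        if value and not DECIMAL.fullmatch(value):
            problems.append(f"{field}: should be a number without units")
    return problems
```

A coordinate of exactly 0 can be legitimate, so that message is best treated as a prompt to double-check the value rather than as a hard error.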

Packaging for images / media objects

Each media record should have a unique (within the dataset) identifier in the dcterms:identifier field.
- If submitting media records with specimen data records:
  - Put the specimen dwc:occurrenceID in the ac:associatedSpecimenReference field in the image metadata CSV.
  - If you are not using IPT, generate a meta.xml file by hand and package up the files in a DwC-A like format. (No eml.xml required).
Use Audubon Core metadata, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record to go with each media record. The more you can flesh out the details of the image, the more retrievable it will be.
Just like the ownership of catalog records, the media records need to be provided freely and with permission, and each record should have at least Creative Commons permission = "CC BY".
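
The link between media records and specimen records described above can be verified before packaging. The sketch below is a hypothetical helper, not part of any iDigBio tool. It assumes both sets of records have been loaded as dictionaries keyed by the prefixed field names, and it lists the media records whose ac:associatedSpecimenReference does not match any dwc:occurrenceID in the specimen data.

```python
def unresolved_media(media_records, specimen_records):
    """Return dcterms:identifier values of media that point at no specimen."""
    known = {
        specimen["dwc:occurrenceID"]
        for specimen in specimen_records
        if specimen.get("dwc:occurrenceID")
    }
    return [
        media.get("dcterms:identifier", "")
        for media in media_records
        if media.get("ac:associatedSpecimenReference") not in known
    ]
```

An empty result means every media record resolves to a specimen record in the same submission.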

Error handling

When data are received from the provider during the mobilizing process step, they are evaluated for fitness. Once the evaluation is successful, the ingestion process moves from mobilizing to ingesting, and the data are submitted to the ingestion scripts. If an error condition occurs, the ingesting staff evaluate whether it is a script error or a data error. If it is a script error, the ingesting staff make corrections. If it is a data error, the ingesting staff send an email to the mobilizing staff, who may contact the provider for changes. When the errors have been addressed, the mobilizing staff re-submit the data to the ingesting staff.

Sample scenarios of data transformations to prepare data for ingestion

Additional references

If you want to learn about acceptable Creative Commons licenses in iDigBio:

https://www.idigbio.org/content/idigbio-intellectual-property-policy
