Data Ingestion Guidance: Difference between revisions

From iDigBio

Revision as of 22:24, 10 January 2014

Guidance When First Considering iDigBio Data Ingestion

Contact Info

For assistance, contact data@idigbio.org

Below is what we ask of the data to make it fit for use in the cyberinfrastructure we are building:

Data Requirements

Three kinds of data are submitted for ingestion:

  1. specimen data,
  2. media data related to and attached by reference to specimen records,
  3. media files

For specimen data

  • format: IPT, DwC-A, or CSV. An RSS feed is recommended for ready access and updates; otherwise, email the files to us
  • metadata:
    • each specimen record needs to have a unique (within the dataset) identifier in the occurrenceID field.
    • name the fields as close to Darwin Core as possible, and additionally use the MISC field names (local iDigBio extensions to Darwin Core)
  • you need to have permission to submit the data
  • data recommendations for optimal searchability:
    • put dates in ISO 8601 format, i.e., YYYY-MM-DD
    • put elevation in meters in the elevation field, as a bare number without the unit label
    • no unescaped newline characters
    • parse out genus, species, infraspecific epithet if already aggregated into a scientific name
    • take care to preserve diacritics in people and place names when saving or exporting for ingestion (save the data in UTF-8 format).
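
The recommendations above can be sketched as a small cleaning script. This is only an illustration, not an iDigBio tool: the column names (eventDate, elevation, scientificName), the list of accepted input date layouts, and the choice to replace embedded newlines with a space are all assumptions, and the name splitter is deliberately naive (it does not handle authorship strings or hybrids).

```python
import csv
import re
from collections import Counter
from datetime import datetime

# Assumed input date layouts; adjust to match the source database.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%d %b %Y")
RANK_MARKERS = {"var.", "subsp.", "ssp.", "f.", "forma"}
METER_UNITS = {"", "m", "meter", "meters", "metre", "metres"}
FEET_UNITS = {"ft", "feet", "foot", "'"}
FEET_PER_METER = 3.28084


def to_iso_date(text):
    """Return YYYY-MM-DD if the text matches a known layout, else the text unchanged."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


def elevation_in_meters(text):
    """Return a bare number in meters; unrecognized values are returned unchanged."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([a-zA-Z.']*)\s*", text)
    if not match:
        return text
    value = float(match.group(1))
    unit = match.group(2).lower().rstrip(".")
    if unit in FEET_UNITS:
        value = value / FEET_PER_METER
    elif unit not in METER_UNITS:
        return text
    return f"{value:g}"


def split_scientific_name(name):
    """Naively split 'Genus species [rank] epithet' into Darwin Core name parts."""
    parts = [p for p in name.split() if p.lower() not in RANK_MARKERS]
    parts += [""] * 3  # pad so missing parts become empty strings
    return {
        "genus": parts[0],
        "specificEpithet": parts[1],
        "infraspecificEpithet": parts[2],
    }


def clean_row(row):
    """Apply the recommendations to one record (a dict keyed by column name)."""
    # Replace embedded newlines with a single space so no raw newline remains.
    cleaned = {k: re.sub(r"\s*[\r\n]+\s*", " ", v) for k, v in row.items()}
    cleaned["eventDate"] = to_iso_date(cleaned.get("eventDate", ""))
    cleaned["elevation"] = elevation_in_meters(cleaned.get("elevation", ""))
    cleaned.update(split_scientific_name(cleaned.get("scientificName", "")))
    return cleaned


def duplicate_ids(rows):
    """Return occurrenceID values that appear more than once in the dataset."""
    counts = Counter(r["occurrenceID"] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def write_csv(rows, path):
    """Write records as UTF-8 so diacritics in names survive the export."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Ambiguous dates such as 03/04/1999 are read here as month/day/year; if the source database uses day/month/year, the layout list needs to change accordingly. Values that cannot be parsed are passed through untouched so they can be reviewed by hand rather than silently altered.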

For images/media objects

  1. each media record needs to have a GUID: a persistent globally unique identifier or at least a unique (within the dataset) identifier in the occurrenceID field.
    1. if submitting media records with specimen data records, put the specimen data record into the ??? field
  2. we need an Audubon Core metadata file with one record for each media record, and we can provide coaching to help you create that file. The more detail you provide about each image, the easier it will be to retrieve.
  3. just like the catalog records, the media records need to be provided freely and with permission, and each record needs to have at least Creative Commons permission = "CC BY"
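
Before submitting, the identifier and license requirements above can be checked with a short script. The sketch below is illustrative only: the field names "identifier" and "license" are hypothetical placeholders for whatever columns your media metadata file actually uses, and it only checks that a license value is present, not which licenses iDigBio accepts.

```python
from collections import Counter


def check_media_records(records, id_field="identifier", license_field="license"):
    """Return a list of human-readable problems found in the media records."""
    problems = []
    ids = [r.get(id_field, "").strip() for r in records]
    for row_number, (record, ident) in enumerate(zip(records, ids), start=1):
        if not ident:
            problems.append(f"record {row_number}: missing identifier")
        if not record.get(license_field, "").strip():
            problems.append(f"record {row_number}: missing license")
    # Identifiers must be unique at least within the dataset.
    for ident, count in Counter(i for i in ids if i).items():
        if count > 1:
            problems.append(f"identifier {ident!r} appears {count} times")
    return problems
```

An empty list means every record has an identifier and a license value and no identifier is repeated.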

The methods for linking the catalog records to the media records are in this document, as well as explanation about creating GUIDs for the records:

Details about data ingestion requirements and guidelines are here:

Additional info about image format is here:

If you need to learn about acceptable Creative Commons licenses in iDigBio:

Sample Scenarios of Data Transformations to Prepare Data for Ingestion