Data Ingestion Guidance

From iDigBio

Contact information

If you need assistance related to data ingestion, contact data@idigbio.org.

Data Ingestion Workflow

Audience: iDigBio data ingestion staff and data providers

This page describes the process for:

  • iDigBio staff to follow to ensure that data move successfully and efficiently from the data provider to the portal, where they become available for searching.
  • Data providers to follow to ensure that data are provided efficiently and accurately to the iDigBio staff.

First step to becoming a data provider

Sending your data to iDigBio is as simple as sending an email to data@idigbio.org to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have.

iDigBio accepts specimen data and related media from any institution. If you are ready to discuss providing data to iDigBio, contact data@idigbio.org to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: Setting up an RSS feed

Verify that your institution and collection information is correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.

Data requirements for data providers

Below is what we ask of the data so that they are easily searchable in the cyberinfrastructure we provide.

There are 3 kinds of data files to submit for ingestion:

  1. specimen data with dataset metadata
  2. media data related to and attached by reference to specimen records with metadata (use of dwc:associatedMedia in the occurrence/specimen data file is not viewed as sending media)
  3. media files - e.g., non-archival .jpgs (see acceptable format here: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1)

Packaging for specimen data

In order of preference:

  1. DwC-A (Darwin Core Archive) produced by IPT or Symbiota (both of which expose the published archive on an RSS feed). IPT is available at http://www.gbif.org/ipt and Symbiota at http://symbiota.org. Providers are encouraged to use the most current version of IPT (v. 2.3 or later). Recent versions of IPT support the Audubon Core extension for media and provide improved data checking (such as enforcing unique occurrenceIDs), bug fixes, etc. Providers choosing Symbiota should contact the Symbiota Working Group.
  2. Custom RSS feed with DwC-A following the guidance at: iDigBio RSS specification
  3. Custom CSV or TXT (save the data in UTF-8 format to preserve diacritics in people and place names). This option is for sending only specimen data or only media data; DwC-A packaging is required when sending both specimen and media data.
  • A custom CSV allows providers to send data beyond standards such as Dublin Core and Darwin Core. For example, providers can send tribe taxonomic information in the field "idigbio:tribe". When creating additional fields, use field names that follow the DwC format (camel case), and consult the MISC field names (local iDigBio extensions to DwC). The host association terms are an example of an extension found in the MISC. Use XML-style field names that include the namespace prefix of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are indexed and available through the search API.
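As a minimal sketch of the custom CSV option, the following Python writes specimen records to UTF-8 text with namespace-prefixed field names. The sample values (including the tribe name) and the helper name write_specimen_csv are hypothetical; only the field-name style (dwc:termName, idigbio:tribe) and the UTF-8 requirement come from the guidance above.

```python
import csv
import io

# Hypothetical sample record; the values are made up for illustration.
RECORDS = [
    {
        "dwc:occurrenceID": "urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479",
        "dwc:scientificName": "Aeus sp.",
        "dwc:recordedBy": "José Núñez",  # diacritics are preserved in UTF-8
        "idigbio:tribe": "Aeini",        # non-standard field, made-up value
    },
]


def write_specimen_csv(records, stream):
    """Write records as CSV with a header row; return the field names used."""
    fieldnames = list(records[0].keys())
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


# To write a real file, open it with encoding="utf-8" and newline="":
# with open("occurrence.csv", "w", encoding="utf-8", newline="") as f:
#     write_specimen_csv(RECORDS, f)
buffer = io.StringIO()
header = write_specimen_csv(RECORDS, buffer)
csv_text = buffer.getvalue()
```

Opening the output file explicitly with encoding="utf-8" avoids depending on the platform default encoding, which is a common source of damaged diacritics.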

Special note to data aggregators

Note to aggregated data providers (e.g., California Consortium of Herbaria (CCH), Calbug, Tri-Trophic TCN (TTD), Consortium of Pacific Northwest Herbaria (CPNW)):

When providing us access to your data, we highly encourage you to provide your aggregated data one provider at a time, each in their own Darwin Core archive. Each dataset should be paired with a separate EML file that includes the metadata about the dataset (such as a list of contacts). iDigBio is moving towards providing data quality feedback, data correction, annotations, and other value-added information back to the providers and thus we want individual contact information for each source provider where possible. The hope is that the information could be re-integrated at the source so that higher quality data would be in place for the provider as well as be available to downstream data consumers such as iDigBio and GBIF.

However, if that is not possible or desirable, we still welcome your aggregated data as one monolith.

In the interest of people/researchers using your data in the aggregate, e.g., EOL, we encourage you to homogenize the rights information you provide. We recommend CC0 (http://creativecommons.org/publicdomain/zero/1.0/). Further info about Creative Commons licenses is below, under the 'providing media' section.

Sending data to iDigBio

  • An RSS feed to a DwC-A for ready access and update is our preference
  • Email the files to us

Specimen metadata

  • Each specimen record should have a unique (within the dataset) identifier in the dwc:occurrenceID field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected. Identifier recommendations:
  • a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)
urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479
  • a simple / bare UUID:
f47ac10b-58cc-4372-a567-0e02b2c3d479

If identifiers are not GUIDs (or specifically UUIDs), the form commonly used in the past is what is typically called the DwC (Darwin Core) triplet. This form of identifier is falling out of favor with aggregators such as GBIF:

<dwc:institutionCode>:<dwc:collectionCode>:<dwc:catalogNumber>

example with a prefix (lowercase is preferred in the prefix):

urn:catalog:TNHC:Herpetology:122

Spaces embedded within the identifier string are discouraged as are bare incrementing integers.
Further examples include:

  • an Archival Resource Key (ARK):
ark:/87286/f47ac10b-58cc-4372-a567-0e02b2c3d479
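A provider could screen identifiers before export along the following lines. This is only a sketch of the rules stated above (unique within the dataset, no embedded spaces, no bare integers); the function name check_occurrence_ids and the sample list are hypothetical, and this is not the check that the iDigBio ingestion software itself runs.

```python
import re
from collections import Counter


def check_occurrence_ids(ids):
    """Return a dict mapping each problem type to the offending identifiers."""
    counts = Counter(ids)
    return {
        "duplicate": sorted(i for i, n in counts.items() if n > 1),
        "contains_space": sorted({i for i in ids if re.search(r"\s", i)}),
        "bare_integer": sorted({i for i in ids if i.isdigit()}),
    }


sample_ids = [
    "urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "urn:catalog:TNHC:Herpetology:122",
    "urn:catalog:TNHC:Herpetology:122",  # duplicate within the dataset
    "TNHC Herpetology 123",              # embedded spaces
    "124",                               # bare integer
]
problems = check_occurrence_ids(sample_ids)
```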

UUID: We recommend uuid-4 (122 bits of total randomness) for our identifiers. There are use cases for the other versions, but 4 is typically the best when you don't care about tracking machine origin and timestamp information and simply want strong uniqueness guarantees.
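For providers who script their exports, Python's standard uuid module can produce identifiers in the preferred form. The sketch below is one possible approach; the helper names new_occurrence_id and is_uuid4_urn are hypothetical.

```python
import uuid


def new_occurrence_id():
    """Return a fresh version 4 UUID in lowercase urn:uuid: form."""
    return uuid.uuid4().urn


def is_uuid4_urn(value):
    """True if value is a lowercase urn:uuid: string holding a version 4 UUID."""
    if not value.startswith("urn:uuid:"):
        return False
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # Comparing against the canonical form rejects uppercase or unhyphenated input.
    return parsed.version == 4 and parsed.urn == value


occurrence_id = new_occurrence_id()
```

An identifier should be generated once per specimen record and then stored with that record; regenerating it on every export would defeat its purpose as a persistent identifier.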

Complete attribution and licensing

In order for each provider's data to be correctly attributed when found on the iDigBio portal, the following are important to complete:

dcterms:rights

Several examples of the use of public domain, recommended for specimen data:

dc:rights = Public Domain
dcterms:rights = http://creativecommons.org/publicdomain/zero/1.0/
dcterms:rights = http://creativecommons.org/publicdomain/mark/1.0/
Rights statements (intellectual property or otherwise) should be chosen from the Creative Commons options; CC0 is recommended. All rights or license information provided with the dataset will appear in the iDigBio portal with each record it covers.

Several more examples of the use of public domain, recommended for specimen data:

xmpRights:webStatement = http://creativecommons.org/publicdomain/mark/1.0/
xmpRights:owner = Public Domain
dcterms:bibliographicCitation
the correct attribution string for each record, e.g., Ctenomys sociabilis (MVZ 165861).
dcterms:rightsHolder
fill in this field if you filled in dcterms:rights. It states precisely who owns the data rights and ensures proper and correct attribution.
dcterms:rightsHolder = University of Florida, Florida Museum of Natural History
dcterms:accessRights
is where the precise terms of use should be placed, such as: '...you have to attribute us or provide us with a final copy of a given product'. It will be blank unless the provider has entered content at the source.

Some further guidance on this subject: when you are completing the metadata in the IPT, under Additional Metadata, it is important to consider the licensing and rights that you may wish to publish the data under. There are a couple of interesting articles describing the reasoning behind the Creative Commons licenses, http://creativecommons.org/licenses/, at the following URLs:

It may also be useful to read the Creative Commons Wiki on using Creative Commons licenses on data: http://wiki.creativecommons.org/Data (ref D. Bloom)

As a last word on the subject of 'Attribution': in the Project Information -> funding section of IPT, you should put information about the grants you received to fund digitization. The IPT dialog will guide you to the pertinent information.

Further guidance:

Permission to ingest

  • the provider needs to have permission to submit their data

Data recommendations for optimal searchability and applicability in the aggregate

Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched to the GBIF backbone taxonomy, and when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal record.

  • institutionCode and ownerInstitutionCode: we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name.
  • eventDate: put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate.
  • Meters: put elevation in METERS units in the elevation field without the units (e.g., the fields dwc:minimumElevationInMeters and dwc:maximumElevationInMeters already assume the numeric values are in meters, do not include the units with the data).
  • Escapes: do not use unescaped newline characters in text fields.
  • Data uncertainty: use the remarks fields to express doubt or missing values in data. Something like '?' is not a helpful value, and cannot be searched for.
  • No '0': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
  • decimalLatitude & decimalLongitude: make sure lat and lon coordinates are in decimal, and not N, S, E, W. For details see: http://rs.tdwg.org/dwc/terms/#decimalLatitude.
  • genus, specificEpithet, infraspecificEpithet & taxonRank: parse taxon ranks. Note: if the identification is something like Aeus sp., the taxonRank=genus.
  • scientificName: combine taxon ranks into the identification value.
  • kingdom: include kingdom and other high level ranks (phylum/division, class, and order) to assure that the indexing layer will remain faithful to your data as ingested. Our data quality flags will indicate when any of the original ranks in the data do not match the taxon names in the GBIF backbone.
  • family: include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the GBIF backbone taxonomy. Higher taxonomy is NOT intuited in the case where the DwC identification history extension is included in your archive.
  • vernacularName: include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName
  • higherClassification: include parsed higher taxonomy classification, at least kingdom and family, and the intervening ranks if possible. For details see: http://rs.tdwg.org/dwc/terms/#higherClassification.
  • nomenclaturalCode: very important when not ICBN or ICZN, e.g., using Phylocode
  • country: we use the ISO country names from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3 to standardize indexed searching in the portal (see data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). For example, for the US, the DwC fields are countryCode = US and country = United States.
  • countryCode: include a 3 character countryCode from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3. For details see: http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam.
  • continent: For details see: http://rs.tdwg.org/dwc/terms/#continent
  • dynamicProperties: when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#dynamicProperties.
  • recordNumber or fieldNumber: in our experience botanists use recordNumber and all others who have collection events use fieldNumber.
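Several of the recommendations above can be checked mechanically before an export is sent. The sketch below is a hypothetical pre-flight check, not iDigBio's data quality flagging: the function name check_record, the dwc:-prefixed keys, and the decision to accept partial dates (YYYY or YYYY-MM) are assumptions made for illustration. Note that it reports every literal '0' for review, even in fields where zero could be a genuine value.

```python
import re

PLACEHOLDERS = {"0", "?", "NA", "00/00/0000"}
# Assumption: partial ISO 8601 dates (YYYY, YYYY-MM) are accepted as well.
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def is_number(text):
    """True if text parses as a plain decimal number (no units, no N/S/E/W)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_record(record):
    """Return a list of human-readable problems found in one record (a dict of strings)."""
    problems = []
    for field, value in record.items():
        if value.strip() in PLACEHOLDERS:
            problems.append(f"{field}: placeholder value {value!r}")
        elif "\n" in value or "\r" in value:
            problems.append(f"{field}: unescaped newline")

    def present(field):
        value = record.get(field, "").strip()
        return value if value and value not in PLACEHOLDERS else None

    date = present("dwc:eventDate")
    if date and not ISO_DATE.match(date):
        problems.append("dwc:eventDate: not ISO 8601 (expected YYYY-MM-DD)")
    for field, limit in (("dwc:decimalLatitude", 90), ("dwc:decimalLongitude", 180)):
        value = present(field)
        if value and (not is_number(value) or abs(float(value)) > limit):
            problems.append(f"{field}: not a decimal value within +/-{limit}")
    for field in ("dwc:minimumElevationInMeters", "dwc:maximumElevationInMeters"):
        value = present(field)
        if value and not is_number(value):
            problems.append(f"{field}: must be a number without units")
    return problems


sample = {
    "dwc:eventDate": "06/22/2014",
    "dwc:decimalLatitude": "29.65 N",
    "dwc:decimalLongitude": "-82.32",
    "dwc:minimumElevationInMeters": "30 m",
    "dwc:country": "?",
}
issues = check_record(sample)
```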

Other fields for completeness that can be configured as defaults in IPT for all records:

Anyone considering contributing data should read these anecdotes. They come from users of iDigBio's aggregated data, and reveal issues of data quality.

Using PhyloCode nomenclature

If you are using PhyloCode nomenclature, the following fields are recommended instead of the standard Linnaean hierarchy-based fields (i.e., family, genus, specificEpithet):

  • higherClassification: for the PhyloCode clades. The recommended best practice is to separate the terms with a vertical bar (' | ').
  • taxonRemarks: to explain that you are not using Linnaean classification (http://rs.tdwg.org/dwc/terms/index.htm#taxonRemarks), and what protocol you are using, i.e., according to ....
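A short sketch of building the higherClassification value with the recommended separator. The helper name join_clades and the clade list are hypothetical and serve only to show the ' | ' formatting, with the most inclusive clade first.

```python
def join_clades(clades):
    """Join clade names, most inclusive first, with the ' | ' separator."""
    return " | ".join(name.strip() for name in clades if name.strip())


# Illustrative clade names only; substitute the clades used in your data.
higher_classification = join_clades(["Amniota", "Reptilia", "Squamata"])
```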

Packaging for images / media objects

Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.

  • Firstly, adding a field in the occurrence file for associatedMedia is not the way to include media with a specimen record. Media that come to us via this method, or embedded in a webpage, will not get the usual handling.
  • Each media record should have a unique (within the dataset) identifier in the dcterms:identifier field.
  • If submitting media records with specimen data records, here are the critical fields to fill in:
    • sample of fully-populated AC record
      • id (dc:identifier) = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records
        urn:catalog:institutionCode:collectionCode:catalogNumber
      • identifier (dc:identifier) = id of the media record - needs to be unique within Audubon Core file, is the equivalent of the occurrenceID in the occurrence file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.
        urn:catalog:institutionCode:collectionCode:Image:catalogNumber
      • format (dc:format) = Media Type / MIME Type (from the http://www.iana.org/assignments/media-types/media-types.xhtml controlled vocabulary if possible)
        image/jpeg
      • accessURI (ac:accessURI) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page.
        http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG
      • providerManagedID (ac:providerManagedID) = if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field.
        urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI (if ac:accessURI is not present, dcterms:format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.
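The requirement that the declared format agree with what ac:accessURI returns can be approximated locally by comparing the format against the media type implied by the file extension. This is a rough heuristic under that assumption, and the function name format_matches_uri is hypothetical; a stricter check would request the URL and compare the Content-Type header of the response.

```python
import mimetypes


def format_matches_uri(record):
    """True if dc:format agrees with the type guessed from the ac:accessURI extension."""
    guessed, _ = mimetypes.guess_type(record["ac:accessURI"])
    return guessed is not None and guessed == record["dc:format"].lower()


# Field values taken from the sample record above.
media_record = {
    "coreid": "urn:catalog:institutionCode:collectionCode:catalogNumber",
    "dc:identifier": "urn:catalog:institutionCode:collectionCode:Image:catalogNumber",
    "dc:format": "image/jpeg",
    "ac:accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG",
}
ok = format_matches_uri(media_record)

# A link to a web page does not match a declared image format.
web_page = dict(media_record, **{"ac:accessURI": "http://bgbasesrvr.univ.edu/specimen/1.html"})
mismatch = format_matches_uri(web_page)
```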


Here are further recommended fields to fill in:

AC Term | Sample data | Notes
ac:associatedSpecimenReference | 0e1e12ed-2261-42db-8719-ee98532dab06 | A reference to a specimen associated with this resource.
dc:rights or dcterms:rights | dc:rights - "CC BY-NC"; dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/ | preferred - dcterms:rights
ac:licenseLogoURL | http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png |
xmpRights:Owner | New York Botanical Garden | A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
dc:creator | "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden" | The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
dc:type | StillImage, Sound, MovingImage |
dc:subtype | Photograph |
dcterms:title | herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes |

  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media record. The more you can flesh out the details of the image, the more likely it will be to be highly retrievable. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
  • License: As with catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA elements in the CC media license you apply is fine with us; however, ND is not acceptable. Using ND (no derivatives) will cause the media to be rejected.
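The license rule above (BY, NC, and SA in any combination are fine; ND is rejected; public domain and CC0 are accepted) can be expressed as a small check on the license URL. This is a sketch under the assumption that licenses are given as canonical creativecommons.org URLs; the function name license_acceptable is hypothetical and this is not iDigBio's own validation code.

```python
import re

CC_LICENSE = re.compile(r"^https?://creativecommons\.org/licenses/([a-z-]+)/\d+\.\d+/?$")
PUBLIC_DOMAIN = re.compile(r"^https?://creativecommons\.org/publicdomain/(zero|mark)/1\.0/?$")


def license_acceptable(url):
    """True for CC0/public domain URLs and for CC BY licenses without the ND element."""
    url = url.strip().lower()
    if PUBLIC_DOMAIN.match(url):
        return True
    match = CC_LICENSE.match(url)
    if not match:
        return False  # not a recognizable Creative Commons URL
    elements = match.group(1).split("-")
    return "by" in elements and "nd" not in elements
```

For example, license_acceptable("http://creativecommons.org/licenses/by-nc/4.0/") is accepted, while the same URL with by-nc-nd is not.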

Possible licenses:

If you are not using IPT, and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. (No eml.xml is required; contact info and the recordset description can be sent by email.)
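A hand-made meta.xml can also be generated with a few lines of Python. The sketch below assumes a comma-separated occurrence file with one header row whose first column holds the occurrenceID; the element and attribute names follow the Darwin Core text guide as commonly used, so verify the result against that guide. The function name build_meta_xml and the file name occurrence.csv are hypothetical.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"
DWC = "http://rs.tdwg.org/dwc/terms/"


def build_meta_xml(filename, terms):
    """Build a meta.xml string for an archive with a single occurrence core.

    `terms` lists the term URI of each CSV column in order; column 0 is
    also declared as the record id.
    """
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # written literally as \n in the XML
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": DWC + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = filename
    ET.SubElement(core, "id", {"index": "0"})
    for index, term in enumerate(terms):
        ET.SubElement(core, "field", {"index": str(index), "term": term})
    return ET.tostring(archive, encoding="unicode")


meta_xml = build_meta_xml(
    "occurrence.csv",
    [DWC + "occurrenceID", DWC + "scientificName", DWC + "eventDate"],
)
```

The resulting string would be saved as meta.xml next to occurrence.csv and the two files zipped together.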

Best practice for getting Audubon Core images linked to specimen records - special cases

Relationship | Supported by | Core Type | Extensions
One-specimen-record-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core
Many-specimen-records-to-one-media file | IPT 2.2/Custom DwC-A | Audubon Core | Specimen (DwC)
Many-specimen-records-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core + Relationship

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier
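For the common one-specimen-to-many-media case (specimen core, Audubon Core extension), it can help to confirm that every media row points at an existing core record before packaging. The following is a minimal sketch with hypothetical rows and a hypothetical function name, unlinked_media; the column names occurrenceID, coreid, and identifier stand in for whatever your files actually use.

```python
def unlinked_media(core_rows, media_rows):
    """Return identifiers of media rows whose coreid matches no occurrenceID in the core."""
    core_ids = {row["occurrenceID"] for row in core_rows}
    return [m["identifier"] for m in media_rows if m["coreid"] not in core_ids]


# Hypothetical rows: one specimen with two images, plus one orphaned image.
core_rows = [{"occurrenceID": "urn:catalog:TNHC:Herpetology:122"}]
media_rows = [
    {"coreid": "urn:catalog:TNHC:Herpetology:122", "identifier": "urn:catalog:TNHC:Herpetology:Image:122a"},
    {"coreid": "urn:catalog:TNHC:Herpetology:122", "identifier": "urn:catalog:TNHC:Herpetology:Image:122b"},
    {"coreid": "urn:catalog:TNHC:Herpetology:999", "identifier": "urn:catalog:TNHC:Herpetology:Image:999a"},
]
orphans = unlinked_media(core_rows, media_rows)
```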

Sending updates to iDigBio

All updates for iDigBio should be sent to us using the method by which you originally published your data. For most data systems, this will mean generating a whole new export of your data periodically. iDigBio will examine the new data file, and convert it into an update-only dataset on our end. For publishers using RSS feeds, we automatically harvest these updates regularly, and process them in about a week unless there are interruptions in our data ingestion workflow, such as system maintenance or your update getting stuck behind a very large ingestion run. If you remove any records from your data export, iDigBio will flag those records as deleted in our system, and remove them from our indexes, but they will still be available via our data API to those who know the identifiers of the records.
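Before sending a new full export, a provider may want to preview how it differs from the previous one, in particular which records would be treated as removed. The sketch below is purely illustrative of that comparison and is not iDigBio's update process; the function name diff_exports and the sample data are hypothetical.

```python
def diff_exports(previous, current):
    """Compare two full exports, each a dict mapping identifier -> record dict."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(i for i in prev_ids & curr_ids if previous[i] != current[i]),
    }


previous = {
    "id-1": {"dwc:country": "United States"},
    "id-2": {"dwc:country": "Thailand"},
    "id-3": {"dwc:country": "Canada"},
}
current = {
    "id-1": {"dwc:country": "United States"},
    "id-2": {"dwc:country": "Siam"},       # edited record
    "id-4": {"dwc:country": "Mexico"},     # new record
}
changes = diff_exports(previous, current)
```

Here "id-3" appears under "removed", which corresponds to the records that would be flagged as deleted after the update.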

Instructions on changing identifiers

If you have already had your data ingested by iDigBio, and you decide to reformat or replace your specimen identifiers (occurrenceIDs), and are not giving us a record identifier (recordID) with your record, you will need to add the following to your Darwin Core Archive:

Non-Darwin Core Archive publishers, or providers who wish to change record identifiers, will need to contact iDigBio to facilitate the change.

Notes on getting data from EMu into a Darwin Core Archive

The cookbook recipe is provided by Larry Gall, Yale Peabody Museum.
It is straightforward to set up a feed between Axiell EMu and an IPT instance from which iDigBio can harvest. Perhaps the simplest approach is to use the scheduled operations facility in EMu to write a template that generates an output file (e.g., csv, txt) containing Darwin Core metadata to be ingested by the IPT. This output file can be produced automatically via operations at whatever frequency is desirable. Some mechanism can then be used to move the output file into a location where it is read by the IPT, either manually through the IPT UI or through a batch process. At Yale, we automate the entire workflow using cron such that 10 IPT resources get reinstantiated from EMu every day. The IPT uses MySQL as its metadata source and lives on a server separate from EMu. The output files from EMu are text files, which are scped from the EMu server to the IPT server, and used as input for daily MySQL table refreshes (truncate table xxx ; load data local infile 'yyy' into table xxx ;). In turn, the IPT is set to publish its 10 resources automatically on a daily basis.

Concern about duplicate record ingestion

Definition of a duplicate record in iDigBio: Duplicate records are two or more records in iDigBio that provide information on a single physical specimen. These records come to iDigBio from different sources. An example would be a record coming directly from the source where the physical specimen is preserved, and a copy of the information coming from an intermediary, an aggregator.

iDigBio's expectation from providers: In order to facilitate detection of duplicates, iDigBio expects providers to maintain identical globally unique identifiers (GUIDs) in the occurrenceID field. The institution holding the specimen should assign and preserve this identifier.

Detecting duplicates: Duplicates can be detected reliably ONLY if the expectation above is met. Unless consistent identifiers are present in the aggregated data, and until the community can formulate viable use cases on the desired handling of duplicate records in the portal, iDigBio does not attempt to flag these records.
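To illustrate why identical occurrenceIDs matter, the sketch below groups records from different sources by occurrenceID and reports the identifiers supplied by more than one source. It is a hypothetical illustration only (iDigBio, as stated above, does not currently flag such records); the function name and the sample records are made up.

```python
from collections import defaultdict


def find_cross_source_duplicates(records):
    """Map each occurrenceID supplied by more than one source to its sorted sources."""
    sources_by_id = defaultdict(set)
    for record in records:
        sources_by_id[record["occurrenceID"]].add(record["source"])
    return {oid: sorted(srcs) for oid, srcs in sources_by_id.items() if len(srcs) > 1}


records = [
    {"occurrenceID": "urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479", "source": "holding institution"},
    {"occurrenceID": "urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479", "source": "aggregator"},
    {"occurrenceID": "urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4", "source": "aggregator"},
]
duplicates = find_cross_source_duplicates(records)
```

If the aggregator had replaced the institution's identifier with one of its own, the two records for the same physical specimen would not be grouped.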

After your data have been ingested

After your data have been ingested the first time, iDigBio staff will let you know, and give you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. The report page has several features that will help you make improvements to your data on subsequent updates:

  • indications of data fields that will improve searchability in the aggregate
  • set you up to check data use by visitors to the iDigBio portal
  • check for geo-correctness of your georeferenced data. Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
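Some transposed coordinates can be caught before submission with a simple range test. The sketch below, with the hypothetical function name looks_transposed, only detects cases where the value in the latitude field exceeds 90 degrees in magnitude; a swapped pair whose values both fall within 90 degrees passes this test and still needs the map inspection described above.

```python
def looks_transposed(lat, lon):
    """True when lat is out of range but the swapped pair would be a valid coordinate."""
    return 90 < abs(lat) <= 180 and abs(lon) <= 90


# Illustrative values: latitude 45.5, longitude -122.6 entered in the wrong fields.
swapped = looks_transposed(-122.6, 45.5)
correct = looks_transposed(45.5, -122.6)
# Undetectable by range alone: both values are within 90 degrees.
ambiguous = looks_transposed(-82.32, 29.65)
```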

The 'Data Corrected' tab on this recordset page tells you about fields that have been improved in our search indices only; they are not changes we have made to your data.

Error handling

When data are received from the provider during the mobilizing process step, they are evaluated for fitness. Once the evaluation is successful, the ingestion process moves from mobilizing to ingesting, and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the mobilizing staff re-submit the data to the ingesting staff.

iDigBio IPT hosting

If you would like iDigBio to host your datasets, we have a GBIF registered resource here: http://ipt.idigbio.org

We can also assist you in getting GBIF endorsement for your data in our IPT.

When sending us CSV updates to your dataset hosted in this IPT, please send the whole dataset again, not only the updates. We do a replace records operation between the current data and the new data, rather than an append.

Sample scenarios of data transformations to prepare data for ingestion

Advertising your data on iDigBio on your website

We encourage you to post a link on your institution's website informing users that they will also find your data on iDigBio's portal.

Please look here for logo material: https://www.idigbio.org/wiki/index.php/IDigBio_Logo

and consider making the link point to your publishers page, something like:

https://www.idigbio.org/portal/recordsets/c50755ff-ca6d-4903-8e39-8b0e236c324f

where the UUID on the end of this link belongs to your recordset. The link to your recordset can be found here: iDigBio publishers

Additional references

If you want to learn about acceptable Creative Commons licenses in iDigBio:

Data ingestion report, progress so far

Provider assistance

Spreadsheet formula that generates a random version 4 UUID with the urn:uuid: prefix:

=LOWER(CONCATENATE("urn:uuid:",DEC2HEX(RANDBETWEEN(0,4294967295),8),"-",DEC2HEX(RANDBETWEEN(0,65535),4),"-",DEC2HEX(RANDBETWEEN(16384,20479),4),"-",DEC2HEX(RANDBETWEEN(32768,49151),4),"-",DEC2HEX(RANDBETWEEN(0,65535),4),DEC2HEX(RANDBETWEEN(0,4294967295),8)))

Process terminology for iDigBio mobilization and ingestion staff

Processing steps: each step has a start and an end; reaching the end signifies that the process has moved to the next step.

[Figure: IngestionProcess.gif]
  • negotiating - the process of determining provider's interest in data ingestion
    • begins with an email inviting providers (institutions, aggregators) to send their data to the iDigBio specimen data portal
    • open a Redmine ticket in project=Data Mobilizing
    • ends with data exported by provider, ready for inspection and ingestion.
  • mobilizing - the process of evaluating data being fit for ingestion
    • begins with provider exported data and cursory inspection
    • fill in this table with provider info: eml.xml, unless there is a good eml.xml file available (e.g., from a DwC Archive)
    • ends with data passing inspection and passing to ingesting state, Redmine ticket changes to assignee=cyberinfrastructure team
  • ingesting - the process of ingesting provider's data
    • begins with Redmine ticket change to assignee=cyberinfrastructure team
    • ends with
      • data successfully ingested, ready for consumption in the portal
      • report sent back to data mobilizing staff
      • report sent to provider. Reference: Publishers Report
      • Redmine ticket set to Status= Closed
  • evaluating - the process of evaluating a failure to be ingested
    • begins with ingestion failure
      • evaluate ingestion failure, if data error - send it back to mobilizing state for corrections or
      • evaluate ingestion failure, if ingestion error - make corrections
    • ends with data re-submission to ingesting state
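The steps above form a small state machine, which can be sketched as follows. The Step and TRANSITIONS names, the advance function, and the DONE marker (standing for a closed Redmine ticket) are hypothetical; the allowed transitions are read directly from the step descriptions.

```python
from enum import Enum


class Step(Enum):
    NEGOTIATING = "negotiating"
    MOBILIZING = "mobilizing"
    INGESTING = "ingesting"
    EVALUATING = "evaluating"
    DONE = "done"  # hypothetical terminal marker: ticket closed


TRANSITIONS = {
    Step.NEGOTIATING: {Step.MOBILIZING},
    Step.MOBILIZING: {Step.INGESTING},
    Step.INGESTING: {Step.DONE, Step.EVALUATING},        # success or ingestion failure
    Step.EVALUATING: {Step.MOBILIZING, Step.INGESTING},  # data error or ingestion error
    Step.DONE: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# A run with one data error that is sent back to mobilizing for correction.
path = [Step.MOBILIZING, Step.INGESTING, Step.EVALUATING,
        Step.MOBILIZING, Step.INGESTING, Step.DONE]
state = Step.NEGOTIATING
for step in path:
    state = advance(state, step)
```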