Data Ingestion Guidance

From iDigBio


Contact information

If you need assistance related to data ingestion, contact

Data Ingestion Workflow

Audience: iDigBio data ingestion staff and data providers

This is the process description for

  • iDigBio staff to follow to assure that data are successfully and efficiently moved from data provider to the portal, available for searching.
  • Data providers to follow to assure that data are efficiently and accurately provided to the iDigBio staff.

First step to becoming a data provider

Publishing your data in iDigBio can be as simple as sending us a personal email to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, get in touch with us to express your interest, and we will help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the portal, you would send us a link to a Darwin Core archive on an RSS feed.

iDigBio's ingestion scripts accept specimen data and related media from any institution. If you are ready to discuss providing data to iDigBio, contact us to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: Setting up an RSS feed

Verify that your institution and collection are correct here: Submit corrections as needed.

Data requirements for data providers

Below are what we ask of the data to make it easily searchable in the cyberinfrastructure we provide.

There are 3 kinds of data files to submit for ingestion:

  1. specimen data with dataset metadata
  2. media data related to and attached by reference to specimen records with metadata (use of dwc:associatedMedia in the occurrence/specimen data file is not viewed as sending media)
  3. media files - e.g., non-archival .jpgs (see acceptable format here:

Packaging for specimen data

In order of preference:

  1. DwC-A (Darwin Core Archive) produced by IPT or Symbiota (both of which expose the published archive on an RSS feed). IPT is available at: Symbiota is available at: Providers are encouraged to use the most current version of IPT (v. 2.3 or later). Recent versions of IPT support the Audubon Core extension for media and provide improved levels of data checking (such as enforcing unique occurrenceIDs), bugfixes, etc. Providers choosing Symbiota should make contact with the Symbiota Working Group.
  2. Custom RSS feed with DwC-A following the guidance at: iDigBio RSS specification
  3. Custom CSV or TXT (save the data in UTF-8 format to preserve diacritics in people and place names). This option is for sending only specimen data or only media data; DwC-A packaging is required when sending both specimen and media data.
  • A custom CSV allows providers to send data beyond standards such as Dublin Core and Darwin Core. For example, providers can send tribe taxonomic information in the field "idigbio:tribe". When creating additional fields, use field names that follow DwC format (camel case), and consult the MISC field names (local iDigBio extensions to DwC). The host association terms are an example of an extension found in the MISC. Use the XML-style field names that include the domain of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are indexed and available through the search API.

No support for DiGIR

We do not support DiGIR-based datasets, as it is an older, unsupported technology and likely to be deprecated by GBIF.

Special note to data aggregators

Note to aggregated data providers (e.g., California Consortium of Herbaria (CCH), Calbug, Tri-Trophic TCN (TTD), Consortium of Pacific Northwest Herbaria (CPNW)):

When providing us access to your data, we highly encourage you to provide your aggregated data one provider at a time, each in their own Darwin Core archive. Each dataset should be paired with a separate EML file that includes the metadata about the dataset (such as a list of contacts). iDigBio is moving towards providing data quality feedback, data correction, annotations, and other value-added information back to the providers and thus we want individual contact information for each source provider where possible. The hope is that the information could be re-integrated at the source so that higher quality data would be in place for the provider as well as be available to downstream data consumers such as iDigBio and GBIF.

However, if that is not possible or desirable, we still welcome your aggregated data as one monolith.

In the interest of people/researchers using your data in the aggregate, e.g., EOL, we encourage you to homogenize the rights information you provide. We recommend CC0. Further information about Creative Commons licenses is below, under the 'providing media' section.

Sending data to iDigBio

  • An RSS feed to a DwC-A for ready access and update is our preference
  • Email the files to us for installation in our IPT for mobilization into a DwC-A, but only after we have had a discussion.

Specimen metadata - GUIDs / identifiers (occurrenceID)

  • Each specimen record should have a unique (within the dataset) identifier in the dwc:occurrenceID field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected. Identifier recommendations:
  • a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)
  • a simple / bare UUID:

If the identifiers are not GUIDs (or specifically UUIDs), the identifiers commonly used in the past are what is typically called the DwC (Darwin Core) triplet. This form of identifier is falling out of favor with aggregators such as GBIF:

example with a prefix (lowercase is preferred in the prefix):

Spaces embedded within the identifier string are discouraged as are bare incrementing integers.
Further examples include:

  • an Archival Resource Key (ARK):

UUID: We recommend UUID version 4 (122 random bits) for our identifiers. There are use cases for the other versions, but version 4 is typically the best choice when you don't care about tracking machine origin and timestamp information and simply want strong uniqueness guarantees.
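As a minimal sketch of the two recommendations above (version 4 UUIDs in URI syntax, and no duplicate occurrenceID values within a dataset), the following uses only the Python standard library. The function names are illustrative and are not part of any iDigBio tooling.

```python
import uuid
from collections import Counter


def new_occurrence_id() -> str:
    """Return a random (version 4) UUID in URI syntax with a lowercase prefix."""
    return uuid.uuid4().urn  # has the form 'urn:uuid:<36-character uuid>'


def duplicate_ids(occurrence_ids):
    """Return identifiers that appear more than once, in first-seen order."""
    counts = Counter(occurrence_ids)
    return [oid for oid in counts if counts[oid] > 1]


ids = [new_occurrence_id() for _ in range(3)]
ids.append(ids[0])  # simulate an accidental duplicate in an export
print(duplicate_ids(ids))  # lists the one repeated identifier
```

Running a check like this on the occurrenceID column before publishing may help catch the most common cause of rejected records.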

Complete attribution and licensing

In order for each provider's data to be correctly attributed when found on the iDigBio portal, the following are important to complete:

  • Fill in your official institution code (dwc:institutionCode) and collection code (dwc:collectionCode)
  • Enter the GRBio Cool URI for institution and collection in the dwc:institutionID and dwc:collectionID fields
  • Fill in the global-to-the-dataset fields (in the EML file in your Darwin Core archive) for intellectual property and licensing

Several examples of the use of public domain, recommended for specimen data:

dc:rights = Public Domain
dcterms:rights =
dcterms:rights =
Rights statements (IP or otherwise) should be chosen from the Creative Commons options; CC0 is recommended. All rights or license information provided with the dataset will appear in the iDigBio portal with each record it covers.

Several more examples of the use of public domain, recommended for specimen data:

xmpRights:webStatement =
xmpRights:owner = Public Domain
Ctenomys sociabilis (MVZ 165861) for the correct attribution string for each record.
dcterms:rightsHolder: you should fill in this field if you filled in dcterms:rights. It states precisely who owns the data rights and will assure proper and correct attribution.
dcterms:rightsHolder = University of Florida, Florida Museum of Natural History
is where the precise terms of use should be placed, things such as: ' have to attribute us or provide us with a final copy of a given product'. It will be blank unless the provider has entered content at the source.

Some further guidance on this subject: "When you are completing the metadata in the IPT, under Additional Metadata, it is important to consider the licensing and rights that you may wish to publish the data under. There are a couple of interesting articles describing the reasoning behind the Creative Commons licenses at the following URLs:

It may also be useful to read the Creative Commons Wiki on using Creative Commons licenses on data." (ref D. Bloom)

As a last word on the subject of 'Attribution': in the Project Information -> funding section of IPT, you should put information about the grants you received to fund digitization. The IPT dialog will guide you to the pertinent information.

Further guidance:

Permission to ingest

  • the provider needs to have permission to submit their data

Data recommendations for optimal searchability and applicability in the aggregate

We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The scientific name is matched to the GBIF backbone taxonomy to correct typos and older names. When an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the index layer of the specimen record. Kingdom, when provided, is used to stop shifting to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date.

We support and encourage you to use the GBIF recommended set of occurrence record fields found here: We don't have a long list of required fields (occurrenceID, institutionCode, scientificName, kingdom, taxonRank, basisOfRecord), but we strongly recommend that you address as many of the fields below as possible. See below for further 'Taxonomy' information.
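A provider could check an export for the short list of required fields before sending it. The sketch below is one hypothetical way to do that with the standard library; it assumes a CSV whose first row is a header and treats prefixed names such as dwc:occurrenceID the same as bare names.

```python
import csv
import io

# Field names taken from the short required list in the paragraph above.
REQUIRED_FIELDS = ["occurrenceID", "institutionCode", "scientificName",
                   "kingdom", "taxonRank", "basisOfRecord"]


def missing_required(header):
    """Return required field names absent from a header row.

    Any namespace prefix (for example 'dwc:') is ignored when comparing.
    """
    present = {name.strip().split(":")[-1] for name in header}
    return [field for field in REQUIRED_FIELDS if field not in present]


# Illustrative header only; a real export would be read from a file.
sample = io.StringIO("dwc:occurrenceID,dwc:institutionCode,dwc:scientificName\n")
header = next(csv.reader(sample))
print(missing_required(header))  # ['kingdom', 'taxonRank', 'basisOfRecord']
```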

Data ownership

  • institutionCode and ownerInstitutionCode: we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
  • collectionCode: this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).


Taxonomy

  • scientificName: combine taxon ranks into the identification value; include author and year if applicable.
  • genus, specificEpithet, infraspecificEpithet & taxonRank: parse taxon ranks.
    • Note: if the identification is something like Aeus sp., the taxonRank=genus.
    • Note: the value of taxonRank must be a rank that is a Darwin Core term. Many super/sub/infra ranks are not valid in this case. Put them instead into the higherClassification amalgamated string.
  • kingdom: include kingdom and other high level ranks (phylum/division, class, and order where applicable) to assure that the indexing layer will remain faithful to your data as ingested. Our data quality flags will indicate when any of the original ranks in the data do not match the taxon names in the GBIF backbone.
  • family: include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the GBIF backbone taxonomy. Higher taxonomy is NOT intuited in the case where the DwC identification history extension is included in your archive.
  • higherClassification: include parsed higher taxonomy classification, at least kingdom and family, and the intervening ranks if possible. For details see:
  • nomenclaturalCode: very important when not ICBN or ICZN, e.g., using Phylocode
  • vernacularName: include common names for broader audience findability. For details see:

Measurements and dates

  • eventDate: put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. If you have any legitimate parts of the eventDate, parse them out into the individual numeric day, month and year fields.
  • Meters: put elevation in meters in the elevation fields, without the units (the fields dwc:minimumElevationInMeters and dwc:maximumElevationInMeters already assume the numeric values are in meters).
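Parsing the parts of an ISO 8601 eventDate into numeric year, month and day fields could look like the sketch below. It is deliberately simple: it accepts only YYYY, YYYY-MM or YYYY-MM-DD, does not validate calendar ranges, and does not handle date ranges or times. The function name is illustrative.

```python
import re

# Accepts YYYY, YYYY-MM or YYYY-MM-DD; a day is only allowed after a month.
_ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def split_event_date(event_date):
    """Return a dict with year, month and day as ints (None where absent)."""
    match = _ISO_DATE.match(event_date.strip())
    if not match:
        # Not in the expected format: leave all parts empty rather than guess.
        return {"year": None, "month": None, "day": None}
    year, month, day = (int(part) if part else None for part in match.groups())
    return {"year": year, "month": month, "day": day}


print(split_event_date("2014-06-22"))  # {'year': 2014, 'month': 6, 'day': 22}
print(split_event_date("2014"))        # {'year': 2014, 'month': None, 'day': None}
```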

Data tics

  • Escapes: do not use unescaped newline or tab characters in text fields.
  • Data uncertainty: use the remarks fields to express doubt or missing values in data. Using '?' is not a helpful value, and cannot be searched for.
  • No '0': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
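The cautions above can be applied as a cleaning pass before export. The following is a sketch under stated assumptions: records are dictionaries of strings, tabs and newlines are replaced with single spaces (one of several possible treatments), and a coordinate pair is blanked only when both dwc:decimalLatitude and dwc:decimalLongitude are zero, since a single zero can be a legitimate value.

```python
# Placeholders named in the list above; extend for your own export.
PLACEHOLDERS = {"?", "NA", "00/00/0000"}


def _is_zero(value):
    """True when value is a non-empty string that parses to numeric zero."""
    if not value:
        return False
    try:
        return float(value) == 0.0
    except ValueError:
        return False


def clean_record(record):
    """Return a copy with placeholders blanked and tabs/newlines collapsed."""
    cleaned = {}
    for field, value in record.items():
        value = " ".join(value.split())  # collapses tabs, newlines, repeated spaces
        cleaned[field] = "" if value in PLACEHOLDERS else value
    # A 0/0 coordinate pair is treated here as "no value" (an assumption).
    if _is_zero(cleaned.get("decimalLatitude")) and _is_zero(cleaned.get("decimalLongitude")):
        cleaned["decimalLatitude"] = ""
        cleaned["decimalLongitude"] = ""
    return cleaned
```

Doubt about a value belongs in the relevant remarks field rather than in a placeholder, so a pass like this only removes unhelpful values and does not record why they were missing.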


Aggregating data within a record

GenBank and other genetic sequence references

  • associatedSequences

The researchers who use our data are especially appreciative when collections people add genetic sequence identifiers to their specimen records ('|' separated list). For details see:

Collection Event

  • recordNumber or fieldNumber: in our experience botanists use recordNumber and all others who have collection events use fieldNumber.

Other fields for completeness that can be configured as defaults in IPT for all records

Dataset metadata (information about the dataset as a whole, better attribution)

If you are building a Darwin Core Archive via IPT, Symbiota or some other means, be sure to include project ID (the grant number) in the EML file (on the 'Project Data' tab in IPT) that any of your records were created with. This will greatly increase the correct and complete attribution your data gets when it is used by researchers. This information resides in the project block of the meta.eml file in the archive.

  • Project ID


Anyone considering contributing data should read these anecdotes. They come from users of iDigBio's aggregated data, and reveal issues of data quality.

Data downloads

When your data are downloaded by users of the portal, both the raw and indexed data are included. For details, see

Using PhyloCode nomenclature

If you are using PhyloCode nomenclature, the following fields are recommended (in addition to scientificName), instead of the standard Linnaean hierarchy-based fields (i.e., family, genus, specificEpithet):

  • higherClassification: for the PhyloCode clades. The recommended best practice is to separate the terms with a vertical bar (' | ').
  • taxonRemarks: to explain that you are not using Linnaean classification, and what protocol you are using, i.e., according to ....
  • nomenclaturalCode: indicate the naming system you are using.
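The three fields above could be assembled as in this sketch. The function name, the clade names and the remark text in the usage are placeholders for illustration, not real taxonomic data; only the ' | ' separator comes from the recommendation above.

```python
def build_phylocode_fields(scientific_name, clades, protocol_note):
    """Assemble the recommended fields for a record that uses PhyloCode names."""
    return {
        "scientificName": scientific_name,
        # Clades separated with a vertical bar, skipping empty entries.
        "higherClassification": " | ".join(c.strip() for c in clades if c.strip()),
        "taxonRemarks": protocol_note,
        "nomenclaturalCode": "PhyloCode",
    }


record = build_phylocode_fields(
    "Aus bus",                      # placeholder name
    ["CladeA", "CladeB", "CladeC"],  # placeholder clades, outermost first
    "Not using Linnaean classification",
)
print(record["higherClassification"])  # CladeA | CladeB | CladeC
```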

Packaging for images / media objects - identifiers

Consult iDigBio's media policy: while preparing your media.

  • Firstly, adding a field in the occurrence file for associatedMedia is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
  • Each media record should have a unique and persistent identifier in the column defined by
  • Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.

A pristine sample of a minimally-populated AC CSV published via an extension in a Darwin Core Archive:

coreid, identifier, type, format, accessURI, rights, owner, creator, metadataLanguage
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4, StillImage, image/jpeg,,, Museum of the USA, John Smith, eng

Another variation based on real-world data:

coreid, identifier, type, format, accessURI, rights, owner, creator, metadataLanguage
urn:catalog:MUSA:fish:123, 32e5da5d-c747-435c-a368-07d989259bf4, StillImage, image/jpeg,, CC0, Museum of the USA, John Smith, eng

The columns are defined in the accompanying meta.xml:


  • If submitting media records with specimen data records, here are the critical fields to fill in:
    • coreid - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples:
    • identifier (dcterms:identifier or dc:identifier) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID. Examples:
    • format (dc:format) = Media Type / MIME Type (from controlling vocabulary if possible). Examples:
    • accessURI (ac:accessURI) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page. Examples:
    • providerManagedID (ac:providerManagedID) = (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. Examples:

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI (if ac:accessURI is not present, dcterms:format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded on a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself.
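A rough pre-check for the format and accessURI agreement described above is sketched below. It is an approximation: it guesses the media type from the file extension in the URL path using Python's mimetypes module and never fetches the resource, so it cannot confirm what the server returns. The function name and example URLs are illustrative.

```python
import mimetypes
from urllib.parse import urlparse


def format_mismatch(declared_format, access_uri):
    """Return True when the URL's extension suggests a different media type.

    Only the URL path is inspected. When no type can be guessed from the
    extension, the result is False, meaning "cannot tell".
    """
    guessed, _ = mimetypes.guess_type(urlparse(access_uri).path)
    return guessed is not None and guessed != declared_format.strip().lower()


print(format_mismatch("image/jpeg", "http://example.org/media/123.jpg"))      # False
print(format_mismatch("image/jpeg", "http://example.org/specimen/123.html"))  # True
```

A stronger check would request the resource and compare the declared format with the Content-Type header in the response.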

Here are further recommended fields to fill in:

AC Term | Sample data | Notes
ac:associatedSpecimenReference | 0e1e12ed-2261-42db-8719-ee98532dab06 | A reference to a specimen associated with this resource.
dc:rights or dcterms:rights | dc:rights - "CC BY-NC"; dcterms:rights - | preferred - dcterms:rights
xmpRights:Owner | New York Botanical Garden | A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
dc:creator | "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden" | The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
dc:type | StillImage, Sound, MovingImage |
dc:subtype | Photograph |
dcterms:title | herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes |
  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, with one record for each media record. The more you can flesh out the details of the image, the more likely it is to be highly retrievable. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
  • License: Just like permission of catalog records, the media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See for a more detailed explanation of the CC BY license. Any combination of BY, NC, and SA of CC media license you wish to apply is fine with us, however ND is not acceptable. Using ND (no derivatives) will cause the media to be rejected.
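Because a license containing ND causes media to be rejected, a provider might screen license values before publishing. The sketch below looks for the ND element in abbreviated statements such as "CC BY-NC-ND" or in Creative Commons license URLs. It does not recognize spelled-out wording such as "No Derivatives", and the function name is illustrative.

```python
import re


def license_acceptable(license_text):
    """Return False when a Creative Commons statement includes the ND element."""
    # Split on anything that is not a letter or digit, so 'by-nc-nd' and
    # '.../licenses/by-nd/4.0/' both yield a separate 'nd' token.
    tokens = re.split(r"[^a-z0-9]+", license_text.lower())
    return "nd" not in tokens


for value in ["CC0", "CC BY", "CC BY-NC-SA", "CC BY-ND"]:
    print(value, license_acceptable(value))
```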

Possible licenses:

If you are not using IPT, and only delivering one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. (No eml.xml is required; contact info and a recordset description can be sent by email.)
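For providers who prefer to script the meta.xml rather than type it, the sketch below builds a minimal descriptor for a single core occurrence file. The file name, column list and delimiter settings are assumptions to adapt to your own export, and the namespace and attribute names follow the Darwin Core text guide as best understood here, so verify the output against that guide before use.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"
# Column order of a hypothetical occurrence.csv; column 0 holds the record id.
CORE_COLUMNS = ["occurrenceID", "institutionCode", "scientificName",
                "kingdom", "taxonRank", "basisOfRecord"]


def build_meta_xml(filename, columns):
    """Return a minimal meta.xml string describing one core occurrence file."""
    archive = ET.Element("archive", {"xmlns": "http://rs.tdwg.org/dwc/text/"})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # written literally as \n in the attribute
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",     # the CSV has one header row
        "rowType": DWC + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = filename
    ET.SubElement(core, "id", {"index": "0"})
    # Map each column position to its Darwin Core term URI.
    for index, name in enumerate(columns):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC + name})
    return ET.tostring(archive, encoding="unicode")


print(build_meta_xml("occurrence.csv", CORE_COLUMNS))
```

A media extension would be described the same way with an extension element and a coreid index in place of id; that part is omitted here to keep the sketch short.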

Best practice for getting Audubon Core images linked to specimen records - special cases

Relationship | Supported by | Core Type | Extensions
One-specimen-record-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core
Many-specimen-records-to-one-media file | IPT 2.2/Custom DwC-A | Audubon Core | Specimen (DwC)
Many-specimen-records-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core + Relationship

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier

Sending updates to iDigBio

All updates for iDigBio should be sent to us using the method by which you originally published your data. For most data systems, this will mean generating a whole new export of your data periodically. iDigBio will examine the new data file, and convert it into an update-only dataset on our end. For publishers using RSS feeds, we automatically harvest these updates regularly, and process them in about a week unless there are interruptions in our data ingestion workflow, such as system maintenance or your update getting stuck behind a very large ingestion run. If you remove any records from your data export, iDigBio will flag those records as deleted in our system, and remove them from our indexes, but they will still be available via our data API to those who know the identifiers of the records.
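Since each update is a full export and records missing from it are flagged as deleted, it can be useful to preview the effect of a new export before publishing it. The sketch below compares the identifiers of two exports. It illustrates the idea only and is not iDigBio's own update logic.

```python
def summarize_update(previous_ids, new_ids):
    """Compare two full exports by record identifier."""
    previous, new = set(previous_ids), set(new_ids)
    return {
        "added": sorted(new - previous),
        # Identifiers absent from the new export would be flagged as deleted.
        "removed": sorted(previous - new),
        "retained": sorted(previous & new),
    }


print(summarize_update(["a", "b", "c"], ["b", "c", "d"]))
# {'added': ['d'], 'removed': ['a'], 'retained': ['b', 'c']}
```

An unexpectedly long "removed" list is a hint that the export was truncated or that identifiers changed between exports.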

Instructions on changing identifiers (occurrenceID)

If you have already had your data ingested by iDigBio, and you decide to reformat or replace your specimen identifiers (occurrenceIDs), and are not giving us a record identifier (recordID via Symbiota) with your record, you will need to add the following to your Darwin Core Archive:

Non-Darwin Core Archive publishers, or providers who wish to change record identifiers, will need to contact iDigBio to facilitate the change.

Notes on getting data from EMu into a Darwin Core Archive

The cookbook recipe is provided by Larry Gall, Yale Peabody Museum
It is straightforward to set up a feed between Axiell EMu and an IPT instance from which iDigBio can harvest. Perhaps the simplest approach is to use the scheduled operations facility in EMu to write a template that generates an output file (e.g., csv, txt) containing Darwin Core metadata to be ingested by the IPT. This output file can be produced automatically via operations at whatever frequency is desirable. Some mechanism can then be used to move the output file into a location where it is read by the IPT, either manually through the IPT UI or through a batch process. At Yale, we automate the entire workflow using cron such that 10 IPT resources get reinstantiated from EMu every day. The IPT uses MySQL as its metadata source and lives on a server separate from EMu. The output files from EMu are text files, which are scped from the EMu server to the IPT server, and used as input for daily MySQL table refreshes (truncate table xxx ; load data local infile 'yyy' into table xxx ;). In turn, the IPT is set to publish its 10 resources automatically on a daily basis.

Concern about duplicate record ingestion

Definition of a duplicate record in iDigBio: Duplicate records are two or more records in iDigBio that provide information on a single physical specimen. These records come to iDigBio from different sources. An example would be a record coming directly from the source where the physical specimen is preserved, and a copy of the information coming from an intermediary, an aggregator.

iDigBio's expectation from providers: In order to facilitate detection of duplicates, iDigBio expects providers to maintain identical globally unique identifiers (GUIDs) in the occurrenceID field. The institution holding the specimen should assign and preserve this identifier.

Detecting duplicates: Duplicates can be detected reliably ONLY if the expectation above is met. Unless consistent identifiers are present in the aggregated data, and until the community can formulate viable use cases on the desired handling of duplicate records in the portal, iDigBio does not attempt to flag these records.

After your data have been ingested

After your data have been ingested the first time, iDigBio staff will let you know by sending you a link to your recordset, e.g., This report page has several features that will help you make improvements to your data on subsequent updates:

  • 'Data Corrected' tab - these are fields that have been updated in our search index layer to help search in the aggregate.
    • At the very least, check for the geo-correctness of your georeferenced data: click on any flag to see the records that have been corrected.
  • 'Data Use' tab - these are individual reports of your data being used by visitors to the portal or by those looking for data using the API.
  • 'Raw' - this is handy for contrasting what you sent us versus what is in the index layer. When data are downloaded, the resulting DwC-A contains both versions.

Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
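Some transposed coordinates can be caught before publishing with a simple range check, sketched below. It only detects a swap when the latitude value falls outside -90 to 90; a swapped pair where both numbers are within that range looks valid to this check and still needs to be spotted on the map. The function name is illustrative.

```python
def coordinate_issue(lat, lon):
    """Classify a decimal latitude/longitude pair by range alone."""
    lat_ok = -90 <= lat <= 90
    lon_ok = -180 <= lon <= 180
    if lat_ok and lon_ok:
        return "ok"
    # The pair is invalid as given but would be valid if the values were swapped.
    if -180 <= lat <= 180 and -90 <= lon <= 90:
        return "possibly transposed"
    return "out of range"


print(coordinate_issue(47.6, -122.3))   # ok
print(coordinate_issue(-122.3, 47.6))   # possibly transposed
print(coordinate_issue(200.0, 10.0))    # out of range
```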

There is a lot of attention now by data providers, data aggregators and especially by data users/researchers on improving the quality of mobilized data. Each recordset that we ingest has a data quality flags report (see above example link). If the data have also been ingested by GBIF, see their documentation here: There is work on-going at TDWG to standardize what is flagged and what it is called.

Error handling

When data are received from the provider during the mobilizing process step, they are evaluated for fitness. Once the evaluation is successful, the ingestion process moves from mobilizing to ingesting, and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the mobilizing staff re-submit the data to the ingesting staff.

iDigBio IPT hosting

If you would like iDigBio to host your datasets, we have a GBIF registered resource here:

We can also assist you in getting GBIF USGS endorsement for your data in our IPT (applicable to US-based datasets only).

When sending us CSV updates to your dataset hosted in this IPT, please send the whole dataset again, not only the updates. We do a replace-records operation between the current data and the new data, rather than an append. If you are adding fields to your export, put them at the end of the columns in your CSV, because IPT is not set up to look for your mapped fields in a different order than the original.
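A quick way to confirm that new fields were only appended is to compare the header row of the previous export with the new one. This sketch assumes both headers are available as lists of column names; the function name and sample headers are illustrative.

```python
def columns_safely_extended(original_header, new_header):
    """True when the original columns keep their positions and new ones only follow."""
    return new_header[:len(original_header)] == original_header


old = ["occurrenceID", "scientificName", "eventDate"]
print(columns_safely_extended(old, old + ["recordedBy"]))   # True
print(columns_safely_extended(old, ["recordedBy"] + old))   # False
```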

Sample scenarios of data transformations to prepare data for ingestion

Advertising your data on iDigBio on your website

We encourage you to post a link on your institution's website informing users that they will also find your data on iDigBio's portal.

Please look here for logo material:

and consider making the link to be to your publishers page, something like:
where the UUID on the end of this link belongs to your recordset. The link to your recordset can be found here: iDigBio publishers

Additional references

If you want to learn about acceptable Creative Commons licenses in iDigBio:

Data ingestion report, progress so far

Provider assistance


Process terminology for iDigBio mobilization and ingestion staff

Processing steps: each step has a start and an end, and reaching the end signifies that the work has moved to the next step.

  • negotiating - the process of determining provider's interest in data ingestion
    • begins with an email invitation to providers (in institutions, aggregators) to send their data to the iDigBio specimen data portal
    • open a Redmine ticket in project=Data Mobilizing
    • ends with data exported by provider, ready for inspection and ingestion.
  • mobilizing - the process of evaluating data being fit for ingestion
    • begins with provider exported data and cursory inspection
    • fill in this table with provider info: eml.xml, unless there is a good eml.xml file available (e.g., from a DwC Archive)
    • ends with data passing inspection and passing to ingesting state, Redmine ticket changes to assignee=cyberinfrastructure team
  • ingesting - the process of ingesting provider's data
    • begins with Redmine ticket change to assignee=cyberinfrastructure team
    • ends with
      • data successfully ingested, ready for consumption in the portal
      • report sent back to data mobilizing staff
      • report sent to provider. Reference: Publishers Report
      • Redmine ticket set to Status= Closed
  • evaluating - the process of evaluating a failure to be ingested
    • begins with ingestion failure
      • evaluate ingestion failure, if data error - send it back to mobilizing state for corrections or
      • evaluate ingestion failure, if ingestion error - make corrections
    • ends with data re-submission to ingesting state