Data Ingestion Guidance: Difference between revisions

From iDigBio
 
(95 intermediate revisions by 5 users not shown)
Line 4: Line 4:
If you need assistance related to data ingestion, contact [mailto:data@idigbio.org data@idigbio.org].
= Data Ingestion Workflow  =
Working copy 1.3 (December 2015)


Audience: iDigBio data ingestion staff and data providers
Line 14: Line 12:


== First step to becoming a data provider  ==
Sending your data to iDigBio is as simple as sending an email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have.
'''PUBLISH''': Publishing your data in iDigBio is as simple as sending a personal email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the iDigBio portal, you would send a link to a Darwin Core Archive on an RSS feed.
 
iDigBio's ingestion scripts accept specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
 
'''REGISTER''': For US-based institutions, verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.


iDigBio accepts specimen data and related media from '''any''' institution. If  you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
== Darwin Core Validator Tools ==


Verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
'''IMPORTANT:''' ''Before'' you submit data to us: please ensure your data are [http://rs.tdwg.org/dwc/terms/ Darwin Core] (DwC) compliant. GBIF provides two useful data tools that can check data for DwC compliance.
 
'''Data Validator''': https://www.gbif.org/tools/data-validator This is an in-depth tool that checks for syntactical correctness of a provided dataset as well as checking the validity of the content contained within the dataset. This tool is free to use, and '''requires logging in with a free GBIF account'''. (STRONGLY RECOMMENDED)
 
'''Darwin Core Archive Validator''': https://tools.gbif.org/dwca-validator/ This is a simple tool that will check the structure of a provided Darwin Core Archive (DwC-A). This tool is less in-depth than the above Data Validator, but is also free to use, and '''does not require a login'''. (Acceptable)
 
The above tools can be used on '''any dataset''', regardless of whether it is to be published on GBIF. Using these tools will not publish your data to GBIF.


== Data requirements for data providers  ==
Line 30: Line 38:
==Packaging for specimen data==
In order of preference:
#DwC-A (Darwin Core Archive) produced by IPT on an RSS feed. IPT is available at: https://code.google.com/p/gbif-providertoolkit/  Providers are encouraged to use the most current version of IPT (v. 2.1 or later) that supports the Audubon Core extension, especially if they want to include media with their specimen records.
#DwC-A (Darwin Core Archive) produced by IPT or Symbiota (both of which expose the published archive on an RSS feed). IPT is available at: http://www.gbif.org/ipt  Symbiota is available at: http://symbiota.org Providers are encouraged to use the most current version of IPT (v. 2.3 or later). Recent versions of IPT support the Audubon Core extension for media and provide improved levels of data checking (such as enforcing unique occurrenceIDs), bugfixes, etc. Providers choosing Symbiota should make contact with the [[IDigBio_Working_Groups#Symbiota_Working_Group_.28SWG.29|Symbiota Working Group]].
#Custom DwC-A on an RSS feed produced by Symbiota
#Custom RSS feed with DwC-A following the guidance at: [[CYWG iDigBio DwC-A Pull Ingestion| iDigBio RSS specification]]
#Custom CSV or TXT (save the data in UTF-8 format to preserve diacritics in people and place names). This option is for sending only specimen data or only media data; DwC-A packaging is required when sending both specimen and media data.
#Custom RSS feed following the guidance at: [[CYWG iDigBio DwC-A Pull Ingestion| iDigBio RSS specification]]


* DwC-A uses field names from:
Line 40: Line 47:


* A custom CSV allows providers to send data beyond standards such as Dublin Core and Darwin Core. For example, providers can send tribe taxonomic information in the field "idigbio:tribe". When creating additional fields, use field names that follow the DwC format (camel case). Additionally, consult the [[MISC-Authority-File-Working-Group#Data_Element_Lists_by_Data_Model_Concept|MISC field names]] (local iDigBio extensions to DwC). The host association terms are an example of an extension found in the MISC. Use the XML-style field names that include the domain of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are indexed and available through the search API.
===No support for DiGIR===
We do not support DiGIR-based datasets, as DiGIR is an older, unsupported technology that is likely to be deprecated by GBIF.


===Special note to data aggregators===
Line 49: Line 59:


In the interest of people/researchers using your data in the aggregate, e.g., EOL, we encourage you to homogenize the rights information you provide. We recommend CC0 (http://creativecommons.org/publicdomain/zero/1.0/). Further info about Creative Commons licenses is below, under the 'providing media' section.
===Note on Sensitive Data/Endangered Species Data===
It is the responsibility of the data provider to obfuscate, mask, or exclude data related to sensitive/endangered species. If any of these data are included in a dataset, iDigBio will not take measures to exclude them. Providers may use [https://terms.tdwg.org/wiki/dwc:informationWithheld dwc:informationWithheld] to indicate records that have sensitive information withheld; [http://portal.idigbio.org/portal/records/e6c5dffc-4ad1-4d9d-800f-5796baec1f65 see an example record here].
===Note on Federal Data===
iDigBio is actively avoiding the ingestion of federal recordsets into the portal. However, if federal records happen to be included with a valid non-federal specimen records dataset, iDigBio will not take any measures to filter or exclude the federal records.


===Sending data to iDigBio===
*An [[CYWG iDigBio DwC-A Pull Ingestion|RSS feed]] to a DwC-A for ready access and update is our preference
*Email the files to us
*Email the files to us for installation in our IPT for mobilization into a DwC-A, but only after we have had a discussion.
 
===Specimen metadata - GUIDs / identifiers (occurrenceID)===
*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected.
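
Duplicate identifiers can be caught before a dataset is published. The following is a minimal Python sketch, not iDigBio's ingestion code: it assumes the occurrence data is a CSV with a column headed <code>occurrenceID</code>, and the function name and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(csv_text, id_column="occurrenceID"):
    """Return a dict of identifier -> count for identifiers seen more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column].strip() for row in reader)
    return {identifier: n for identifier, n in counts.items() if n > 1}


# Hypothetical occurrence data; a real file would be read from disk as UTF-8.
SAMPLE = """occurrenceID,catalogNumber
urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479,122
urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,123
urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479,124
"""

# Any identifier reported here would need to be made unique before submission.
print(find_duplicate_ids(SAMPLE))
```

An empty result means every identifier in the file is unique within the dataset.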
 
Identifier recommendations:
 
*a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)<br>
<pre>urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479</pre>
 
*a simple / bare UUID:<br>
<pre>f47ac10b-58cc-4372-a567-0e02b2c3d479</pre>
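
Both identifier forms above can be produced with Python's standard <code>uuid</code> module. This is an illustrative sketch only; the helper names are hypothetical, and it assumes each identifier is generated once and stored with the record so that it stays the same across exports.

```python
import uuid


def new_occurrence_id():
    """Return a random (version 4) UUID in URI syntax, i.e. prefixed with urn:uuid:."""
    return uuid.uuid4().urn


def to_urn(bare_uuid):
    """Convert a bare UUID string to the lowercase urn:uuid: form."""
    return uuid.UUID(bare_uuid).urn


# A bare, uppercase UUID is normalized to the lowercase URI form.
print(to_urn("F47AC10B-58CC-4372-A567-0E02B2C3D479"))
print(new_occurrence_id())
```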


===Specimen metadata===
If not GUIDs or specifically UUIDs, the identifiers commonly used in the past are what is typically called the DwC (Darwin Core) triplet. This form of identifier is falling out of favor with aggregators such as GBIF:<br>
*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected. Identifiers, if not GUIDs or specifically UUIDs, are what is typically called the DwC (Darwin Core) triplet:<br>
  <''dwc:institutionCode''>:<''dwc:collectionCode''>:<''dwc:catalogNumber''>
  <''dwc:institutionCode''>:<''dwc:collectionCode''>:<''dwc:catalogNumber''>
example with a prefix (lowercase is preferred in the prefix): <pre>urn:catalog:TNHC:Herpetology:122</pre>
Spaces embedded within the identifier string are discouraged as are bare incrementing integers.<br>
Further examples include:
*a simple / bare UUID:<br>
<pre>f47ac10b-58cc-4372-a567-0e02b2c3d479</pre>
*a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)<br>
<pre>urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479</pre>
*an Archival Resource Key (ARK):<br>


Line 79: Line 99:
**Check that your institution/collection info is correct here: https://www.idigbio.org/portal/collections
**Check your entry in grbio.org and make sure it is current and complete : Institutions: http://grbio.org/find-biorepositories
**Make sure you have used the same institutionCode and collectionCode in GRBio and your EML/IPT dialog
**Make sure you have used the same institutionCode and collectionCode in the collection resources above (GRBio and iDigBio) as in your occurrence data file.
*Enter the GRBio Cool URI for institution and collection in the ''dwc:institutionID'' and ''dwc:collectionID'' fields
**Go to http://GRBio.org to get the Cool URI value for your institution/collection in the alternateIdentifier field in the EML dialog (e.g., http://biocol.org/urn:lsid:biocol.org:col:15587).  
**Go to http://GRBio.org to get the Cool URI value for your institution/collection in the alternateIdentifier field in the IPT EML dialog (e.g., http://biocol.org/urn:lsid:biocol.org:col:15587).  
*Fill in the DwC global-to-the-dataset DwC fields (in the EML file in your Darwin Core archive) for intellectual property and licensing
** [http://terms.tdwg.org/wiki/dcterms:rights dcterms:rights]
Line 129: Line 149:
===Permission to ingest===
*the provider needs to have permission to submit their data
===Symbiota - GBIF - iDigBio field differences===
Neil Cobb, PI of the SCAN and LepNet TCNs, contributed this handy [https://www.idigbio.org//sites/default/files/sites/default/files/Symbiota_Fields_DwC_GBIF_iDigBio-2.xlsx explanation] of field use and differences between Symbiota, GBIF and iDigBio.


===Data recommendations for optimal searchability and applicability in the aggregate===
*'''institutionCode''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name.
We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an '''index''' to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The ''scientific name'' is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] to correct typos and older names. See the "Backbone matching" section of [https://www.gbif.org/infrastructure/processing GBIF Data Processing] for information on GBIF's methods. When iDigBio finds an exact or fuzzy match in the GBIF backbone, the matched info is used as the authority to fill in and regularize the taxonomic information in the indexed version of the specimen record. ''Kingdom'', when provided, is used to prevent a shift to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date.
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four character year. e.g., http://rs.tdwg.org/dwc/terms/#eventDate.
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further '[https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Taxonomy Taxonomy]' information.
*'''Meters''': put elevation in METERS units in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so there is no need to include the units with the data).
====Data stewardship / ownership====
*'''Escapes''': do not use unescaped newline characters in text fields.
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. Something like '?' is not a helpful value, and cannot be searched for.
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
 
*'''decimalLatitude''' & '''decimalLongitude''': make sure lat and lon coordinates are in decimal, and not N, S, E, W. For details see:  http://rs.tdwg.org/dwc/terms/#decimalLatitude.
====Taxonomy====
*'''genus''', '''specificEpithet''', '''infraspecificEpithet''' & '''taxonRank''': parse taxon ranks. Note: if the identification is something like ''Aeus sp.'', the taxonRank=genus.
*'''[[scientificName]]''': combine taxon ranks into the identification value, and include author and year if applicable.
*'''scientificName''': combine taxon ranks into the identification value.
*'''genus''', '''specificEpithet''', '''infraspecificEpithet''' & '''taxonRank''': parse taxon ranks.
*'''family''': include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy]. Higher taxonomy is NOT intuited in the case where the identification history extension is included in your archive.
**Note: if the identification is something like ''Aeus sp.'', the taxonRank=genus.  
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
**Note: the value of taxonRank must be a rank that is a Darwin Core term. Many super/sub/infra ranks are not valid in this case. Put them instead into the higherClassification amalgamated string.
*'''[[kingdom]]''': include kingdom and other high level ranks ('''phylum/division''', '''class''', and '''order''' where applicable) to assure that the indexing layer will remain faithful to your data as ingested. Our data quality flags will indicate when any of the original ranks in the data do not match the taxon names in the GBIF backbone.
*'''family''': include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy]. Higher taxonomy is NOT intuited in the case where the DwC identification history extension is included in your archive.
*'''higherClassification''': include parsed higher taxonomy classification, at least kingdom and family, and the intervening ranks if possible. For details see: http://rs.tdwg.org/dwc/terms/#higherClassification.
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
*'''country''': we use the ISO country names from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3 to purify the portal indexed searching. (see data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). For example for the US, the DwC fields countryCode = US and the country = United States.
*'''vernacularName''': include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName
*'''countryCode''': include a 3 character countryCode from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3. For details see:  http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam.
 
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate. If you have any legitimate parts of the eventDate field, parse them out into the numeric individual ''day'', ''month'' and ''year'' fields.
*'''Meters''': put elevation in METERS units in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, do not include the units with the data).
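
As an illustration of the eventDate recommendation, the sketch below uses Python's standard <code>datetime</code> module to split a YYYY-MM-DD value into numeric ''year'', ''month'' and ''day'' parts and to format numeric parts back into an ISO 8601 date. The function names are hypothetical, and the sketch assumes a complete date; partial dates would need separate handling.

```python
from datetime import date


def split_event_date(event_date):
    """Parse a YYYY-MM-DD eventDate and return its year, month and day as integers."""
    # fromisoformat raises ValueError for strings that are not valid ISO 8601 dates.
    parsed = date.fromisoformat(event_date)
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}


def to_event_date(year, month, day):
    """Format numeric date parts as an ISO 8601 calendar date (YYYY-MM-DD)."""
    return date(year, month, day).isoformat()


print(split_event_date("2014-06-22"))
print(to_event_date(2014, 6, 22))
```

A value such as <code>06/22/2014</code> raises a <code>ValueError</code> here, which makes non-ISO dates easy to spot before export.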
 
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. Using '?' is not a helpful value, and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000', 'unknown' and any other placeholder value.
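
One way to apply the placeholder advice at export time is sketched below in Python. It is only an illustration: the placeholder set mirrors the values listed above, the function and field selection are hypothetical, and it assumes the caller names only those fields where a value such as '0' is known to mean "no value" in the source database.

```python
# Placeholder values named in the guidance, compared case-insensitively.
PLACEHOLDERS = {"0", "?", "na", "00/00/0000", "unknown"}


def blank_placeholders(record, fields):
    """Return a copy of record with placeholder values in the named fields emptied."""
    cleaned = dict(record)
    for field in fields:
        value = cleaned.get(field, "")
        if value.strip().lower() in PLACEHOLDERS:
            cleaned[field] = ""
    return cleaned


# Hypothetical record exported from a collection database.
record = {
    "catalogNumber": "0",
    "decimalLatitude": "0",
    "decimalLongitude": "0",
    "eventDate": "00/00/0000",
    "country": "unknown",
}
fields_to_check = ["decimalLatitude", "decimalLongitude", "eventDate", "country"]
print(blank_placeholders(record, fields_to_check))
```

The catalogNumber is left untouched because it is not in the list of fields to check.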
 
====Geolocation====
*'''country''': we recommend using the TGN preferred names [http://www.getty.edu/research/tools/vocabularies/tgn/ Getty Thesaurus of Geographic Names]. (See also the data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags).  
*'''countryCode''': For details see:  http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam.
*'''continent''': For details see: http://rs.tdwg.org/dwc/terms/#continent
*'''decimalLatitude''' & '''decimalLongitude''': make sure lat and lon coordinates are in decimal, and not N, S, E, W. For details see:  http://rs.tdwg.org/dwc/terms/#decimalLatitude.
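
Coordinates recorded as degrees, minutes and seconds with a hemisphere letter can be converted to signed decimal degrees with the usual arithmetic. The Python sketch below is illustrative; the function name and the rounding to six decimal places are assumptions, not iDigBio requirements.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N, S, E or W to signed decimal degrees."""
    hemisphere = hemisphere.strip().upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    decimal = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    if hemisphere in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)


# Hypothetical locality: 29 deg 39' 6" N, 82 deg 19' 30" W.
print(dms_to_decimal(29, 39, 6, "N"))
print(dms_to_decimal(82, 19, 30, "W"))
```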
====Aggregating data within a record====
*'''dynamicProperties''': when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#dynamicProperties.
*'''measurementOrFact''': when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#MeasurementOrFact
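
A JSON value for dynamicProperties can be built with Python's standard <code>json</code> module, as in this minimal sketch. The property names and values are hypothetical and are not a required vocabulary.

```python
import json

# Hypothetical measurements; the key names are illustrative only.
properties = {"weightInGrams": 12.5, "tailLengthInMillimeters": 48}

# sort_keys gives a stable string; separators drops spaces to keep the value compact.
dynamic_properties = json.dumps(properties, sort_keys=True, separators=(",", ":"))
print(dynamic_properties)

# The stored string parses back to the same data.
restored = json.loads(dynamic_properties)
print(restored == properties)
```

When the string is written out with a CSV library such as Python's <code>csv</code> module, the embedded quotes and commas are escaped automatically.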
====GenBank and other genetic sequence references====
*'''associatedSequences'''
The researchers who use our data are especially appreciative when collections people add genetic sequence identifiers to their specimen records ('|' separated list). For details see: http://rs.tdwg.org/dwc/terms/#associatedSequences
====Collection Event====
*'''recordNumber''' or '''fieldNumber''': in our experience botanists use recordNumber and all others who have collection events use fieldNumber.
Other fields for completeness that can be configured as defaults in IPT for all records:
 
*'''basisOfRecord'''="PreservedSpecimen" or "FossilSpecimen". For details see: http://rs.tdwg.org/dwc/terms/#basisOfRecord
====Other fields for completeness that can be configured as defaults in IPT for all records====
*'''[[basisOfRecord]]'''="PreservedSpecimen" or "FossilSpecimen". HumanObservation records are out of our scope. Exceptions apply to machineObservation records. For details see: http://rs.tdwg.org/dwc/terms/#basisOfRecord
*'''type'''="Physical Object" For details see: http://rs.tdwg.org/dwc/terms/#type
*'''language'''= "en" For details see: http://rs.tdwg.org/dwc/terms/#language


====Dataset metadata (information about the dataset as a whole, better attribution)====
If you are building a Darwin Core Archive via IPT, Symbiota or some other means, be sure to include in the EML file (on the 'Project Data' tab in IPT) the project ID (the grant number) under which any of your records were created. This will greatly increase the correct and complete attribution your data gets when it is used by researchers. This information resides in the project block of the meta.eml file in the archive.
*Project ID
====Anecdotes====
Anyone considering contributing data should read these [[Data_Problems |anecdotes]]. They come from users of iDigBio's aggregated data, and reveal issues of data quality.
=== Data downloads===
When your data are downloaded by users of the portal, both the raw and indexed data are included. Citation and attribution data is also included. For details, see https://www.idigbio.org/content/understanding-idigbios-data-downloads


===Using PhyloCode nomenclature===
If you are using PhyloCode nomenclature the following fields are recommended, instead of the standard Linnaean hierarchy-based fields (i.e., family, genus, specificEpithet): <br>
If you are using PhyloCode nomenclature the following fields are recommended (in addition to scientificName), instead of the standard Linnaean hierarchy-based fields (i.e., family, genus, specificEpithet): <br>
*higherClassification: for the PhyloCode clades. The recommended best practice is to separate the terms with a vertical bar (' | ').
*taxonRemarks: to explain that you are not using Linnaean classification (http://rs.tdwg.org/dwc/terms/index.htm#taxonRemarks), and what protocol you are using, i.e., according to ....
*nomenclaturalCode: indicate the naming system you are using.


==Packaging for images / media objects==
==Packaging for images / media objects - identifiers==
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not get the usual handling.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
*Each media record should have a unique (within the dataset) identifier in the ''dcterms:identifier'' field.
*Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier
*If submitting media records with specimen data records, here are the critical fields to fill in:
*Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.
** sample of fully-populated AC record
 
***'''id (dc:identifier)''' = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
A pristine sample of a minimally-populated Audubon Core (AC) CSV published via an extension in a Darwin Core Archive:
***'''identifier (dc:identifier)''' = id of the media record - needs to be unique within Audubon Core file, is the equivalent of the occurrenceID in the occurrence file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.<pre>urn:catalog:institutionCode:collectionCode:Image:catalogNumber</pre>
 
***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG</pre>
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng
***'''providerManagedID (ac:providerManagedID)''' if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)</pre>
</pre>
Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI (if ac:accessURI is not present, dcterms:format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.
 
Another variation based on real-world data:
 
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng
</pre>
 
 
The column mappings are defined in the accompanying meta.xml.
 
 
 
If submitting media records with specimen data records, here are the critical fields to fill in:
*'''coreid''' - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples: <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre><pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre><pre>123456</pre>
*'''identifier''' ([http://purl.org/dc/terms/identifier dcterms:identifier] or [http://purl.org/dc/elements/1.1/identifier dc:identifier]) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID.  Examples: <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre>
*'''format''' ([http://purl.org/dc/elements/1.1/format dc:format]) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible). Examples: <pre>image/jpeg</pre><pre>audio/mpeg</pre>
*'''accessURI''' ([http://rs.tdwg.org/ac/terms/accessURI ac:accessURI]) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page. Examples: <pre>http://example.com/IMAGES/00000001.jpg</pre><pre>http://example.com/objects/987654321</pre>
*'''providerManagedID''' ([http://rs.tdwg.org/ac/terms/providerManagedID ac:providerManagedID]) =  (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. Examples: <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre>
 
'''Note:''' dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.  Media embedded on a webpage is considered a webpage and thus will not be treated as media.  accessURI should point to the media itself.




Line 206: Line 281:
|valign="top"|dc:type
|valign="top"|dc:type
|valign="top"| StillImage, Sound, MovingImage
|valign="top"| StillImage, Sound, MovingImage
|valign="top"|
|-
|valign="top"|dc:subtype
|valign="top"| Photograph
|valign="top"|
|valign="top"|
|-
|-
Line 256: Line 335:
All updates for iDigBio should be sent to us using the method by which you originally published your data. For most data systems, this will mean generating a whole new export of your data periodically. iDigBio will examine the new data file, and convert it into an update-only dataset on our end. For publishers using RSS feeds, we automatically harvest these updates regularly, and process them in about a week unless there are interruptions in our data ingestion workflow, such as system maintenance or your update getting stuck behind a very large ingestion run. If you remove any records from your data export, iDigBio will flag those records as deleted in our system, and remove them from our indexes, but they will still be available via our data API to those who know the identifiers of the records.
All updates for iDigBio should be sent to us using the method by which you originally published your data. For most data systems, this will mean generating a whole new export of your data periodically. iDigBio will examine the new data file, and convert it into an update-only dataset on our end. For publishers using RSS feeds, we automatically harvest these updates regularly, and process them in about a week unless there are interruptions in our data ingestion workflow, such as system maintenance or your update getting stuck behind a very large ingestion run. If you remove any records from your data export, iDigBio will flag those records as deleted in our system, and remove them from our indexes, but they will still be available via our data API to those who know the identifiers of the records.


==Instructions on changing identifiers==
==Instructions on changing identifiers (occurrenceID)==
If you have already had your data ingested by iDigBio, and you decide to reformat or replace your specimen identifiers (occurrenceIDs), and are not giving us a record identifier (recordID) with your record, you will need to add the following to your Darwin Core Archive:
If you have already had your data ingested by iDigBio, and you decide to reformat or replace your specimen identifiers (occurrenceIDs), and are not giving us a record identifier (recordID via Symbiota) with your record, you will need to add the following to your Darwin Core Archive:
* include the resource relationship extension in your archive and document the relationship using the OWL 'sameAs' relationship (http://www.w3.org/TR/owl-ref/#sameAs-def). A trivial example archive can be found at: [http://www.idigbio.org/sites/default/files/sites/default/files/DarwinCoreExamples/sameAs.zip sameAs Archive]
* include the resource relationship extension in your archive and document the relationship using the OWL 'sameAs' relationship (http://www.w3.org/TR/owl-ref/#sameAs-def). A trivial example archive can be found at: [http://www.idigbio.org/sites/default/files/sites/default/files/DarwinCoreExamples/sameAs.zip sameAs Archive]


Line 277: Line 356:


== After your data have been ingested==
== After your data have been ingested==
After your data have been ingested the first time, iDigBio staff will let you know, and give you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. The report page has several features that will help you make improvements to your data on subsequent updates:
After your data have been ingested the first time, iDigBio staff will let you know by sending you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. This report page has several features that will help you make improvements to your data on subsequent updates:
* indications of data fields that will improve searchability in the aggregate
* 'Data Corrected' tab - these are fields that have been updated in our search index layer to help search in the aggregate.
* set you up to check data use by visitors to the iDigBio portal
**At the very least, check for the geo-correctness of your georeferenced data: click on any flag to see the records that have been corrected.
* check for geo-correctness of your georeferenced data. Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
* 'Data Use' tab - these are individual reports of your data being used by visitors to the portal or by those looking for data using the API.
The tab related to 'Data Corrected' on this recordset page tells you about fields that have been improved in our search indices only, they are not changes we have made to your data.
* 'Raw' - this is handy for contrasting what you sent us versus what is in the index layer. When data are downloaded, the resulting DwC-A contains both versions.
 
Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
 
There is a lot of attention now by data providers, data aggregators and especially by data users/researchers on improving the quality of mobilized data. Each recordset that we ingest has a data quality flags report (see above example link). If the data have also been ingested by GBIF, see their documentation here: https://github.com/gbif/ipt/wiki/dataQualityChecklist#introduction. There is work on-going at TDWG to standardize what is flagged and what it is called.


==Error handling==
==Error handling==
Line 287: Line 370:
Once the evaluation is successful, the ingestion process moves from '''mobilizing''' to '''ingesting''', and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the '''mobilizing''' staff re-submit the data to the '''ingesting''' staff.
Once the evaluation is successful, the ingestion process moves from '''mobilizing''' to '''ingesting''', and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the '''mobilizing''' staff re-submit the data to the '''ingesting''' staff.
==iDigBio IPT hosting==
==iDigBio IPT hosting==
If you would like iDigBio to host your datasets, we have a GBIF registered resource here: http://ipt.idigbio.org
If you are unable to host your own IPT, please contact us to discuss available options.
 
We have a GBIF registered resource here http://ipt.idigbio.org
If this option is found to be appropriate for your data, we can also assist you in getting GBIF USGS endorsement for your data in our IPT. Please note that this endorsement is applicable to US-based datasets only.


We can also assist you in getting GBIF endorsement for your data in our IPT.
Please [https://www.idigbio.org/contact/Caitlin_%E2%80%9CCat%E2%80%9D_Chapman contact us]  if you are interested in discussing this option for your data.


When sending us CSV updates to your dataset hosted in this IPT, please send the whole dataset again, not only the updates. We do a replace records operation between the current data and the new data, rather than an append.
'''If you have data currently hosted on our IPT and need to send us updates''': when sending us CSV updates to your dataset, please send the whole dataset again, not only the updates. We do a replace records operation between the current data and the new data, rather than an append. If you are adding fields to your export, put them at the end of the columns in your CSV, because IPT is not set up to look for your mapped fields in a different order than the original.


== Sample scenarios of data transformations to prepare data for ingestion  ==
== Sample scenarios of data transformations to prepare data for ingestion  ==
Line 314: Line 401:
*https://www.idigbio.org/portal/publishers
*https://www.idigbio.org/portal/publishers
===Provider assistance===
===Provider assistance===
*[[Media:ImageIngestionCheatSheet_Sheet1.pdf| How to use the image ingestion appliance and link to specimen records : image ingestion cheatsheet]]
*<s>[[Media:ImageIngestionCheatSheet_Sheet1.pdf| How to use the image ingestion appliance and link to specimen records : image ingestion cheatsheet]]</s>
*[[Media:GUIDgeneration.pdf| How to generate a UUID GUID in an Excel spreadsheet]]
*[[Media:GUIDgeneration.pdf| How to generate a UUID GUID in an Excel spreadsheet]]
<pre>
<pre>
Line 341: Line 428:
***data successfully ingested, ready for consumption in the portal
***data successfully ingested, ready for consumption in the portal
***report sent back to data mobilizing staff
***report sent back to data mobilizing staff
***report sent to provider. Reference: [https://www.idigbio.org/portal/publishers Publishers Report]
***report sent to provider. Reference: [https://www.idigbio.org/portal/publishers iDigBio portal Publishers page]
***Redmine ticket set to Status= Closed
***Redmine ticket set to Status= Closed



Latest revision as of 15:07, 18 June 2021


Contact information

If you need assistance related to data ingestion, contact data@idigbio.org.

Data Ingestion Workflow

Audience: iDigBio data ingestion staff and data providers

This is the process description for

  • iDigBio staff to follow to assure that data are successfully and efficiently moved from data provider to the portal, available for searching.
  • Data providers to follow to assure that data are efficiently and accurately provided to the iDigBio staff.

First step to becoming a data provider

PUBLISH: Publishing your data in iDigBio is as simple as sending a personal email to data@idigbio.org to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the iDigBio portal, you would be sending a link to a Darwin Core archive on an RSS feed.

iDigBio's ingestion scripts accept specimen data and related media from any institution. If you are ready to discuss providing data to iDigBio, contact data@idigbio.org to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: Setting up an RSS feed

REGISTER: For US-based institutions, verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.

Darwin Core Validator Tools

IMPORTANT: Before you submit data to us: please ensure your data are Darwin Core (DwC) compliant. GBIF provides two useful data tools that can check data for DwC compliance.

Data Validator: https://www.gbif.org/tools/data-validator This is an in-depth tool that checks for syntactical correctness of a provided dataset as well as checking the validity of the content contained within the dataset. This tool is free to use, and requires logging in with a free GBIF account. (STRONGLY RECOMMENDED)

Darwin Core Archive Validator: https://tools.gbif.org/dwca-validator/ This is a simple tool that will check the structure of a provided Darwin Core Archive (DwC-A). This tool is less in-depth than the above Data Validator, but is also free to use, and does not require a login. (Acceptable)

The above tools can be used on any dataset, regardless of whether it is to be published on GBIF. Using these tools will not publish your data to GBIF.

Data requirements for data providers

Below are what we ask of the data to make it easily searchable in the cyberinfrastructure we provide.

There are 3 kinds of data files to submit for ingestion:

  1. specimen data with dataset metadata
  2. media data related to and attached by reference to specimen records with metadata (use of dwc:associatedMedia in the occurrence/specimen data file is not viewed as sending media)
  3. media files - e.g., non-archival .jpgs (see acceptable format here: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1)

Packaging for specimen data

In order of preference:

  1. DwC-A (Darwin Core Archive) produced by IPT or Symbiota (both of which expose the published archive on an RSS feed). IPT is available at: http://www.gbif.org/ipt. Symbiota is available at: http://symbiota.org. Providers are encouraged to use the most current version of IPT (v. 2.3 or later). Recent versions of IPT support the Audubon Core extension for media and provide improved levels of data checking (such as enforcing unique occurrenceIDs), bugfixes, etc. Providers choosing Symbiota should make contact with the Symbiota Working Group.
  2. Custom RSS feed with DwC-A following the guidance at: iDigBio RSS specification
  3. Custom CSV or TXT (save the data in UTF-8 format to preserve diacritics in people and place names). This option is for sending only specimen data or only media data (DwC-A packaging is required when sending both specimen and media data).
  • A custom CSV allows providers to send data beyond standards such as Dublin Core and Darwin Core. For example, providers can send tribe taxonomic information in the field "idigbio:tribe". When creating additional fields, use field names that follow DwC format (camel case) and consult the MISC field names (local iDigBio extensions to DwC). The host association terms are an example of an extension found in the MISC. Use the XML style field names that include the domain of the schema, e.g., dwc:termName, ac:termName. Non-standard field names are indexed and available through the search API.

No support for DiGIR

We do not support DiGIR-based datasets, as it is an older, unsupported technology and likely to be deprecated by GBIF.

Special note to data aggregators

Note to aggregated data providers (e.g., California Consortium of Herbaria (CCH), Calbug, Tri-Trophic TCN (TTD), Consortium of Pacific Northwest Herbaria (CPNW)):

When providing us access to your data, we highly encourage you to provide your aggregated data one provider at a time, each in their own Darwin Core archive. Each dataset should be paired with a separate EML file that includes the metadata about the dataset (such as a list of contacts). iDigBio is moving towards providing data quality feedback, data correction, annotations, and other value-added information back to the providers and thus we want individual contact information for each source provider where possible. The hope is that the information could be re-integrated at the source so that higher quality data would be in place for the provider as well as be available to downstream data consumers such as iDigBio and GBIF.

However, if that is not possible or desirable, we still welcome your aggregated data as one monolith.

In the interest of people/researchers using your data in the aggregate, e.g., EOL, we encourage you to homogenize the rights information you provide. We recommend CC0 (http://creativecommons.org/publicdomain/zero/1.0/). Further info about Creative Commons licenses is below, under the 'providing media' section.

Note on Sensitive Data/Endangered Species Data

It is the responsibility of the data provider to obfuscate, mask, or exclude data related to sensitive/endangered species. If any of these data are included in a dataset, iDigBio will not take measures to exclude them. Providers may use dwc:InformationWithheld to indicate records that have sensitive information withheld; see an example record here.

Note on Federal Data

iDigBio is actively avoiding the ingestion of federal recordsets into the portal. However, if federal records happen to be included with a valid non-federal specimen records dataset, iDigBio will not take any measures to filter or exclude the federal records.

Sending data to iDigBio

  • An RSS feed to a DwC-A for ready access and update is our preference
  • Email the files to us for installation in our IPT for mobilization into a DwC-A, but only after we have had a discussion.

Specimen metadata - GUIDs / identifiers (occurrenceID)

  • Each specimen record should have a unique (within the dataset) identifier in the dwc:occurrenceID field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected.
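
Because duplicate identifiers are the most common cause of rejection, it may be worth checking an export before sending it. The following is a minimal sketch, not iDigBio's ingestion code: it assumes the occurrence file is a CSV whose identifier column is headed "occurrenceID" (in a real archive the column mapping comes from meta.xml), and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_occurrence_ids(csv_text, id_column="occurrenceID"):
    """Return a dict of identifier -> count for every identifier seen more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column].strip() for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


# Illustrative data only; the header name is an assumption.
sample = (
    "occurrenceID,scientificName\n"
    "urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479,Aus bus\n"
    "urn:catalog:TNHC:Herpetology:122,Aus cus\n"
    "urn:catalog:TNHC:Herpetology:122,Aus dus\n"
)
duplicates = find_duplicate_occurrence_ids(sample)
print(duplicates)
```

An empty result means every identifier in the file is unique within the dataset.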

Identifier recommendations:

  • a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)
urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479
  • a simple / bare UUID:
f47ac10b-58cc-4372-a567-0e02b2c3d479

If not GUIDs or specifically UUIDs, identifiers commonly used in the past are what is typically called the DwC (Darwin Core) triplet. This form of identifier is falling out of favor with aggregators such as GBIF:

<dwc:institutionCode>:<dwc:collectionCode>:<dwc:catalogNumber>

example with a prefix (lowercase is preferred in the prefix):

urn:catalog:TNHC:Herpetology:122

Spaces embedded within the identifier string are discouraged as are bare incrementing integers.
Further examples include:

  • an Archival Resource Key (ARK):
ark:/87286/f47ac10b-58cc-4372-a567-0e02b2c3d479

UUID: We recommend uuid-4 (122 bits of total randomness) for our identifiers. There are use cases for the other versions, but 4 is typically the best when you don't care about tracking machine origin and timestamp information and simply want strong uniqueness guarantees.
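
As an illustration of the recommendation above, a version 4 UUID in URI syntax can be produced with Python's standard uuid module. This is only a sketch; the function name is hypothetical. An identifier should be generated once per record and then stored with that record, not regenerated on each export, or it will not be persistent.

```python
import uuid


def new_occurrence_id():
    """Return a random (version 4) UUID in URI syntax with a lowercase prefix."""
    return "urn:uuid:" + str(uuid.uuid4())


occurrence_id = new_occurrence_id()
# uuid.UUID accepts the urn:uuid: prefix, so the value can be parsed back for checking.
parsed = uuid.UUID(occurrence_id)
print(occurrence_id, parsed.version)
```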

Complete attribution and licensing

In order for each provider's data to be correctly attributed when found on the iDigBio portal, the following are important to complete:

  • Fill in your official institution code (dwc:institutionCode) and collection code (dwc:collectionCode)
  • Enter the GRBio Cool URI for institution and collection in the dwc:institutionID and dwc:collectionID fields
  • Fill in the global-to-the-dataset DwC fields (in the EML file in your Darwin Core archive) for intellectual property and licensing
dcterms:rights

Several examples of the use of public domain, recommended for specimen data:

dc:rights = Public Domain
dcterms:rights = http://creativecommons.org/publicdomain/zero/1.0/
dcterms:rights = http://creativecommons.org/publicdomain/mark/1.0/
Use a rights statement (IP or otherwise) chosen from the Creative Commons options; CC0 is recommended. All rights or license information provided with the dataset will appear in the iDigBio portal with each record it covers.

Several more examples of the use of public domain, recommended for specimen data:

xmpRights:webStatement = http://creativecommons.org/publicdomain/mark/1.0/
xmpRights:owner = Public Domain
dcterms:bibliographicCitation
Ctenomys sociabilis (MVZ 165861) for the correct attribution string for each record.
dcterms:rightsHolder
you should fill in this field if you filled in dcterms:rights. It completes who precisely owns the data rights and will assure proper and correct attribution.
dcterms:rightsHolder = University of Florida, Florida Museum of Natural History
dcterms:accessRights
is where the precise terms of use should be placed, things such as: '...you have to attribute us or provide us with a final copy of a given product'. It will be blank unless the provider has entered content at the source.

Some further guidance on this subject: when you are completing the metadata in the IPT, under Additional Metadata, it is important to consider the licensing and rights that you may wish to publish the data under. There are a couple of interesting articles describing the reasoning behind the Creative Commons licenses, http://creativecommons.org/licenses/, at the following URLs:

It may also be useful to read the Creative Commons Wiki on using Creative Commons licenses on data: http://wiki.creativecommons.org/Data (ref D. Bloom)

As a last word on the subject of 'Attribution': in the Project Information -> funding section of IPT, you should put information about the grants you received to fund digitization. The IPT dialog will guide you for pertinent information.

Further guidance:

Permission to ingest

  • the provider needs to have permission to submit their data

Symbiota - GBIF - iDigBio field differences

Neil Cobb, PI of SCAN and LepNet TCNs contributed this handy explanation of field use and differences between Symbiota, GBIF and iDigBio

Data recommendations for optimal searchability and applicability in the aggregate

We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an index to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The scientific name is matched to the GBIF backbone taxonomy to correct typos and older names. See the "Backbone matching" section of GBIF Data Processing for information on GBIF's methods. When iDigBio finds an exact or fuzzy match in the GBIF backbone, the matched info is used as the authority to fill in and regularize the taxonomic information in the indexed version of the specimen record. Kingdom, when provided, is used to stop shifting to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date. We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields (occurrenceID, institutionCode, scientificName, kingdom, taxonRank, basisOfRecord), but we strongly recommend that you address as many of these fields below as possible. See below for further 'Taxonomy' information.

Data stewardship / ownership

  • institutionCode and ownerInstitutionCode: we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
  • collectionCode: this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).

Taxonomy

  • scientificName: combine taxon ranks into the identification value, include author and year if applicable.
  • genus, specificEpithet, infraspecificEpithet & taxonRank: parse taxon ranks.
    • Note: if the identification is something like Aeus sp., the taxonRank=genus.
    • Note: the value of taxonRank must be a rank that is a Darwin Core term. Many super/sub/infra ranks are not valid in this case. Put them instead into the higherClassification amalgamated string.
  • kingdom: include kingdom and other high level ranks (phylum/division, class, and order where applicable) to assure that the indexing layer will remain faithful to your data as ingested. Our data quality flags will indicate when any of the original ranks in the data do not match the taxon names in the GBIF backbone.
  • family: include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the GBIF backbone taxonomy. Higher taxonomy is NOT intuited in the case where the DwC identification history extension is included in your archive.
  • higherClassification: include parsed higher taxonomy classification, at least kingdom and family, and the intervening ranks if possible. For details see: http://rs.tdwg.org/dwc/terms/#higherClassification.
  • nomenclaturalCode: very important when not ICBN or ICZN, e.g., using Phylocode
  • vernacularName: include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName

Measurements and dates

  • eventDate: put dates in ISO 8601 format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate. If you have any legitimate parts of the eventDate field, parse them out into the numeric individual day, month and year fields.
  • Meters: put elevation in METERS units in the elevation field without the units (e.g., the fields dwc:minimumElevationInMeters and dwc:maximumElevationInMeters already assume the numeric values are in meters, do not include the units with the data).
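
The date guidance above can be sketched as a small helper that builds eventDate together with the numeric year, month and day fields. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and writing a partial date as YYYY-MM or YYYY is an assumption based on ISO 8601's reduced-precision forms, not something this page specifies.

```python
from datetime import date


def event_date_fields(year, month=None, day=None):
    """Build eventDate plus the numeric year, month and day fields.

    Unknown parts are left as empty strings rather than placeholder values.
    """
    if month is None:
        day = None  # a day without a month cannot be placed in eventDate
    if month is not None and day is not None:
        # date() raises ValueError for impossible dates such as 2014-02-30
        event_date = date(year, month, day).isoformat()
    elif month is not None:
        event_date = f"{year:04d}-{month:02d}"
    else:
        event_date = f"{year:04d}"
    return {
        "eventDate": event_date,
        "year": str(year),
        "month": "" if month is None else str(month),
        "day": "" if day is None else str(day),
    }


full = event_date_fields(2014, 6, 22)
year_only = event_date_fields(2014)
print(full["eventDate"], year_only["eventDate"])
```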

Data tics

  • Escapes: do not use unescaped newline or tab characters in text fields.
  • Data uncertainty: use the remarks fields to express doubt or missing values in data. Using '?' is not a helpful value, and cannot be searched for.
  • No '0': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000', 'unknown' and any other placeholder value.
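
One way to apply the placeholder guidance before export is to blank those values in the fields where they can only mean "no value". The sketch below is a hedged example, not a prescribed procedure: the function name and the field list are hypothetical, and the placeholder set simply repeats the values named above. Because '0' can be a legitimate value in some fields, the check is applied only to fields the caller names.

```python
PLACEHOLDERS = {"0", "?", "na", "00/00/0000", "unknown"}


def blank_placeholders(record, fields):
    """Return a copy of record with placeholder values in the named fields set to ''."""
    cleaned = dict(record)
    for name in fields:
        value = cleaned.get(name, "")
        if value.strip().lower() in PLACEHOLDERS:
            cleaned[name] = ""
    return cleaned


# Illustrative record; catalogNumber is not checked, so its '0' is kept.
raw = {
    "catalogNumber": "0",
    "decimalLatitude": "0",
    "decimalLongitude": "0",
    "eventDate": "00/00/0000",
    "recordedBy": "Unknown",
}
cleaned = blank_placeholders(
    raw, ["decimalLatitude", "decimalLongitude", "eventDate", "recordedBy"]
)
print(cleaned)
```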

Geolocation

Aggregating data within a record

GenBank and other genetic sequence references

  • associatedSequences

The researchers who use our data are especially appreciative when collections people add genetic sequence identifiers to their specimen records ('|' separated list). For details see: http://rs.tdwg.org/dwc/terms/#associatedSequences

Collection Event

  • recordNumber or fieldNumber: in our experience botanists use recordNumber and all others who have collection events use fieldNumber.

Other fields for completeness that can be configured as defaults in IPT for all records

Dataset metadata (information about the dataset as a whole, better attribution)

If you are building a Darwin Core Archive via IPT, Symbiota or some other means, be sure to include project ID (the grant number) in the EML file (on the 'Project Data' tab in IPT) that any of your records were created with. This will greatly increase the correct and complete attribution your data gets when it is used by researchers. This information resides in the project block of the meta.eml file in the archive.

  • Project ID

Anecdotes

Anyone considering contributing data should read these anecdotes. They come from users of iDigBio's aggregated data, and reveal issues of data quality.

Data downloads

When your data are downloaded by users of the portal, both the raw and indexed data are included. Citation and attribution data is also included. For details, see https://www.idigbio.org/content/understanding-idigbios-data-downloads

Using PhyloCode nomenclature

If you are using PhyloCode nomenclature the following fields are recommended (in addition to scientificName), instead of the standard Linnaean hierarchy-based fields (i.e., family, genus, specificEpithet):

  • higherClassification: for the PhyloCode clades. The recommended best practice is to separate the terms with a vertical bar (' | ').
  • taxonRemarks: to explain that you are not using Linnaean classification (http://rs.tdwg.org/dwc/terms/index.htm#taxonRemarks), and what protocol you are using, i.e., according to ....
  • nomenclaturalCode: indicate the naming system you are using.

Packaging for images / media objects - identifiers

Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.

  • Firstly, adding a field in the occurrence file for associatedMedia is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
  • Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier
  • Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.

A pristine sample of a minimally-populated Audubon Core (AC) CSV published via an extension in a Darwin Core Archive:

coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng

Another variation based on real-world data:

coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng


The column mappings are defined in the accompanying meta.xml.
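
Before packaging, it can help to confirm that every media record has its own identifier. The sketch below reads the first sample above and reports any identifier that appears more than once. It assumes the CSV header row names the identifier column "identifier"; in a real archive the meta.xml, not the header, defines the columns. The function name is hypothetical.

```python
import csv
import io
from collections import Counter

# The first sample row from this page, reproduced as a string.
SAMPLE = (
    "coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage\n"
    "e24899c2-f13a-4d51-8733-bdf666b390d9,"
    "urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,"
    "http://example.com/00000001,"
    "https://creativecommons.org/publicdomain/zero/1.0/,"
    "Museum of the USA,John Smith,eng\n"
)


def duplicate_identifiers(csv_text, column="identifier"):
    """Return the sorted values of `column` that occur on more than one row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


print(duplicate_identifiers(SAMPLE))
```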


If submitting media records with specimen data records, here are the critical fields to fill in:

  • coreid - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples:
    urn:catalog:institutionCode:collectionCode:catalogNumber
    urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4
    123456
  • identifier (dcterms:identifier or dc:identifier) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID. Examples:
    urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08
  • format (dc:format) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible). Examples:
    image/jpeg
    audio/mpeg
  • accessURI (ac:accessURI) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page. Examples:
    http://example.com/IMAGES/00000001.jpg
    http://example.com/objects/987654321
  • providerManagedID (ac:providerManagedID) = (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. Examples:
    urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI (if ac:accessURI is not present, dcterms:format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded in a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself.
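
One way to check the format rule yourself is to compare the declared format with the Content-Type header your server returns for the accessURI. The sketch below does only the comparison; how the header is obtained (for example, with an HTTP HEAD request) is left out, and the function name is hypothetical. It does not describe iDigBio's own validation.

```python
def format_matches(declared_format, content_type_header):
    """Compare a declared media type with an HTTP Content-Type header value.

    Parameters after ';' (such as charset) and letter case are ignored.
    """
    served = content_type_header.split(";", 1)[0].strip().lower()
    return declared_format.strip().lower() == served


# A JPEG served as a JPEG matches; a web page wrapping the image does not.
print(format_matches("image/jpeg", "image/jpeg"))
print(format_matches("image/jpeg", "text/html; charset=utf-8"))
```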


Here are further recommended fields to fill in:

AC Term | Sample data | Notes
ac:associatedSpecimenReference | 0e1e12ed-2261-42db-8719-ee98532dab06 | A reference to a specimen associated with this resource.
dc:rights or dcterms:rights | dc:rights - "CC BY-NC"; dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/ | preferred - dcterms:rights
ac:licenseLogoURL | http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png |
xmpRights:Owner | New York Botanical Garden | A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
dc:creator | "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden" | The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
dc:type | StillImage, Sound, MovingImage |
dc:subtype | Photograph |
dcterms:title | herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes |

  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media record. The more you can flesh out the details of the image, the more likely it will be to be highly retrievable. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
  • License: As with catalog records, media records must be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is acceptable; ND (no derivatives) is not, and media licensed with ND will be rejected.
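
The license rule can be sketched as a small check on a Creative Commons license URL. This is an illustration under stated assumptions, not iDigBio's ingestion code: the function name is hypothetical, only creativecommons.org URLs are recognized, and short codes such as "CC0" are treated as unrecognized by this sketch.

```python
from urllib.parse import urlparse

ALLOWED_ELEMENTS = {"by", "nc", "sa"}


def license_is_acceptable(rights_url):
    """Return True for CC public-domain URLs and BY/NC/SA combinations."""
    parsed = urlparse(rights_url)
    if parsed.netloc.lower() not in ("creativecommons.org", "www.creativecommons.org"):
        return False
    parts = [p for p in parsed.path.lower().split("/") if p]
    if len(parts) >= 2 and parts[0] == "publicdomain":
        return True
    if len(parts) >= 2 and parts[0] == "licenses":
        elements = set(parts[1].split("-"))
        # ND (or any other element) falls outside the allowed set.
        return "by" in elements and elements <= ALLOWED_ELEMENTS
    return False


print(license_is_acceptable("http://creativecommons.org/licenses/by-nc/4.0/"))
print(license_is_acceptable("http://creativecommons.org/licenses/by-nd/4.0/"))
```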

Possible licenses:

If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. (No eml.xml is required; contact info and the recordset description can be sent by email.)
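
If writing meta.xml by hand is error-prone, a short script can generate it. The following is a minimal sketch, assuming a comma-separated core file named occurrence.csv and an Audubon Core extension file named multimedia.csv, each with one header line and the linking id in column 0. The file names, the chosen terms, and the function names are assumptions for illustration; adjust them to match your own files.

```python
import xml.etree.ElementTree as ET

OCCURRENCE = "http://rs.tdwg.org/dwc/terms/Occurrence"
MULTIMEDIA = "http://rs.tdwg.org/ac/terms/Multimedia"


def _table(parent, tag, id_tag, row_type, location, terms):
    """Add a <core> or <extension> element describing one CSV file."""
    table = ET.SubElement(parent, tag, {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",  # written literally as \n in the XML
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": row_type,
    })
    files = ET.SubElement(table, "files")
    ET.SubElement(files, "location").text = location
    # Column 0 holds the id (core) or coreid (extension).
    ET.SubElement(table, id_tag, {"index": "0"})
    for index, term in enumerate(terms, start=1):
        ET.SubElement(table, "field", {"index": str(index), "term": term})
    return table


def build_meta_xml(core_terms, media_terms):
    """Return meta.xml text for one core file and one media extension."""
    archive = ET.Element("archive", {"xmlns": "http://rs.tdwg.org/dwc/text/"})
    _table(archive, "core", "id", OCCURRENCE, "occurrence.csv", core_terms)
    _table(archive, "extension", "coreid", MULTIMEDIA, "multimedia.csv", media_terms)
    return ET.tostring(archive, encoding="unicode")


meta_xml = build_meta_xml(
    core_terms=[
        "http://rs.tdwg.org/dwc/terms/occurrenceID",
        "http://rs.tdwg.org/dwc/terms/scientificName",
    ],
    media_terms=[
        "http://purl.org/dc/terms/identifier",
        "http://rs.tdwg.org/ac/terms/accessURI",
    ],
)
print(meta_xml)
```

The terms are listed in the same order as the columns that follow column 0 in each CSV file.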

Best practice for getting Audubon Core images linked to specimen records - special cases

Relationship | Supported by | Core Type | Extensions
One-specimen-record-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core
Many-specimen-records-to-one-media file | IPT 2.2/Custom DwC-A | Audubon Core | Specimen (DwC)
Many-specimen-records-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core + Relationship

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier

Sending updates to iDigBio

All updates for iDigBio should be sent to us using the method by which you originally published your data. For most data systems, this will mean generating a whole new export of your data periodically. iDigBio will examine the new data file, and convert it into an update-only dataset on our end. For publishers using RSS feeds, we automatically harvest these updates regularly, and process them in about a week unless there are interruptions in our data ingestion workflow, such as system maintenance or your update getting stuck behind a very large ingestion run. If you remove any records from your data export, iDigBio will flag those records as deleted in our system, and remove them from our indexes, but they will still be available via our data API to those who know the identifiers of the records.
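
To preview what a full re-export will do before you send it, you can compare the identifiers in your previous export with those in the new one. The sketch below is an illustration only; it is not iDigBio's update process, and the function name is hypothetical. Identifiers in the "removed" group correspond to records that would be flagged as deleted.

```python
def classify_update(previous_ids, new_ids):
    """Group record identifiers by what happens to them in a full re-export."""
    previous, new = set(previous_ids), set(new_ids)
    return {
        "added": sorted(new - previous),
        "retained": sorted(new & previous),
        "removed": sorted(previous - new),
    }


result = classify_update(["a", "b", "c"], ["b", "c", "d"])
print(result)
```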

Instructions on changing identifiers (occurrenceID)

If you have already had your data ingested by iDigBio, and you decide to reformat or replace your specimen identifiers (occurrenceIDs), and are not giving us a record identifier (recordID via Symbiota) with your record, you will need to add the following to your Darwin Core Archive:

Non-Darwin Core Archive publishers, or providers who wish to change record identifiers, will need to contact iDigBio to facilitate the change.

Notes on getting data from EMu into a Darwin Core Archive

The cookbook recipe is provided by Larry Gall, Yale Peabody Museum.
It is straightforward to set up a feed between Axiell EMu and an IPT instance from which iDigBio can harvest. Perhaps the simplest approach is to use the scheduled operations facility in EMu to write a template that generates an output file (e.g., csv, txt) containing Darwin Core metadata to be ingested by the IPT. This output file can be produced automatically via operations at whatever frequency is desirable. Some mechanism can then be used to move the output file into a location where it is read by the IPT, either manually through the IPT UI or through a batch process. At Yale, we automate the entire workflow using cron such that 10 IPT resources get reinstantiated from EMu every day. The IPT uses MySQL as its metadata source and lives on a server separate from EMu. The output files from EMu are text files, which are scped from the EMu server to the IPT server, and used as input for daily MySQL table refreshes (truncate table xxx ; load data local infile 'yyy' into table xxx ;). In turn, the IPT is set to publish its 10 resources automatically on a daily basis.

Concern about duplicate record ingestion

Definition of a duplicate record in iDigBio: Duplicate records are two or more records in iDigBio that provide information on a single physical specimen. These records come to iDigBio from different sources. An example would be a record coming directly from the source where the physical specimen is preserved, and a copy of the information coming from an intermediary, an aggregator.

iDigBio's expectation from providers: In order to facilitate detection of duplicates, iDigBio expects providers to maintain identical globally unique identifiers (GUIDs) in the occurrenceID field. The institution holding the specimen should assign and preserve this identifier.

Detecting duplicates: Duplicates can be detected reliably ONLY if the expectation above is met. Unless consistent identifiers are present in the aggregated data, and until the community can formulate viable use cases on the desired handling of duplicate records in the portal, iDigBio does not attempt to flag these records.
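
The reason identical occurrenceID values matter can be shown with a short sketch: when the same identifier arrives from more than one source, the records can be grouped by exact match. This is an illustration of the principle, not a feature of the portal, and the function name and source labels are hypothetical.

```python
from collections import defaultdict


def find_duplicates(records):
    """Map each occurrenceID seen from more than one source to its sources.

    `records` is an iterable of (source, occurrence_id) pairs. Identifiers
    are compared exactly, so any reformatting by an intermediary hides the
    duplicate.
    """
    sources_by_id = defaultdict(set)
    for source, occurrence_id in records:
        sources_by_id[occurrence_id].add(source)
    return {
        occurrence_id: sorted(sources)
        for occurrence_id, sources in sources_by_id.items()
        if len(sources) > 1
    }


records = [
    ("institution", "urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08"),
    ("aggregator", "urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08"),
    ("institution", "urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4"),
]
print(find_duplicates(records))
```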

After your data have been ingested

After your data have been ingested the first time, iDigBio staff will let you know by sending you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. This report page has several features that will help you make improvements to your data on subsequent updates:

  • 'Data Corrected' tab - these are fields that have been updated in our search index layer to help search in the aggregate.
    • At the very least, check for the geo-correctness of your georeferenced data: click on any flag to see the records that have been corrected.
  • 'Data Use' tab - these are individual reports of your data being used by visitors to the portal or by people looking for data using the API.
  • 'Raw' - this is handy for contrasting what you sent us versus what is in the index layer. When data are downloaded, the resulting DwC-A contains both versions.

Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
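
Some transposed coordinates can be caught before export with a simple range check. The sketch below flags a pair only when the latitude is out of range and swapping the two values would give valid coordinates, so it misses transpositions where both values happen to fall within ±90. It is not the logic behind iDigBio's data quality flags, and the function name is hypothetical.

```python
def looks_transposed(latitude, longitude):
    """Return True when the values are only valid if swapped."""
    latitude_out_of_range = abs(latitude) > 90
    valid_if_swapped = abs(longitude) <= 90 and abs(latitude) <= 180
    return latitude_out_of_range and valid_if_swapped


print(looks_transposed(-122.3, 47.6))
print(looks_transposed(47.6, -122.3))
```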

Data providers, data aggregators, and especially data users and researchers are paying increasing attention to the quality of mobilized data. Each recordset that we ingest has a data quality flags report (see the example link above). If the data have also been ingested by GBIF, see their documentation here: https://github.com/gbif/ipt/wiki/dataQualityChecklist#introduction. Work is ongoing at TDWG to standardize what is flagged and what each flag is called.

Error handling

When data are received from the provider during the mobilizing process step, they are evaluated for fitness. Once the evaluation is successful, the ingestion process moves from mobilizing to ingesting, and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the mobilizing staff re-submit the data to the ingesting staff.

iDigBio IPT hosting

If you are unable to host your own IPT, please contact us to discuss available options.

We have a GBIF registered resource here http://ipt.idigbio.org

If this option is found to be appropriate for your data, we can also assist you in getting GBIF USGS endorsement for your data in our IPT. Please note that this endorsement is applicable to US-based datasets only.

Please contact us if you are interested in discussing this option for your data.

If you have data currently hosted on our IPT and need to send us updates: when sending us CSV updates to your dataset, please send the whole dataset again, not only the updates. We do a replace-records operation between the current data and the new data, rather than an append. If you are adding fields to your export, put them at the end of the columns in your CSV, because the IPT is not set up to look for your mapped fields in a different order than the original.
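
A quick way to confirm that new fields were added only at the end is to compare the header row of the new export with the original one. This is a minimal sketch with a hypothetical function name; the column names in the example are placeholders.

```python
def header_is_compatible(original_header, new_header):
    """Return True when new_header starts with every original column, in order."""
    original = list(original_header)
    return list(new_header)[:len(original)] == original


original = ["id", "catalogNumber", "scientificName"]
print(header_is_compatible(original, original + ["recordedBy"]))
print(header_is_compatible(original, ["id", "recordedBy", "catalogNumber", "scientificName"]))
```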

Sample scenarios of data transformations to prepare data for ingestion

Advertising your data on iDigBio on your website

We encourage you to post a link on your institution's website informing users that they will also find your data on iDigBio's portal.

Please look here for logo material: https://www.idigbio.org/wiki/index.php/IDigBio_Logo

and consider making the link point to your publisher's page, something like:

https://www.idigbio.org/portal/recordsets/c50755ff-ca6d-4903-8e39-8b0e236c324f

where the UUID on the end of this link belongs to your recordset. The link to your recordset can be found here: iDigBio publishers

Additional references

If you want to learn about acceptable Creative Commons licenses in iDigBio:

Data ingestion report, progress so far

Provider assistance

=LOWER(CONCATENATE("urn:uuid:",DEC2HEX(RANDBETWEEN(0,4294967295),8),"-",DEC2HEX(RANDBETWEEN(0,65535),4),"-",DEC2HEX(RANDBETWEEN(16384,20479),4),"-",DEC2HEX(RANDBETWEEN(32768,49151),4),"-",DEC2HEX(RANDBETWEEN(0,65535),4),DEC2HEX(RANDBETWEEN(0,4294967295),8)))
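
The spreadsheet formula above assembles a lowercase "urn:uuid:" value from random hexadecimal groups, with the third group fixed to the 4000 to 4fff range and the fourth to 8000 to bfff, which is the layout of a random (version 4) UUID. If you prefer a script, the Python standard library produces the same kind of identifier; the function name below is hypothetical.

```python
import uuid


def new_urn_uuid():
    """Return a random version-4 UUID in urn:uuid: form."""
    return uuid.uuid4().urn


print(new_urn_uuid())
```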

Process terminology for iDigBio mobilization and ingestion staff

Processing steps: each step has a start and an end; reaching the end signifies that the process has moved to the next step.

IngestionProcess.gif
  • negotiating - the process of determining provider's interest in data ingestion
    • begins with an email inviting providers (institutions, aggregators) to send their data to the iDigBio specimen data portal
    • open a Redmine ticket in project=Data Mobilizing
    • ends with data exported by provider, ready for inspection and ingestion.
  • mobilizing - the process of evaluating data being fit for ingestion
    • begins with provider exported data and cursory inspection
    • fill in this table with provider info: eml.xml, unless there is a good eml.xml file available (e.g., from a DwC Archive)
    • ends with data passing inspection and passing to ingesting state, Redmine ticket changes to assignee=cyberinfrastructure team
  • ingesting - the process of ingesting provider's data
    • begins with Redmine ticket change to assignee=cyberinfrastructure team
    • ends with
      • data successfully ingested, ready for consumption in the portal
      • report sent back to data mobilizing staff
      • report sent to provider. Reference: iDigBio portal Publishers page
      • Redmine ticket set to Status= Closed
  • evaluating - the process of evaluating a failure to be ingested
    • begins with ingestion failure
      • evaluate ingestion failure, if data error - send it back to mobilizing state for corrections or
      • evaluate ingestion failure, if ingestion error - make corrections
    • ends with data re-submission to ingesting state
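
The steps above can be read as a small state machine. The sketch below encodes the transitions as described in the list; the terminal state name "closed" (for a Redmine ticket set to Status=Closed) and the function name are assumptions made for illustration, and this is not code used by iDigBio staff.

```python
# Allowed moves between processing steps, taken from the list above.
TRANSITIONS = {
    "negotiating": {"mobilizing"},
    "mobilizing": {"ingesting"},
    "ingesting": {"closed", "evaluating"},    # success or ingestion failure
    "evaluating": {"mobilizing", "ingesting"},  # data error or script error fixed
    "closed": set(),
}


def advance(state, next_state):
    """Return next_state if the move is allowed, otherwise raise ValueError."""
    if next_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state!r} to {next_state!r}")
    return next_state


# A recordset that fails once because of a data error, then succeeds.
state = "negotiating"
for step in ["mobilizing", "ingesting", "evaluating", "mobilizing", "ingesting", "closed"]:
    state = advance(state, step)
print(state)
```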