Data Ingestion Guidance: Difference between revisions

(30 intermediate revisions by 2 users not shown)
Line 12: Line 12:


== First step to becoming a data provider  ==
== First step to becoming a data provider  ==
Sending your data to iDigBio is as simple as sending an email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have.
'''PUBLISH''': Publishing your data in iDigBio is as simple as sending a personal email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the iDigBio portal, you would be sending a link to a Darwin Core Archive on an RSS feed.


iDigBio accepts specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
iDigBio's ingestion scripts accept specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]


Verify that your institution and collection is correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
'''REGISTER''': For US-based institutions, verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.


== Data requirements for data providers  ==
== Data requirements for data providers  ==
Line 52: Line 52:
===Sending data to iDigBio===
===Sending data to iDigBio===
*An [[CYWG iDigBio DwC-A Pull Ingestion|RSS feed]] to a DwC-A for ready access and update is our preference
*An [[CYWG iDigBio DwC-A Pull Ingestion|RSS feed]] to a DwC-A for ready access and update is our preference
*Email the files to us
*Email the files to us for installation in our IPT for mobilization into a DwC-A, but only after we have had a discussion.


===Specimen metadata - GUIDs / identifiers (occurrenceID)===
===Specimen metadata - GUIDs / identifiers (occurrenceID)===
*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected. Identifiers recommendations:
*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected.  
 
Identifier recommendations:


*a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)<br>
*a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)<br>
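
The identifier guidance above can be checked before a dataset is offered for ingestion. The following is a minimal sketch, not iDigBio's own ingestion code: the function names <code>new_occurrence_id</code> and <code>duplicate_ids</code> are hypothetical, and it assumes the <code>urn:uuid:</code> prefix is the URI syntax meant here.

```python
import uuid
from collections import Counter


def new_occurrence_id():
    """Return a random UUID in URI syntax with a lowercase prefix (assumed form)."""
    return f"urn:uuid:{uuid.uuid4()}"


def duplicate_ids(ids):
    """Return the identifiers that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [identifier for identifier in counts if counts[identifier] > 1]
```

Running <code>duplicate_ids</code> over the ''dwc:occurrenceID'' column and getting an empty list back suggests that no records would be rejected for duplicate identifiers.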
Line 131: Line 133:
===Permission to ingest===
===Permission to ingest===
*the provider needs to have permission to submit their data
*the provider needs to have permission to submit their data
===Symbiota - GBIF - iDigBio field differences===
Neil Cobb, PI of the SCAN and LepNet TCNs, contributed this handy [https://www.idigbio.org//sites/default/files/sites/default/files/Symbiota_Fields_DwC_GBIF_iDigBio-2.xlsx explanation] of field use and differences between Symbiota, GBIF and iDigBio.


===Data recommendations for optimal searchability and applicability in the aggregate===
===Data recommendations for optimal searchability and applicability in the aggregate===
Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched to the GBIF backbone [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] and when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal specimen record.
We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an '''index''' to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The ''scientific name'' is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] to correct typos and older names. When an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the indexed version of the specimen record. ''Kingdom'', when provided, is used to prevent a record from shifting to a different kingdom when the given rank and scientific name would otherwise force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date.
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further 'taxonomy' information.
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further '[https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Taxonomy Taxonomy]' information.
====Data ownership====
====Data stewardship / ownership====
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four character year. e.g., http://rs.tdwg.org/dwc/terms/#eventDate.
*'''Meters''': put elevation in METERS units in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, do not include the units with the data).
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. '?' is not a helpful value and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.


====Taxonomy====
====Taxonomy====
Line 158: Line 153:
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate. If you have any legitimate parts of the eventDate field, parse them out into the numeric individual ''day'', ''month'' and ''year'' fields.
*'''Meters''': put elevation in METERS units in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, do not include the units with the data).
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. '?' is not a helpful value and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000', 'unknown' and any other placeholder value.
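
The date and placeholder recommendations above can be applied in an export script. The following is a minimal sketch under stated assumptions, not part of iDigBio's tooling: the names <code>clean_value</code> and <code>split_event_date</code> are hypothetical, replacing newlines and tabs with a single space is one possible way to avoid unescaped characters, and only the date shapes YYYY, YYYY-MM and YYYY-MM-DD are handled.

```python
import re
from datetime import date

# Placeholder strings named in the guidance above, compared case-insensitively.
PLACEHOLDERS = {"0", "?", "na", "unknown", "00/00/0000"}
EMPTY_PARTS = {"year": "", "month": "", "day": ""}


def clean_value(value):
    """Collapse newlines/tabs to single spaces and blank out placeholder values."""
    text = re.sub(r"\s+", " ", value).strip()
    return "" if text.lower() in PLACEHOLDERS else text


def split_event_date(event_date):
    """Split YYYY, YYYY-MM or YYYY-MM-DD into numeric year/month/day strings."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?", event_date.strip())
    if match is None:
        return dict(EMPTY_PARTS)
    year, month, day = match.groups()
    try:
        # Missing parts default to 1 so that only the supplied parts are validated.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return dict(EMPTY_PARTS)
    return {
        "year": str(int(year)),
        "month": str(int(month)) if month else "",
        "day": str(int(day)) if day else "",
    }
```

Because <code>clean_value</code> also blanks a bare '0', it should only be applied to fields where '0' stands for "no value"; a legitimate zero (for example, an elevation at sea level) would otherwise be lost.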


====Geolocation====
====Geolocation====
Line 199: Line 203:
==Packaging for images / media objects - identifiers==
==Packaging for images / media objects - identifiers==
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not get the usual handling.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
*Each media record should have a unique (within the dataset) identifier in the ''dc:identifier'' field.
*Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier
*If submitting media records with specimen data records, here are the critical fields to fill in:
*Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.
** sample of fully-populated AC record
 
***'''id (dc:identifier)''' = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
A pristine sample of a minimally-populated Audubon Core (AC) CSV published via an extension in a Darwin Core Archive:
***'''identifier (dc:identifier)''' = id of the media record - needs to be persistent and unique within Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre><pre>123456</pre>
 
***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://example.com/IMAGES/00000001.jpg</pre>
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng
***'''providerManagedID (ac:providerManagedID)''' if you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)</pre>
</pre>
Note: dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.
 
Another variation based on real-world data:
 
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng
</pre>
 
 
The column mappings are defined in the accompanying meta.xml.
 
 
 
If submitting media records with specimen data records, here are the critical fields to fill in:
*'''coreid''' - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples: <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre><pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre><pre>123456</pre>
*'''identifier''' ([http://purl.org/dc/terms/identifier dcterms:identifier] or [http://purl.org/dc/elements/1.1/identifier dc:identifier]) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID.  Examples: <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre>
*'''format''' ([http://purl.org/dc/elements/1.1/format dc:format]) = Media Type / MIME Type (from the http://www.iana.org/assignments/media-types/media-types.xhtml controlled vocabulary if possible). Examples: <pre>image/jpeg</pre><pre>audio/mpeg</pre>
*'''accessURI''' ([http://rs.tdwg.org/ac/terms/accessURI ac:accessURI]) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page. Examples: <pre>http://example.com/IMAGES/00000001.jpg</pre><pre>http://example.com/objects/987654321</pre>
*'''providerManagedID''' ([http://rs.tdwg.org/ac/terms/providerManagedID ac:providerManagedID]) =  (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. Examples: <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre>
 
'''Note:''' dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded on a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself.
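
The critical media fields listed above can be checked offline before publishing. The following is a minimal sketch, not iDigBio's validation logic: the function name <code>check_media_rows</code> and its messages are hypothetical, and it assumes the CSV column headers match the samples shown above (in a real archive the mappings come from meta.xml). It does not fetch accessURI, so it cannot confirm that the returned object matches the declared format.

```python
import csv
import io
import re

REQUIRED = ("coreid", "identifier", "format", "accessURI")
# Loose "type/subtype" shape check; this is not a lookup against the IANA registry.
MEDIA_TYPE = re.compile(r"^[a-z]+/[A-Za-z0-9.+-]+$")


def check_media_rows(csv_text):
    """Return a list of problems found in an Audubon Core extension CSV."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_no, row in enumerate(reader, start=1):
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_no}: missing {column}")
        identifier = (row.get("identifier") or "").strip()
        if identifier:
            if identifier in seen:
                problems.append(f"row {row_no}: duplicate identifier {identifier}")
            seen.add(identifier)
        media_format = (row.get("format") or "").strip()
        if media_format and not MEDIA_TYPE.match(media_format):
            problems.append(f"row {row_no}: format '{media_format}' is not a media type")
        uri = (row.get("accessURI") or "").strip()
        if uri and not uri.startswith(("http://", "https://")):
            problems.append(f"row {row_no}: accessURI is not an http link")
    return problems
```

An empty list means that none of these particular checks failed; it does not guarantee that the records will be ingested.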



