Data Ingestion Guidance: Difference between revisions

m
Reverted edits by Dstoner (talk) to last revision by Joanna
(clarifications and improvements)
m (Reverted edits by Dstoner (talk) to last revision by Joanna)
Line 12: Line 12:


== First step to becoming a data provider  ==
== First step to becoming a data provider  ==
Sending your data to iDigBio is as simple as sending an email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have.
Publishing your data in iDigBio is as simple as sending a personal email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the portal, you would send us a link to a Darwin Core archive on an RSS feed.


iDigBio accepts specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
iDigBio's ingestion scripts accept specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]


Verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
Verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
Line 133: Line 133:


===Data recommendations for optimal searchability and applicability in the aggregate===
===Data recommendations for optimal searchability and applicability in the aggregate===
Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy], and when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal specimen record.
We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an '''index layer''' to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The ''scientific name'' is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] to correct typos and older names. When an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the index layer of the specimen record. ''Kingdom'', when provided, is used to stop shifting to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date.
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible.
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further '[https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Taxonomy Taxonomy]' information.
====Data ownership====
====Data ownership====
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-digit year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate.
*'''Meters''': give elevation in meters, as a bare number without units. The fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so do not include the units with the data.
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. A '?' is not a helpful value and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.


====Taxonomy====
====Taxonomy====
Line 158: Line 149:
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-digit year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate.
*'''Meters''': give elevation in meters, as a bare number without units. The fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so do not include the units with the data.
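
The date and elevation recommendations above can be applied before export with a small normalization step. The following Python is only a sketch, not part of iDigBio's tooling: the function names, the list of accepted source date formats, the unit spellings, and the choice to treat a bare number as meters are all assumptions to adapt to your own data.

```python
import re
from datetime import datetime

# Hypothetical list of date formats found in a source database; extend as needed.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or '' when it cannot be parsed."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            # isoformat() always writes a four-digit year.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return ""

def elevation_in_meters(raw):
    """Return a bare number string in meters (no units), or '' if unparseable."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*(m|meters|ft|feet)?\s*", raw, re.IGNORECASE)
    if not match:
        return ""
    value = float(match.group(1))
    unit = (match.group(2) or "m").lower()  # assumption: no unit means meters
    if unit in ("ft", "feet"):
        value *= 0.3048  # international foot to meters
    return f"{value:g}"
```

For example, <code>to_iso_date("06/22/2014")</code> gives <code>2014-06-22</code>, and <code>elevation_in_meters("1200 m")</code> gives <code>1200</code>. Values that cannot be parsed come back empty rather than as a placeholder, so they can be described in a remarks field instead.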
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. A '?' is not a helpful value and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
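
One way to apply the three cautions above is to clean each record just before export. This Python sketch is illustrative only: the function name is hypothetical, replacing newlines and tabs with a single space is one possible treatment, and the set of fields where '0' is assumed to mean "no value" must be chosen by the provider, since '0' can be a legitimate value elsewhere.

```python
# Placeholder strings named in the guidance above.
PLACEHOLDERS = {"?", "NA", "00/00/0000"}

def clean_record(record, zero_means_missing=("decimalLatitude", "decimalLongitude")):
    """Return a copy of the record with stray whitespace and placeholders removed.

    record: dict mapping field name to string value.
    zero_means_missing: fields where a literal '0' is assumed to be a placeholder.
    """
    cleaned = {}
    for field, value in record.items():
        # Collapse newlines, tabs and repeated spaces into single spaces.
        text = " ".join(value.split())
        if text in PLACEHOLDERS:
            text = ""
        elif text == "0" and field in zero_means_missing:
            text = ""
        cleaned[field] = text
    return cleaned
```

With the defaults, a record whose ''decimalLatitude'' is '0' and whose ''recordedBy'' is '?' comes out with both fields empty, while a ''minimumElevationInMeters'' of '0' is left alone.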


====Geolocation====
====Geolocation====
Line 199: Line 199:
==Packaging for images / media objects - identifiers==
==Packaging for images / media objects - identifiers==
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage, will not get the usual handling.
*Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier
*Each media record should have a unique (within the dataset) identifier in the ''dc:identifier'' field.
*Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.
 
A pristine sample of a minimally-populated AC CSV published via an extension in a Darwin Core Archive:
 
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng
</pre>
 
Another variation based on real-world data:
 
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng
</pre>
 
 
The columns are defined in the accompanying meta.xml:
 
<pre>
TBD
</pre>
 
 
*If submitting media records with specimen data records, here are the critical fields to fill in:
*If submitting media records with specimen data records, here are the critical fields to fill in:
**'''coreid''' - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples: <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre><pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre><pre>123456</pre>
** sample of fully-populated AC record
**'''identifier''' ([http://purl.org/dc/terms/identifier dcterms:identifier] or [http://purl.org/dc/elements/1.1/identifier dc:identifier]) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID.  Examples: <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre>
***'''id (dc:identifier)''' = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
**'''format''' ([http://purl.org/dc/elements/1.1/format dc:format]) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible). Examples: <pre>image/jpeg</pre><pre>audio/mpeg</pre>
***'''identifier (dc:identifier)''' = id of the media record - needs to be persistent and unique within Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre><pre>123456</pre>
**'''accessURI''' ([http://rs.tdwg.org/ac/terms/accessURI ac:accessURI]) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page. Examples: <pre>http://example.com/IMAGES/00000001.jpg</pre><pre>http://example.com/objects/987654321</pre>
***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
**'''providerManagedID''' ([http://rs.tdwg.org/ac/terms/providerManagedID ac:providerManagedID]) =  (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. Examples: <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre>
***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://example.com/IMAGES/00000001.jpg</pre>
 
***'''providerManagedID (ac:providerManagedID)''' if you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)</pre>
'''Note:''' dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded on a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself.
Note: dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.
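
Several of the media rules above can be checked locally before publishing. The Python below is a minimal sketch under stated assumptions: it expects a multimedia CSV whose header row uses the column names from the samples above (coreid, identifier, format, accessURI), the function name and messages are hypothetical, and it does not fetch accessURI, so it cannot confirm that the resource really has the stated media type.

```python
import csv
import io

def check_media_rows(rows):
    """Return a list of (row_number, message) problems found in media rows."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        coreid = (row.get("coreid") or "").strip()
        identifier = (row.get("identifier") or "").strip()
        access_uri = (row.get("accessURI") or "").strip()
        media_format = (row.get("format") or "").strip()
        if not coreid:
            problems.append((number, "missing coreid"))
        if not identifier:
            problems.append((number, "missing identifier"))
        elif identifier in seen:
            problems.append((number, "duplicate identifier"))
        seen.add(identifier)
        # format should only be present when accessURI is present.
        if media_format and not access_uri:
            problems.append((number, "format given without accessURI"))
        # Using the media URL as the identifier is discouraged above.
        if identifier and identifier == access_uri:
            problems.append((number, "identifier is the media URL"))
    return problems

SAMPLE = """coreid,identifier,type,format,accessURI
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg
urn:catalog:MUSA:fish:124,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,
"""

problems = check_media_rows(csv.DictReader(io.StringIO(SAMPLE)))
```

Here the second row reuses the first row's identifier and states a format without an accessURI, so both problems are reported for row 2. In a real archive the column meanings come from meta.xml rather than the header row, so a production check would read the field mapping from there.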



