(clarifications and improvements) |
|||
Line 12: | Line 12: | ||
== First step to becoming a data provider == | == First step to becoming a data provider == | ||
Publishing your data in iDigBio is as simple as sending a personal email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the portal, you would be sending a link to a Darwin Core archive on an RSS feed. | |||
iDigBio accepts specimen data and related media from '''any''' institution. If | iDigBio's ingestion scripts accept specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]] | ||
Verify that your institution and collection is correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed. | Verify that your institution and collection is correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed. | ||
Line 133: | Line 133: | ||
===Data recommendations for optimal searchability and applicability in the aggregate=== | ===Data recommendations for optimal searchability and applicability in the aggregate=== | ||
We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an '''index layer''' to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The ''scientific name'' is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] to correct typos and older names. When an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the index layer of the specimen record. ''Kingdom'', when provided, is used to prevent a shift to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date. | |||
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. | We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further '[https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Taxonomy Taxonomy]' information. | ||
====Data ownership==== | ====Data ownership==== | ||
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes. | *'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes. | ||
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes). | *'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes). | ||
====Taxonomy==== | ====Taxonomy==== | ||
Line 158: | Line 149: | ||
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode | *'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode | ||
*'''vernacularName''': include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName | *'''vernacularName''': include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName | ||
====Measurements and dates==== | |||
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-digit year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate | |||
*'''Meters''': put elevation values in meters, without the units. The fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so do not include the units with the data. | |||
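
The date and elevation recommendations above can be checked before export. The following is a minimal sketch, not part of any iDigBio tooling; the function names <code>is_iso_event_date</code> and <code>parse_elevation_meters</code> are hypothetical, and the date check only covers the full YYYY-MM-DD form described here.

```python
import re
from datetime import date

# Matches the full YYYY-MM-DD form only (four-digit year required).
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Plain number, optional sign and decimal part, no unit suffix.
PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def is_iso_event_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2014-02-30
    except ValueError:
        return False
    return True


def parse_elevation_meters(value: str):
    """Return the elevation as a float, or None if the value carries units or text."""
    text = value.strip()
    if not PLAIN_NUMBER.fullmatch(text):
        return None
    return float(text)
```

A value such as <code>1250 m</code> is rejected because the elevation fields already imply meters.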
====Data tics==== | |||
*'''Escapes''': do not use unescaped newline or tab characters in text fields. | |||
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. A '?' is not a helpful value and cannot be searched for. | |||
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution also applies to '?', 'NA', '00/00/0000' and any other placeholder value. | |||
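
As an illustration of the cautions above, a pre-export cleanup step might look like the sketch below. It rests on assumptions: the helper <code>clean_value</code> and both constant sets are hypothetical, '0' is treated as a placeholder only in the coordinate fields, and embedded newlines and tabs are collapsed to a single space as one possible way to avoid unescaped characters.

```python
import re

# Placeholder strings named in the guidance above. This rule is blunt:
# "NA" is also the ISO country code for Namibia, so exempt fields as needed.
PLACEHOLDERS = {"?", "NA", "00/00/0000"}

# Assumption: '0' is only suspicious in coordinate fields; an elevation
# of 0 meters, for example, can be a legitimate value.
ZERO_SUSPECT_FIELDS = {"decimalLatitude", "decimalLongitude"}


def clean_value(field: str, value: str) -> str:
    """Blank out placeholder values and remove raw newlines and tabs."""
    text = value.strip()
    if text in PLACEHOLDERS:
        return ""
    if field in ZERO_SUSPECT_FIELDS and text in {"0", "0.0"}:
        return ""
    # Collapse any run of carriage returns, newlines, or tabs to one space.
    return re.sub(r"[\r\n\t]+", " ", text)
```

Doubt about a value would then be recorded in the relevant remarks field rather than encoded in the value itself.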
====Geolocation==== | ====Geolocation==== | ||
Line 199: | Line 199: | ||
==Packaging for images / media objects - identifiers== | ==Packaging for images / media objects - identifiers== | ||
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media. | Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media. | ||
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is the | *Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not get the usual handling. | ||
*Each media record should have a unique | *Each media record should have a unique (within the dataset) identifier in the ''dc:identifier'' field. | ||
*If submitting media records with specimen data records, here are the critical fields to fill in: | *If submitting media records with specimen data records, here are the critical fields to fill in: | ||
**''' | ** sample of fully-populated AC record | ||
**'''identifier | ***'''id (dc:identifier)''' = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre> | ||
**'''format | ***'''identifier (dc:identifier)''' = id of the media record - needs to be persistent and unique within Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media, put the least stable here and the most stable in ac:providerManagedID <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre><pre>123456</pre> | ||
**'''accessURI | ***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre> | ||
**'''providerManagedID | ***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://example.com/IMAGES/00000001.jpg</pre> | ||
***'''providerManagedID (ac:providerManagedID)''' = if you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the guid in the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4 (optional)</pre> | |||
Note: dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. | |||
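
The identifier and format rules above lend themselves to a quick self-check before submission. The sketch below assumes each media record is a plain dictionary keyed by <code>dc:identifier</code>, <code>dc:format</code> and <code>ac:accessURI</code>; the function <code>check_media_records</code> is hypothetical. It guesses the media type from the file extension only, which is a heuristic; confirming the type actually served at the accessURI would require an HTTP request.

```python
import mimetypes
from collections import Counter


def check_media_records(records):
    """Return a list of (record_index, problem) tuples for a list of media records."""
    problems = []
    # Count identifiers so duplicates within the dataset can be flagged.
    counts = Counter(r.get("dc:identifier", "") for r in records)
    for i, rec in enumerate(records):
        ident = rec.get("dc:identifier", "")
        if not ident:
            problems.append((i, "missing dc:identifier"))
        elif counts[ident] > 1:
            problems.append((i, "duplicate dc:identifier"))

        uri = rec.get("ac:accessURI", "")
        fmt = rec.get("dc:format", "")
        if uri:
            # Extension-based guess only; None when the extension is unknown.
            guessed, _ = mimetypes.guess_type(uri)
            if guessed and fmt and guessed != fmt:
                problems.append((i, "dc:format does not match accessURI extension"))
        elif fmt:
            problems.append((i, "dc:format present without ac:accessURI"))
    return problems
```

For example, a record whose format is <code>image/jpeg</code> but whose accessURI ends in <code>.html</code> would be flagged, which mirrors the requirement that the link point to the image itself and not to a web page.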