(clarifications and improvements) |
|||
Line 12: | Line 12: | ||
== First step to becoming a data provider == | == First step to becoming a data provider == | ||
Sending your data to iDigBio is as simple as sending an email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. | |||
iDigBio | iDigBio accepts specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]] | ||
Verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed. | Verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed. | ||
Line 133: | Line 133: | ||
===Data recommendations for optimal searchability and applicability in the aggregate=== | ===Data recommendations for optimal searchability and applicability in the aggregate=== | ||
Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy]; when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal specimen record. | |||
We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom | We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. | ||
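A quick pre-submission check for the fields named above can be sketched in Python. This is only an illustration, not an iDigBio tool: the function name <code>missing_required_columns</code> is hypothetical, and it assumes a flat CSV occurrence file whose header row uses the Darwin Core term names directly.

```python
import csv
import io

# Field names taken from the list in the paragraph above.
REQUIRED = ("occurrenceID", "institutionCode", "scientificName",
            "kingdom", "basisOfRecord")


def missing_required_columns(csv_text):
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED if name not in present]


sample = "occurrenceID,scientificName,kingdom\n1,Aus bus,Animalia\n"
print(missing_required_columns(sample))
```

If your columns are defined in a meta.xml rather than by header names, the same comparison would need to be made against the terms declared there instead.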
====Data ownership==== | ====Data ownership==== | ||
*'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes. | *'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes. | ||
*'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes). | *'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes). | ||
====Measurements and dates==== | |||
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-digit year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate | |||
*'''Meters''': give elevation in meters, as a bare number without units. The fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume that the numeric values are in meters, so do not include the units with the data. | |||
====Data tics==== | |||
*'''Escapes''': do not use unescaped newline or tab characters in text fields. | |||
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. '?' is not a helpful value and cannot be searched for. | |||
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. The same caution applies to '?', 'NA', '00/00/0000' and any other placeholder value. | |||
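The date, elevation, and placeholder recommendations above lend themselves to an automated check before export. The following Python sketch is one possible approach, not part of any iDigBio software. The function <code>find_problems</code> and the choice to treat '0' as suspicious only in <code>decimalLatitude</code> and <code>decimalLongitude</code> are assumptions made for illustration.

```python
import re
from datetime import date

# Placeholder values named in the recommendations above.
PLACEHOLDERS = {"?", "NA", "00/00/0000"}
# Assumption: '0' is only flagged in coordinate fields, where it usually means "no value".
ZERO_SUSPECT = {"decimalLatitude", "decimalLongitude"}
ELEVATION_FIELDS = {"minimumElevationInMeters", "maximumElevationInMeters"}

ISO_DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def is_iso_day(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DAY.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def find_problems(record):
    """Return human-readable problems found in one record (a dict of strings)."""
    problems = []
    for name, value in record.items():
        if re.search(r"[\r\n\t]", value):
            problems.append(f"{name}: raw newline or tab character")
        stripped = value.strip()
        if stripped in PLACEHOLDERS or (stripped == "0" and name in ZERO_SUSPECT):
            problems.append(f"{name}: placeholder value {stripped!r}")
            continue
        if not stripped:
            continue  # empty is the preferred way to say "no value"
        if name == "eventDate" and not is_iso_day(stripped):
            problems.append(f"{name}: not in YYYY-MM-DD format")
        if name in ELEVATION_FIELDS and not PLAIN_NUMBER.fullmatch(stripped):
            problems.append(f"{name}: expected a bare number in meters")
    return problems


record = {
    "eventDate": "06/22/2014",
    "minimumElevationInMeters": "1200 m",
    "decimalLatitude": "0",
    "locality": "Near\tcreek",
    "recordedBy": "?",
}
for problem in find_problems(record):
    print(problem)
```

The check only reports problems and leaves the data unchanged, since the right fix (for example, moving a doubt into a remarks field) is a curatorial decision.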
====Taxonomy==== | ====Taxonomy==== | ||
Line 149: | Line 158: | ||
*'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode | *'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode | ||
*'''vernacularName''': include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName | *'''vernacularName''': include common names for broader audience findability. For details see: http://rs.tdwg.org/dwc/terms/#vernacularName | ||
====Geolocation==== | ====Geolocation==== | ||
Line 199: | Line 199: | ||
==Packaging for images / media objects - identifiers== | ==Packaging for images / media objects - identifiers== | ||
Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media. | Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media. | ||
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is | *Firstly, adding a field in the occurrence file for ''associatedMedia'' is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata. | ||
*Each media record should have a unique | *Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier | ||
*Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column. | |||
A pristine sample of a minimally-populated AC CSV published via an extension in a Darwin Core Archive: | |||
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage | |||
e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng | |||
</pre> | |||
Another variation based on real-world data: | |||
<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage | |||
urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng | |||
</pre> | |||
The columns are defined in the accompanying meta.xml: | |||
<pre> | |||
TBD | |||
</pre> | |||
*If submitting media records with specimen data records, here are the critical fields to fill in: | *If submitting media records with specimen data records, here are the critical fields to fill in: | ||
**'''coreid''' - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples: <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre><pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre><pre>123456</pre> | |||
**'''identifier''' ([http://purl.org/dc/terms/identifier dcterms:identifier] or [http://purl.org/dc/elements/1.1/identifier dc:identifier]) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media record, put the least stable here and the most stable in ac:providerManagedID. Example: <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre> | |||
**'''format''' ([http://purl.org/dc/elements/1.1/format dc:format]) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible). Examples: <pre>image/jpeg</pre><pre>audio/mpeg</pre> | |||
**'''accessURI''' ([http://rs.tdwg.org/ac/terms/accessURI ac:accessURI]) = direct HTTP link to the media file. Note that the media type (format) '''must''' match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page. Examples: <pre>http://example.com/IMAGES/00000001.jpg</pre><pre>http://example.com/objects/987654321</pre> | |||
**'''providerManagedID''' ([http://rs.tdwg.org/ac/terms/providerManagedID ac:providerManagedID]) = (Optional) If you have a stable UUID (GUID) for your media records and you have populated "dc:identifier" with a different type of identifier, place the GUID in the optional ac:providerManagedID field. Example: <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre> | |||
Note: dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. | '''Note:''' dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded on a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself. | ||
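The identifier and format rules in this section can be checked offline before publishing. The Python sketch below is an illustration only: the function <code>check_media_rows</code> is hypothetical, it assumes the multimedia file has a header row using the column names from the samples above, and it only compares <code>format</code> against the media type guessed from the accessURI file extension. It does not fetch the URL, so an accessURI without a file extension is not checked.

```python
import csv
import io
import mimetypes


def check_media_rows(rows):
    """Return (row_number, problem) pairs for an iterable of media row dicts."""
    problems = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        coreid = (row.get("coreid") or "").strip()
        identifier = (row.get("identifier") or "").strip()
        media_format = (row.get("format") or "").strip()
        access_uri = (row.get("accessURI") or "").strip()

        if not coreid:
            problems.append((n, "missing coreid"))
        if not identifier:
            problems.append((n, "missing identifier"))
        elif identifier in seen:
            problems.append((n, "duplicate identifier"))
        elif identifier == access_uri:
            problems.append((n, "identifier is the media URL"))
        seen.add(identifier)

        # Guess the media type from the file extension only; None means unknown.
        guessed, _ = mimetypes.guess_type(access_uri)
        if guessed and media_format and guessed != media_format:
            problems.append((n, "format does not match accessURI extension"))
    return problems


sample = """coreid,identifier,format,accessURI
urn:catalog:MUSA:fish:123,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,image/jpeg,http://example.com/IMAGES/00000001.jpg
urn:catalog:MUSA:fish:124,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,audio/mpeg,http://example.com/IMAGES/00000002.jpg
"""
for row_number, problem in check_media_rows(csv.DictReader(io.StringIO(sample))):
    print(row_number, problem)
```

A fuller check would request each accessURI and compare the returned Content-Type header with <code>format</code>; that step is left out here to keep the example runnable without network access.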