Data Ingestion Guidance: Difference between revisions

Data Ingestion Guidance (view source)

Revision as of 15:07, 18 June 2021

6,657 bytes added, 18 June 2021

→‎Darwin Core Validator Tools

Catchapman

250

edits

@@ Line 12: / Line 12: @@
 == First step to becoming a data provider  ==
-Sending your data to iDigBio is as simple as sending an email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have.
+'''PUBLISH''': Publishing your data in iDigBio is as simple as sending a personal email to [mailto:data@idigbio.org data@idigbio.org] to say where to pick it up for ingestion. If you need help compiling it into the acceptable formats, then get in touch with us to express your interest, and we'll help with what you currently have. Establish contact first. Unless you are thinking about mobilizing your data via our IPT, no data should change hands. If you want to have your data ingested by the iDigBio portal, you would send us a link to a Darwin Core Archive on an RSS feed.
-iDigBio accepts specimen data and related media from '''any''' institution. If  you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
+iDigBio's ingestion scripts accept specimen data and related media from '''any''' institution. If you are ready to discuss providing data to iDigBio, contact [mailto:data@idigbio.org data@idigbio.org] to register your interest and begin the process of preparing your data for ingestion. If you have a Darwin Core Archive (DwC-A), getting your data ingested by iDigBio could be as easy as telling us the RSS feed address on your network. Information about setting up an RSS feed can be found here: [[CYWG_iDigBio_DwC-A_Pull_Ingestion|Setting up an RSS feed]]
-Verify that your institution and collection is correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
+'''REGISTER''': For US-based institutions, verify that your institution and collection are correct here: https://www.idigbio.org/portal/collections. Submit corrections as needed.
+== Darwin Core Validator Tools ==
+'''IMPORTANT:''' ''Before'' you submit data to us: please ensure your data are [http://rs.tdwg.org/dwc/terms/ Darwin Core] (DwC) compliant. GBIF provides two useful data tools that can check data for DwC compliance.
+'''Data Validator''': https://www.gbif.org/tools/data-validator This is an in-depth tool that checks for syntactical correctness of a provided dataset as well as checking the validity of the content contained within the dataset. This tool is free to use, and '''requires logging in with a free GBIF account'''. (STRONGLY RECOMMENDED)
+'''Darwin Core Archive Validator''': https://tools.gbif.org/dwca-validator/ This is a simple tool that will check the structure of a provided Darwin Core Archive (DwC-A). This tool is less in-depth than the above Data Validator, but is also free to use, and '''does not require a login'''. (Acceptable)
+The above tools can be used on '''any dataset''', regardless of whether it is to be published on GBIF. Using these tools will not publish your data to GBIF.
 == Data requirements for data providers  ==
@@ Line 49: / Line 59: @@
 In the interest of people/researchers using your data in the aggregate, e.g., EOL, we encourage you to homogenize the rights information you provide. We recommend CC0 (http://creativecommons.org/publicdomain/zero/1.0/). Further info about Creative Commons licenses is below, under the 'providing media' section.
+===Note on Sensitive Data/Endangered Species Data===
+It is the responsibility of the data provider to obfuscate, mask, or exclude data related to sensitive/endangered species. If any of these data are included in a dataset, iDigBio will not take measures to exclude them. Providers may use [https://terms.tdwg.org/wiki/dwc:informationWithheld dwc:informationWithheld] to indicate records that have sensitive information withheld; [http://portal.idigbio.org/portal/records/e6c5dffc-4ad1-4d9d-800f-5796baec1f65 see an example record here].
+===Note on Federal Data===
+iDigBio is actively avoiding the ingestion of federal recordsets into the portal. However, if federal records happen to be included with a valid non-federal specimen records dataset, iDigBio will not take any measures to filter or exclude the federal records.
 ===Sending data to iDigBio===
 *An [[CYWG iDigBio DwC-A Pull Ingestion|RSS feed]] to a DwC-A for ready access and update is our preference
-*Email the files to us
+*Email the files to us for installation in our IPT for mobilization into a DwC-A, but only after we have had a discussion.
 ===Specimen metadata - GUIDs / identifiers (occurrenceID)===
-*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected. Identifiers recommendations:
+*Each specimen record should have a unique (within the dataset) identifier in the ''dwc:occurrenceID'' field. When the ingestion software detects duplicate identifiers, the duplicated records are flagged as an error and are not ingested. This is the number one reason for records to be rejected.
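Because duplicate identifiers are the most common cause of rejected records, it can help to check an export before sending it. The following is a minimal sketch, not iDigBio's ingestion code: it assumes a CSV export whose identifier column is headed <code>occurrenceID</code>, and the function name and sample identifiers are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_occurrence_ids(csv_text, id_column="occurrenceID"):
    """Return {identifier: count} for every identifier used more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column].strip() for row in reader)
    return {ident: n for ident, n in counts.items() if n > 1}


# Hypothetical sample export: the first and third rows share an identifier.
sample = (
    "occurrenceID,catalogNumber\n"
    "urn:uuid:11111111-1111-4111-8111-111111111111,1\n"
    "urn:uuid:22222222-2222-4222-8222-222222222222,2\n"
    "urn:uuid:11111111-1111-4111-8111-111111111111,3\n"
)
duplicates = find_duplicate_occurrence_ids(sample)
```

An empty result means every identifier in the file is unique within that file.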
+Identifier recommendations:
 *a UUID using URI syntax: (lowercase is preferred in the prefix) (preferred format of GUIDs in iDigBio)<br>
@@ Line 131: / Line 149: @@
 ===Permission to ingest===
 *the provider needs to have permission to submit their data
+===Symbiota - GBIF - iDigBio field differences===
+Neil Cobb, PI of the SCAN and LepNet TCNs, contributed this handy [https://www.idigbio.org//sites/default/files/sites/default/files/Symbiota_Fields_DwC_GBIF_iDigBio-2.xlsx explanation] of field use and differences between Symbiota, GBIF and iDigBio.
 ===Data recommendations for optimal searchability and applicability in the aggregate===
-Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched to the GBIF backbone [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] and when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal specimen record.
+We optimize the search experience to make data as consistent and regular as possible. To that end, iDigBio constructs an '''index''' to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. The ''scientific name'' is matched to the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy] to correct typos and older names. See the "Backbone matching" section of [https://www.gbif.org/infrastructure/processing GBIF Data Processing] for information on GBIF's methods. When iDigBio finds an exact or fuzzy match in the GBIF backbone, the matched info is used as the authority to fill in and regularize the taxonomic information in the indexed version of the specimen record. ''Kingdom'', when provided, is used to stop shifting to a different kingdom in the event that the given rank and scientific name would force a change. If not enough clues are found, an identification can land in a completely different place in the taxonomy tree than the provider intended. We encourage providers to supply GBIF with lists and corrections to help GBIF keep the backbone up to date.
-We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://bid.gbif.org/en/community/data-quality/#occurrence. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible.
+We support and encourage you to use the GBIF recommended set of occurrence record fields found here: http://www.gbif.org/publishing-data/quality. We don't have a long list of required fields ([[occurrenceID]], [[institutionCode]], [[scientificName]], [[kingdom]], [[taxonRank]], [[basisOfRecord]]), but we strongly recommend that you address as many of these fields below as possible. See below for further '[https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Taxonomy Taxonomy]' information.
-====Data ownership====
+====Data stewardship / ownership====
 *'''[[institutionCode]]''' and '''ownerInstitutionCode''': we recommend that if you use ownerInstitutionCode in your data that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require the use of institutionCode, it is likely to be the most agreed upon searchable information when thinking about the disparities in a precise institution name. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
 *'''collectionCode''': this is very handy when you have multiple datasets in the portal, for distinguishing taxon group X from taxon group Y (e.g., lichens from bryophytes).
-====Measurements and dates====
-*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four character year. e.g., http://rs.tdwg.org/dwc/terms/#eventDate.
-*'''Meters''': put elevation in METERS units in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, do not include the units with the data).
-====Data tics====
-*'''Escapes''': do not use unescaped newline or tab characters in text fields.
-*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data, Using '?' is not a helpful value, and cannot be searched for.
-*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
 ====Taxonomy====
@@ Line 158: / Line 169: @@
 *'''nomenclaturalCode''': very important when not ICBN or ICZN, e.g., using Phylocode
 *'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
+====Measurements and dates====
+*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four-character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate. If you have any legitimate parts of the eventDate field, parse them out into the numeric individual ''day'', ''month'' and ''year'' fields.
+*'''Meters''': put elevation in METERS in the elevation field without the units (e.g., the fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so do not include the units with the data).
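The eventDate advice above can be sketched as follows. This is only an illustration, assuming eventDate holds a single date written as YYYY, YYYY-MM or YYYY-MM-DD; the function name is hypothetical, and the pattern checks the shape of the value only (it does not reject a month of 13, for example).

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD; month and day are optional.
ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def split_event_date(event_date):
    """Split an ISO 8601 date into numeric year/month/day strings.

    Parts that are not present are returned as empty strings rather
    than as a placeholder such as 0.
    """
    match = ISO_DATE.match(event_date.strip())
    if not match:
        return {"year": "", "month": "", "day": ""}
    year, month, day = match.groups()
    return {
        "year": str(int(year)),
        "month": str(int(month)) if month else "",
        "day": str(int(day)) if day else "",
    }


full = split_event_date("2014-06-22")
partial = split_event_date("2014-06")
```

Leaving the missing parts blank keeps the parsed fields consistent with the guidance on placeholder values.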
+====Data tics====
+*'''Escapes''': do not use unescaped newline or tab characters in text fields.
+*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. Using '?' is not a helpful value, and cannot be searched for.
+*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000', 'unknown' and any other placeholder value.
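A rough pre-export check for the issues listed above might look like the sketch below. The placeholder set is taken from the list above, the function name is hypothetical, and the check flags any raw tab or newline character. Note that '0' is only a problem where it stands in for "no value"; a genuine zero (for example an elevation at sea level) would need to be excluded from this check.

```python
# Placeholder values named in the guidance above, compared case-insensitively.
PLACEHOLDERS = {"0", "?", "na", "00/00/0000", "unknown"}


def check_value(value):
    """Return a list of problems found in one field value."""
    problems = []
    if value.strip().lower() in PLACEHOLDERS:
        problems.append("placeholder")
    if any(ch in value for ch in ("\n", "\r", "\t")):
        problems.append("control character")
    return problems


placeholder_result = check_value("NA")
newline_result = check_value("first line\nsecond line")
clean_result = check_value("Quercus alba")
```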
 ====Geolocation====
-*'''country''': we use the ISO country names from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3 to purify the portal indexed searching. (see data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). For example for the US, the DwC field countryCode = USA and the country = United States of America.
+*'''country''': we recommend using the preferred names from the [http://www.getty.edu/research/tools/vocabularies/tgn/ Getty Thesaurus of Geographic Names] (TGN). (See also the data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags).
-*'''countryCode''': include a 3 character countryCode from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3. For details see:  http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam. The 3-char code is more inclusive than the 2-char code.
+*'''countryCode''': For details see:  http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam.
 *'''continent''': For details see: http://rs.tdwg.org/dwc/terms/#continent
 *'''decimalLatitude''' & '''decimalLongitude''': make sure lat and lon coordinates are in decimal, and not N, S, E, W. For details see:  http://rs.tdwg.org/dwc/terms/#decimalLatitude.
@@ Line 168: / Line 188: @@
 *'''dynamicProperties''': when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#dynamicProperties.
 *'''measurementOrFact''': when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#MeasurementOrFact
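As an illustration of the JSON advice above, the sketch below builds a dynamicProperties value with Python's standard <code>json</code> module and checks that a value parses as a JSON object. The property names and values are illustrative only, and the helper name is hypothetical.

```python
import json

# Illustrative properties; use whatever key names fit your own data.
properties = {"tragusLengthInMeters": 0.014, "weightInGrams": 120}

# Serialize to a single-line JSON string for the dynamicProperties column.
dynamic_properties = json.dumps(properties, sort_keys=True)


def is_json_object(text):
    """Return True if text parses as a JSON object (a key/value mapping)."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


valid = is_json_object(dynamic_properties)
invalid = is_json_object("weight=120g")
```

Generating the string with a JSON library, rather than concatenating text by hand, avoids quoting mistakes.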
+====GenBank and other genetic sequence references====
+*'''associatedSequences'''
+The researchers who use our data are especially appreciative when collections people add genetic sequence identifiers to their specimen records ('|' separated list). For details see: http://rs.tdwg.org/dwc/terms/#associatedSequences
 ====Collection Event====
@@ Line 185: / Line 209: @@
 === Data downloads===
-When your data are downloaded by users of the portal,  both the raw and indexed data are included. For details, see https://www.idigbio.org/content/understanding-idigbios-data-downloads
+When your data are downloaded by users of the portal, both the raw and indexed data are included. Citation and attribution data are also included. For details, see https://www.idigbio.org/content/understanding-idigbios-data-downloads
 ===Using PhyloCode nomenclature===
@@ Line 195: / Line 219: @@
 ==Packaging for images / media objects - identifiers==
 Consult iDigBio's media policy: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 while preparing your media.
-*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not get the usual handling.
+*Firstly, adding a field in the occurrence file for ''associatedMedia'' is the least robust method to relate media with a specimen record and iDigBio discourages this practice. Media that is provided with sufficient metadata will be more useful for downstream users and receive additional handling such as thumbnails in the iDigBio portal. In the future, media may be searchable in iDigBio based on the provided media metadata.
-*Each media record should have a unique (within the dataset) identifier in the ''dcterms:identifier'' field.
+*Each media record should have a unique and persistent identifier in the column defined by http://purl.org/dc/terms/identifier
-*If submitting media records with specimen data records, here are the critical fields to fill in:
+*Columns are defined in the meta.xml so the column headers in the multimedia file itself are a convenience but not actually significant to the meaning or processing of the column.
-** sample of fully-populated AC record
-***'''id (dc:identifier)''' = (this is the coreid field in the Audubon Core extension file), it matches one identifier among the related specimen records <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
+A pristine sample of a minimally-populated Audubon Core (AC) CSV published via an extension in a Darwin Core Archive:
-***'''identifier  (dc:identifier)''' = id of the media record - needs to be unique within Audubon Core file, is the equivalent of the occurrenceID in the occurrence file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.<pre>urn:catalog:institutionCode:collectionCode:Image:catalogNumber</pre>
-***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
+<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
-***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG</pre>
+e24899c2-f13a-4d51-8733-bdf666b390d9,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/00000001,https://creativecommons.org/publicdomain/zero/1.0/,Museum of the USA,John Smith,eng
-***'''providerManagedID (ac:providerManagedID)''' =  if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)</pre>
+</pre>
-Note: dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI.
+Another variation based on real-world data:
+<pre>coreid,identifier,type,format,accessURI,rights,owner,creator,metadataLanguage
+urn:catalog:MUSA:fish:123,32e5da5d-c747-435c-a368-07d989259bf4,StillImage,image/jpeg,http://example.com/IMAGES/00000001.jpg,CC0,Museum of the USA,John Smith,eng
+</pre>
+The column mappings are defined in the accompanying meta.xml.
+If submitting media records with specimen data records, here are the critical fields to fill in:
+*'''coreid''' - If media data are being provided via an extension, the coreid field in the Audubon Core extension file is what links the media record to the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. The value in the extension coreid column will link to a value in the core file "id" column (normally column 0). Examples: <pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre><pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre><pre>123456</pre>
+*'''identifier''' ([http://purl.org/dc/terms/identifier dcterms:identifier] or [http://purl.org/dc/elements/1.1/identifier dc:identifier]) = The persistent and unique id of the media record within the Audubon Core file. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent. If you have multiple types of identifiers for a media record, put the least stable here and the most stable in ac:providerManagedID. Examples: <pre>urn:uuid:84fb24fa-fd15-476a-99a6-a7f876b87d08</pre>
+*'''format''' ([http://purl.org/dc/elements/1.1/format dc:format]) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible). Examples: <pre>image/jpeg</pre><pre>audio/mpeg</pre>
+*'''accessURI''' ([http://rs.tdwg.org/ac/terms/accessURI ac:accessURI]) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page. Examples: <pre>http://example.com/IMAGES/00000001.jpg</pre><pre>http://example.com/objects/987654321</pre>
+*'''providerManagedID''' ([http://rs.tdwg.org/ac/terms/providerManagedID ac:providerManagedID]) = (Optional) If you have a stable UUID GUID for your media records and you have populated "dc:identifier" with a different type of identifier, place the GUID in the optional ac:providerManagedID field. Examples: <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4</pre>
+'''Note:''' dc:terms format and dc:type should match the type of the object returned by ac:accessURI (If ac:accessURI is not present, dc:terms format and dc:type should not be present either), especially in the case where ac:furtherInformationURL is used as an alternative to ac:accessURI. Media embedded on a webpage is considered a webpage and thus will not be treated as media. accessURI should point to the media itself.
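The coreid and identifier requirements above can be checked before packaging. The following is a minimal sketch, not part of iDigBio's tooling: it assumes the core file has a column headed <code>id</code> and the media file has columns headed <code>coreid</code> and <code>identifier</code> (in a real archive the column meanings come from meta.xml, not from the headers). The function name and sample rows are hypothetical.

```python
import csv
import io


def check_media_links(core_csv, media_csv):
    """Return (line number, problem) pairs for the media extension file."""
    core_ids = {row["id"] for row in csv.DictReader(io.StringIO(core_csv))}
    seen_identifiers = set()
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(media_csv)), start=2):
        if row["coreid"] not in core_ids:
            problems.append((line_no, "coreid not found in core file"))
        if row["identifier"] in seen_identifiers:
            problems.append((line_no, "duplicate identifier"))
        seen_identifiers.add(row["identifier"])
    return problems


core = "id,catalogNumber\nurn:catalog:MUSA:fish:123,123\n"
media = (
    "coreid,identifier,format,accessURI\n"
    "urn:catalog:MUSA:fish:123,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,"
    "image/jpeg,http://example.com/IMAGES/00000001.jpg\n"
    "urn:catalog:MUSA:fish:999,urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4,"
    "image/jpeg,http://example.com/IMAGES/00000002.jpg\n"
)
problems = check_media_links(core, media)
```

In this sample the second media row points at a specimen that is not in the core file and reuses the first row's identifier, so both problems are reported for line 3.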
@@ Line 313: / Line 356: @@
 == After your data have been ingested==
-After your data have been ingested the first time, iDigBio staff will let you know, and give you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. The report page has several features that will help you make improvements to your data on subsequent updates:
+After your data have been ingested the first time, iDigBio staff will let you know by sending you a link to your recordset, e.g., http://portal.idigbio.org/portal/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6. This report page has several features that will help you make improvements to your data on subsequent updates:
-* indications of data fields that will improve searchability in the aggregate
+* 'Data Corrected' tab - these are fields that have been updated in our search index layer to help search in the aggregate.
-* set you up to check data use by visitors to the iDigBio portal
+**At the very least, check for the geo-correctness of your georeferenced data: click on any flag to see the records that have been corrected.
-* check for geo-correctness of your georeferenced data. Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
+* 'Data Use' tab - these are individual reports of your data being used by visitors to the portal or by people looking for data using the API.
-The tab related to 'Data Corrected' on this recordset page tells you about fields that have been improved in our search indices only, they are not changes we have made to your data.
+* 'Raw' - this is handy for contrasting what you sent us versus what is in the index layer. When data are downloaded, the resulting DwC-A contains both versions.
+Click on the 'Search Recordset' button to view your records on a map. It is not uncommon for there to be transposed values in the lat/lon fields.
+There is a lot of attention now by data providers, data aggregators and especially by data users/researchers on improving the quality of mobilized data. Each recordset that we ingest has a data quality flags report (see above example link). If the data have also been ingested by GBIF, see their documentation here: https://github.com/gbif/ipt/wiki/dataQualityChecklist#introduction. There is work on-going at TDWG to standardize what is flagged and what it is called.
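Some transposed coordinates can be caught before submission with a simple range check. This is a rough sketch with a hypothetical function name. It only detects a swap when the longitude lies outside the range -90 to 90; a swapped pair in which both values fall inside that range will pass, so viewing the records on a map is still worthwhile.

```python
def coordinate_problems(lat, lon):
    """Return a list of problems for one decimalLatitude/decimalLongitude pair."""
    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    # A "latitude" beyond 90 paired with a "longitude" within 90
    # suggests the two values were swapped.
    if abs(lat) > 90 and abs(lon) <= 90:
        problems.append("possibly transposed")
    return problems


ok = coordinate_problems(35.2, -111.6)
swapped = coordinate_problems(-111.6, 35.2)
```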
 ==Error handling==
@@ Line 323: / Line 370: @@
 Once the evaluation is successful, the ingestion process moves from '''mobilizing''' to '''ingesting''', and the data are submitted to the ingestion scripts by the cyberinfrastructure staff. If an error condition occurs, the staff evaluate whether it is a script error or a data error. If it is the latter, the staff sends an email to the mobilizing staff who may contact the provider for changes. When the errors have been addressed, the '''mobilizing''' staff re-submit the data to the '''ingesting''' staff.
 ==iDigBio IPT hosting==
-If you would like iDigBio to host your datasets, we have a GBIF registered resource here: http://ipt.idigbio.org
+If you are unable to host your own IPT, please contact us to discuss available options.
+We have a GBIF registered resource here: http://ipt.idigbio.org
+If this option is found to be appropriate for your data, we can also assist you in getting GBIF USGS endorsement for your data in our IPT. Please note that this endorsement is applicable to US-based datasets only.
-We can also assist you in getting GBIF USGS endorsement for your data in our IPT (applicable to US-based datasets only).
+Please [https://www.idigbio.org/contact/Caitlin_%E2%80%9CCat%E2%80%9D_Chapman contact us]  if you are interested in discussing this option for your data.
-When sending us CSV updates to your dataset hosted in this IPT, please send the whole dataset again, not only the updates. We do a replace records operation between the current data and the new data, rather than an append. Be mindful that if you re adding fields to your export, to put them at the end of the columns in your CSV, because IPT is not set up to look for your mapped fields in a different order than the original.
+'''If you have data currently hosted on our IPT and need to send us updates''': when sending us CSV updates to your dataset, please send the whole dataset again, not only the updates. We do a replace records operation between the current data and the new data, rather than an append. If you are adding fields to your export, be sure to put them at the end of the columns in your CSV, because IPT is not set up to look for your mapped fields in a different order than the original.
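If your export tool does not let you control column order, the file can be rearranged afterwards. The sketch below is one possible approach, assuming you kept the header of the originally mapped CSV; the function name and sample columns are hypothetical. It writes the original columns first, in their original order, and appends any new columns at the end.

```python
import csv
import io


def reorder_export(csv_text, original_header):
    """Rewrite a CSV so original columns come first and new ones are appended."""
    reader = csv.DictReader(io.StringIO(csv_text))
    new_columns = [c for c in reader.fieldnames if c not in original_header]
    fieldnames = list(original_header) + new_columns
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# The new export added recordedBy in the middle of the columns.
export = "occurrenceID,recordedBy,catalogNumber\nA,Smith,1\n"
reordered = reorder_export(export, ["occurrenceID", "catalogNumber"])
```

A column that was in the original header but is missing from the new export is written as an empty column, which keeps the positions of the remaining columns stable.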
 == Sample scenarios of data transformations to prepare data for ingestion  ==
@@ Line 350: / Line 401: @@
 *https://www.idigbio.org/portal/publishers
 ===Provider assistance===
-*[[Media:ImageIngestionCheatSheet_Sheet1.pdf| How to use the image ingestion appliance and link to specimen records : image ingestion cheatsheet]]
+*<s>[[Media:ImageIngestionCheatSheet_Sheet1.pdf| How to use the image ingestion appliance and link to specimen records : image ingestion cheatsheet]]</s>
 *[[Media:GUIDgeneration.pdf| How to generate a UUID GUID in an Excel spreadsheet]]
 <pre>
@@ Line 377: / Line 428: @@
 ***data successfully ingested, ready for consumption in the portal
 ***report sent back to data mobilizing staff
-***report sent to provider. Reference: [https://www.idigbio.org/portal/publishers Publishers Report]
+***report sent to provider. Reference: [https://www.idigbio.org/portal/publishers iDigBio portal Publishers page]
 ***Redmine ticket set to Status= Closed
