Data Ingestion Guidance: Difference between revisions



===Data recommendations for optimal searchability and applicability in the aggregate===
Optimizing the search experience means that data need to be as consistent and regular as possible. To that end, iDigBio constructs an index layer to accompany your as-offered 'raw' data. The results of that index-building exercise are reflected in the data quality flag report that accompanies every ingested dataset. When taxon ranks are missing, the scientific name is matched against the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy], and when an exact or fuzzy match is found, it is used as the authority to fill in and regularize the taxonomic information in the portal specimen record.
We support and encourage the use of the GBIF recommended set of occurrence record fields found here: http://bid.gbif.org/en/community/data-quality/#occurrence
*'''institutionCode''' and '''ownerInstitutionCode''': if you use ownerInstitutionCode in your data, we recommend that you also fill in institutionCode. The former is typically used to indicate that the specimen is at location 'x' while the record is being provided by institution 'y'. While we do not require institutionCode, it is likely to be the most widely agreed upon searchable value, given how much precise institution names vary. Use it consistently in your occurrence records and follow the Index Herbariorum or the ASIH codes.
====Measurements and dates====
*'''eventDate''': put dates in [http://www.w3.org/TR/NOTE-datetime  ISO 8601] format, i.e., YYYY-MM-DD, e.g., 2014-06-22. The critical element in this date is a four character year. For details see: http://rs.tdwg.org/dwc/terms/#eventDate.
*'''Meters''': put elevation values in meters, as bare numbers without units. The fields ''dwc:minimumElevationInMeters'' and ''dwc:maximumElevationInMeters'' already assume the numeric values are in meters, so do not include the units with the data.
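The date and elevation recommendations above can be checked before export. The following is a minimal sketch only: the function names and the list of source formats are hypothetical and would need to match whatever your collection database actually exports. It assumes that a date without a four character year should be left for review rather than guessed.

```python
import re
from datetime import datetime

# Hypothetical list of formats found in a source database; adjust as needed.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def to_iso_date(raw):
    """Return raw as YYYY-MM-DD, or None if it cannot be parsed.

    %Y requires a four digit year, so two digit years are rejected.
    """
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def elevation_in_meters(raw):
    """Return a bare number in meters from text such as '1200 m' or '100 ft'.

    Returns None when the text is not a single number with an optional unit.
    """
    match = re.match(r"^\s*(-?\d+(?:\.\d+)?)\s*(m|meters|ft|feet)?\s*$", raw, re.I)
    if match is None:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "m").lower()
    if unit in ("ft", "feet"):
        value *= 0.3048  # one international foot is exactly 0.3048 m
    return round(value, 1)
```

Values that come back as None are better left empty in the export, with an explanation in a remarks field, than filled with a placeholder.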
====Data tics====
*'''Escapes''': do not use unescaped newline or tab characters in text fields.
*'''Data uncertainty''': use the remarks fields to express doubt or missing values in data. A value such as '?' is not helpful and cannot be searched for.
*'''No '0'''': do not export '0' in fields to represent no value, e.g., lat or lon. This caution applies to '?', 'NA', '00/00/0000' and any other placeholder value.
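The placeholder and escaping cautions in this section can be applied as a cleaning pass before export. This is a rough sketch under stated assumptions, not iDigBio's own processing: the function names are hypothetical, placeholders are matched case-insensitively, newlines and tabs are replaced with spaces rather than escaped, and a coordinate pair of 0, 0 is treated as a stand-in for "no value".

```python
# Placeholder strings named in the guidance above, lowercased for comparison.
PLACEHOLDERS = {"?", "na", "00/00/0000"}


def clean_value(value):
    """Collapse newlines and tabs to single spaces and blank out placeholders."""
    text = " ".join(value.split())  # split() also handles \r, \n and \t
    return "" if text.lower() in PLACEHOLDERS else text


def clean_coordinates(lat, lon):
    """Blank a 0, 0 pair, which this sketch assumes means 'no value'."""
    lat, lon = clean_value(lat), clean_value(lon)
    try:
        if float(lat) == 0 and float(lon) == 0:
            return "", ""
    except ValueError:
        pass  # empty or non-numeric values are returned unchanged
    return lat, lon
```

A single zero is kept, since 0 is a legitimate latitude or longitude on its own; only the combined 0, 0 pair is blanked here.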
====Taxonomy====
*'''scientificName''': combine taxon ranks into the identification value, and include author and year if applicable.
*'''genus''', '''specificEpithet''', '''infraspecificEpithet''' & '''taxonRank''': parse taxon ranks.  
**Note: if the identification is something like ''Aeus sp.'', the taxonRank=genus.  
**Note: the value of taxonRank must be a rank that is a Darwin Core term. Many super/sub/infra ranks are not valid in this case. Put them instead into the higherClassification amalgamated string.
*'''kingdom''': include kingdom and other high level ranks ('''phylum/division''', '''class''', and '''order''' where applicable) to assure that the indexing layer will remain faithful to your data as ingested. Our data quality flags will indicate when any of the original ranks in the data do not match the taxon names in the GBIF backbone.
*'''family''': include family. If higher ranks are not included in your data, we will intuit those ranks from family up for better searchability in our index using the [http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c GBIF backbone taxonomy]. Higher taxonomy is NOT intuited in the case where the DwC identification history extension is included in your archive.
*'''higherClassification''': include parsed higher taxonomy classification, at least kingdom and family, and the intervening ranks if possible. For details see: http://rs.tdwg.org/dwc/terms/#higherClassification.
*'''nomenclaturalCode''': include this field; it is very important when the code is not ICBN or ICZN, e.g., when using Phylocode.
*'''vernacularName''': include common names for broader audience findability. For details see:  http://rs.tdwg.org/dwc/terms/#vernacularName
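The parsing advice for genus, specificEpithet and taxonRank can be illustrated with a small helper. This is a simplified sketch that assumes a whitespace-separated name without authorship; the function name and the set of "sp." abbreviations are hypothetical, and names with more than two parts (infraspecific ranks, authors, hybrids) are deliberately left without a rank so they can be reviewed by hand.

```python
# Abbreviations this sketch treats as "identified to genus only".
OPEN_NOMENCLATURE = {"sp.", "sp", "spp.", "spp"}


def parse_identification(name):
    """Split a simple identification into Darwin Core style fields."""
    parts = name.split()
    record = {"genus": "", "specificEpithet": "", "taxonRank": ""}
    if not parts:
        return record
    record["genus"] = parts[0]
    if len(parts) == 1 or parts[1].lower() in OPEN_NOMENCLATURE:
        # e.g. "Aeus sp." is an identification at genus rank
        record["taxonRank"] = "genus"
    elif len(parts) == 2:
        record["specificEpithet"] = parts[1]
        record["taxonRank"] = "species"
    # Longer names keep an empty taxonRank for manual review.
    return record
```

For example, "Aeus sp." yields genus "Aeus" with taxonRank "genus", matching the note above.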
 
====Geolocation====
*'''country''': we use the ISO country names from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3 to standardize country values in the portal's search index (see data quality flags: https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). For example, for the US, the DwC field countryCode = USA and country = United States of America.
*'''countryCode''': include a 3 character countryCode from here: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3. For details see:  http://rs.tdwg.org/dwc/terms/#countryCode. Using a code for country aids in situations where the correct spelling and timeframe of collection location is not known, e.g., Thailand, Siam. The 3-char code is more inclusive than the 2-char code.
*'''continent''': For details see: http://rs.tdwg.org/dwc/terms/#continent
*'''decimalLatitude''' & '''decimalLongitude''': make sure lat and lon coordinates are in decimal degrees, with negative values for south and west rather than N, S, E, W letters. For details see:  http://rs.tdwg.org/dwc/terms/#decimalLatitude.
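Coordinates stored as degrees, minutes and seconds with a hemisphere letter can be converted to signed decimal degrees before export. The sketch below is one possible approach; the function name is hypothetical, and it assumes the text contains one to three numbers followed by a trailing N, S, E or W.

```python
import re


def to_decimal_degrees(text):
    """Convert text such as '82 19 30 W' to signed decimal degrees.

    South and west become negative. Raises ValueError when the text does
    not fit the assumed pattern.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    hemisphere = re.search(r"([NSEW])$", text.strip().upper())
    if not numbers or len(numbers) > 3 or hemisphere is None:
        raise ValueError("unrecognized coordinate: %r" % text)
    # Pad missing minutes and seconds with zero.
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere.group(1) in ("S", "W"):
        value = -value
    return round(value, 6)
```

Values that raise ValueError are best reviewed by hand rather than exported as a placeholder.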
====Aggregating data within a record====
*'''dynamicProperties''': when including data in the dynamicProperties field, please use JSON format. For details see: http://rs.tdwg.org/dwc/terms/#dynamicProperties.
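As an illustration of the JSON recommendation, the sketch below serializes a set of extra measurements into a single dynamicProperties string and checks whether an existing value parses as a JSON object. The function names and the example keys are hypothetical; only the standard library json module is used.

```python
import json


def build_dynamic_properties(properties):
    """Serialize a dict of extra properties as one JSON object string."""
    return json.dumps(properties, sort_keys=True, ensure_ascii=False)


def is_valid_dynamic_properties(text):
    """Return True if text parses as a JSON object (key/value pairs)."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Hypothetical example keys, for illustration only.
example = build_dynamic_properties({"weightInGrams": 120, "tragusLengthInMeters": 0.014})
```

A free-text value such as "weight=120g" fails the check, which makes it easy to find records that need reformatting before they are published.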
====Collection Event====
*'''recordNumber''' or '''fieldNumber''': in our experience botanists use recordNumber and all others who have collection events use fieldNumber.
====Other fields for completeness that can be configured as defaults in IPT for all records====
*'''basisOfRecord'''="PreservedSpecimen" or "FossilSpecimen". For details see: http://rs.tdwg.org/dwc/terms/#basisOfRecord
*'''type'''="Physical Object" For details see: http://rs.tdwg.org/dwc/terms/#type
*'''language'''= "en" For details see: http://rs.tdwg.org/dwc/terms/#language
====Dataset metadata (information about the dataset as a whole)====
If you are building an archive via IPT, Symbiota or some other means, be sure to include projectID (the grant number) in the EML file (on the 'Project Data' tab in IPT). This greatly improves the chance that your data are attributed correctly and completely when researchers use them.
*ProjectID


Anyone considering contributing data should read these [[Data_Problems |anecdotes]]. They come from users of iDigBio's aggregated data, and reveal issues of data quality.