Talk:Data Ingestion Guidance

From iDigBio

TO DO:


Add links to the term definitions when they are mentioned.

e.g.

dc:identifier for "dc:identifier"


Some DRAFT changes...


   <coreid index="0" />
   <field index="1" term="http://purl.org/dc/terms/identifier"/>
   <field index="2" term="http://purl.org/dc/terms/type"/>
   <field index="3" term="http://purl.org/dc/terms/format"/>
   <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
   <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
   <field index="7" term="http://purl.org/dc/terms/creator"/>
   <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
   <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
   <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
   <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="13" term="http://purl.org/dc/terms/format"/>


Packaging for images / media objects

Consult iDigBio's media policy (https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1) and GBIF's media policy while preparing your media.

  • First, adding an associatedMedia field to the occurrence file is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a web page, will not
  • Each media record should have a unique (within the dataset) identifier in the identifier field.
  • If providing media records with specimen data records, here are the important fields to fill in
    • sample of fully-populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations)
      • id (coreid) = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core.
        UUID GOES HERE
        urn:catalog:institutionCode:collectionCode:catalogNumber
      • identifier (dcterms:identifier or dc:identifier) = id of the media record. It needs to be unique within the Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, so a URL is not a persistent identifier.
        UUID GOES HERE
        URL goes here
      • type (dcterms:type) = ....
        StillImage
      • format (dc:format) = Media Type / MIME Type (from the controlled vocabulary at http://www.iana.org/assignments/media-types/media-types.xhtml if possible)
        image/jpeg
      • accessURI (ac:accessURI) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page.
        http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG
      • providerManagedID (ac:providerManagedID) = if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field.
        urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI: if ac:accessURI is not present, dcterms:format and dc:type should not be present either.
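The identifier and format rules above can be checked before submission. The following is only a sketch, not iDigBio's ingestion code: the function name check_media_rows and the short column keys (identifier, type, format, accessURI) are assumptions made for illustration, and the media type is guessed from the file extension of the accessURI rather than by fetching the resource.

```python
import mimetypes
from collections import Counter
from urllib.parse import urlparse


def check_media_rows(rows):
    """Return (row_number, message) problems found in Audubon Core rows.

    Each row is a dict with the hypothetical keys
    identifier, type, format and accessURI.
    """
    problems = []
    # Identifiers must be unique within the Audubon Core file.
    counts = Counter(row.get("identifier", "") for row in rows)
    for number, row in enumerate(rows, start=1):
        identifier = row.get("identifier", "")
        if not identifier:
            problems.append((number, "missing identifier"))
        elif counts[identifier] > 1:
            problems.append((number, "duplicate identifier"))

        access_uri = row.get("accessURI", "")
        media_format = row.get("format", "")
        if not access_uri:
            # Without accessURI, format and type should be absent too.
            if media_format or row.get("type"):
                problems.append((number, "format/type given without accessURI"))
            continue
        # Guess from the file extension only; no network request is made.
        guessed, _ = mimetypes.guess_type(urlparse(access_uri).path)
        if guessed is None:
            problems.append((number, "cannot guess media type from accessURI"))
        elif media_format and guessed != media_format.lower():
            problems.append(
                (number, f"format {media_format} does not match {guessed}")
            )
    return problems
```

An extension-based guess cannot catch a URL that ends in .JPG but serves a web page, so a stricter check would have to request the resource and compare its Content-Type header.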


Here are further recommended fields to fill in:

AC Term | Sample data | Notes
ac:associatedSpecimenReference | 0e1e12ed-2261-42db-8719-ee98532dab06 | A reference to a specimen associated with this resource.
dc:rights or dcterms:rights | dc:rights - "CC BY-NC"; dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/; preferred - dcterms:rights |
ac:licenseLogoURL | http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png |
xmpRights:Owner | New York Botanical Garden | A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
dc:creator | "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden" | The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
dc:type | StillImage, Sound, MovingImage |
dcterms:title | herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes |
  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media object. The more detail you provide about the image, the easier it will be to retrieve. When only media are given to iDigBio, the best practice is to use the taxonomic and geographic fields to capture as much information as possible.
  • License: As with catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA elements of a CC media license is fine with us; however, ND (no derivatives) is not acceptable and will cause the media to be rejected.
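The license rule above can be screened with a small check on the dcterms:rights URL. This is a sketch under stated assumptions: the function name license_is_acceptable is hypothetical, and it assumes license URLs follow the usual Creative Commons shapes (/licenses/by-nc/4.0/ or /publicdomain/zero/1.0/). It is not the check iDigBio actually runs.

```python
from urllib.parse import urlparse

CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


def license_is_acceptable(license_url):
    """Return True for public-domain/CC0 or CC licenses without ND."""
    parsed = urlparse(license_url.strip().lower())
    if parsed.netloc not in CC_HOSTS:
        return False  # not a Creative Commons URL; needs manual review
    parts = [part for part in parsed.path.split("/") if part]
    # Assumed shape for public-domain tools: /publicdomain/zero/1.0/
    if len(parts) >= 2 and parts[0] == "publicdomain":
        return True
    # Assumed shape for licenses: /licenses/by-nc-sa/4.0/
    if len(parts) >= 2 and parts[0] == "licenses":
        elements = parts[1].split("-")
        return "by" in elements and "nd" not in elements
    return False
```

For example, license_is_acceptable("http://creativecommons.org/licenses/by-nc/4.0/") is accepted, while any URL whose license code contains nd is rejected.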

Possible licenses:

If you are not using the IPT and are delivering only one recordset, generate a meta.xml file by hand and package the files in a DwC-A-like format. (No eml.xml is required; contact information and the recordset description can be sent by email.)
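A hand-built meta.xml can also be generated with a short script. The sketch below assumes an occurrence core plus one Audubon Core extension, comma-separated files with one header line, and the id/coreid in column 0; the function names, file names, and CSV settings are illustrative assumptions, not requirements stated on this page.

```python
import os
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta_xml(core_file, core_terms, ext_file, ext_terms):
    """Return meta.xml text for one occurrence core and one AC extension.

    core_terms and ext_terms are term URIs for columns 1..n;
    column 0 is assumed to hold the id (core) or coreid (extension).
    """
    ET.register_namespace("", DWC_TEXT_NS)
    archive = ET.Element(f"{{{DWC_TEXT_NS}}}archive")

    def add_table(tag, id_tag, row_type, location, terms):
        table = ET.SubElement(archive, f"{{{DWC_TEXT_NS}}}{tag}", {
            "encoding": "UTF-8",
            "fieldsTerminatedBy": ",",
            "linesTerminatedBy": "\\n",  # literal backslash-n in the XML
            "fieldsEnclosedBy": '"',
            "ignoreHeaderLines": "1",
            "rowType": row_type,
        })
        files = ET.SubElement(table, f"{{{DWC_TEXT_NS}}}files")
        ET.SubElement(files, f"{{{DWC_TEXT_NS}}}location").text = location
        ET.SubElement(table, f"{{{DWC_TEXT_NS}}}{id_tag}", {"index": "0"})
        for index, term in enumerate(terms, start=1):
            ET.SubElement(table, f"{{{DWC_TEXT_NS}}}field",
                          {"index": str(index), "term": term})

    add_table("core", "id", "http://rs.tdwg.org/dwc/terms/Occurrence",
              core_file, core_terms)
    add_table("extension", "coreid", "http://rs.tdwg.org/ac/terms/Multimedia",
              ext_file, ext_terms)
    return ET.tostring(archive, encoding="unicode")


def package_archive(zip_path, meta_xml, data_paths):
    """Write meta.xml and the data files into the top level of a zip file."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("meta.xml", meta_xml)
        for path in data_paths:
            archive.write(path, arcname=os.path.basename(path))
```

Numbering the fields with enumerate gives every column a distinct index, which avoids the repeated index values that are easy to introduce when a meta.xml file is edited by hand.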

Best practice for getting Audubon Core images linked to specimen records - special cases

Relationship | Supported by | Core Type | Extensions
One-specimen-record-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core
Many-specimen-records-to-one-media file | IPT 2.2/Custom DwC-A | Audubon Core | Specimen (DwC)
Many-specimen-records-to-many-media files | IPT 2.1/Custom DwC-A | Specimen (DwC) | Audubon Core + Relationship

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier
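To illustrate how a core file and an extension file relate, the sketch below groups extension rows under the core record whose identifier matches their coreid (the one-specimen-record-to-many-media case) and reports rows that match nothing. The function name and the column headers id and coreid are assumptions made for the example; real files may label these columns differently.

```python
import csv
import io
from collections import defaultdict


def link_media_to_specimens(core_csv, extension_csv):
    """Group extension rows by the core record their coreid points to.

    Returns (linked, orphans): linked maps a core id to its media rows,
    orphans lists media rows whose coreid matches no core record.
    """
    core_ids = {row["id"] for row in csv.DictReader(io.StringIO(core_csv))}
    linked = defaultdict(list)
    orphans = []
    for row in csv.DictReader(io.StringIO(extension_csv)):
        if row["coreid"] in core_ids:
            linked[row["coreid"]].append(row)
        else:
            orphans.append(row)
    return dict(linked), orphans
```

A non-empty orphans list usually means the extension's coreid column does not carry the same identifier the core file uses in its id column.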