Talk:Data Ingestion Guidance: Difference between revisions

From iDigBio
(28 intermediate revisions by 3 users not shown)
--[[User:Dpaul|Dpaul]] ([[User talk:Dpaul|talk]]) 16:49, 9 January 2014 (EST): Regarding data@idigbio.org
TO DO:
Now that DP has a gator link, does this mean I can be added to the mailing list that "sees" data@idigbio.org?


Where do users send email if they have a question
data@idigbio.org
the feedback (or both)?


(At Morphbank, to keep this straightforward, all email requests for help go to morphbank@scs.fsu.edu. We do not have two separate paths for help and issue requests.) I am assuming (I think) that clicking to send "feedback" generates a Redmine ticket (efficient and transparent). But data@idigbio.org is not transparent.
Add links to the term definitions when they are mentioned.


About this Section
e.g.


When your data are ready for ingestion, please register your collection here. You will need to have an established login already, and be logged in. Here are the steps to take:
[http://purl.org/dc/terms/identifier dc:identifier] for "dc:identifier"


    http://portal.idigbio.org/register


Log into iDigBio if you have not already done so. Your regular iDigBio.org login credentials are sufficient; there is no special login for the portal:


    https://www.idigbio.org/auth/login.php
Some DRAFT changes...


Once you are logged in, go to this link:


    http://portal.idigbio.org/register
<pre>
  <coreid index="0" />
  <field index="1" term="http://purl.org/dc/terms/identifier"/>
  <field index="2" term="http://purl.org/dc/terms/type"/>
  <field index="3" term="http://purl.org/dc/terms/format"/>
  <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
  <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
  <field index="7" term="http://purl.org/dc/terms/creator"/>
  <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
  <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
  <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
  <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="13" term="http://purl.org/dc/terms/format"/>


If you are already on the portal page, the 'Register A Collection' is in the menu under your login name in the upper right of the page.
</pre>
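The draft field list above reuses indexes 6, 7, and 8 and lists the WebStatement and format terms twice. A minimal sketch of a check for this, assuming the field declarations are wrapped in a hypothetical <code>extension</code> element so that they parse as XML (the function name <code>find_duplicates</code> is illustrative and not part of any iDigBio tool):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The <extension> wrapper is an assumption added so the snippet parses;
# the <coreid> and <field> lines are copied from the draft above.
DRAFT = """
<extension>
  <coreid index="0" />
  <field index="1" term="http://purl.org/dc/terms/identifier"/>
  <field index="2" term="http://purl.org/dc/terms/type"/>
  <field index="3" term="http://purl.org/dc/terms/format"/>
  <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
  <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
  <field index="7" term="http://purl.org/dc/terms/creator"/>
  <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
  <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
  <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
  <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="13" term="http://purl.org/dc/terms/format"/>
</extension>
"""


def find_duplicates(xml_text):
    """Return (duplicate indexes, duplicate terms) among the <field> elements."""
    root = ET.fromstring(xml_text)
    fields = root.findall("field")
    index_counts = Counter(f.get("index") for f in fields)
    term_counts = Counter(f.get("term") for f in fields)
    dup_indexes = sorted((i for i, n in index_counts.items() if n > 1), key=int)
    dup_terms = sorted(t for t, n in term_counts.items() if n > 1)
    return dup_indexes, dup_terms


dup_indexes, dup_terms = find_duplicates(DRAFT)
print("duplicate indexes:", dup_indexes)
print("duplicate terms:", dup_terms)
```

Running a check like this before packaging would flag the repeated entries so each column index and each term is declared only once.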


----


When your data are ready for ingestion, please see the next steps.


#Get an iDigBio account for yourself (if you don't have one yet). https://www.idigbio.org/auth/login.php
==Packaging for images / media objects==
##These are the only login credentials you will need.
Consult iDigBio's media policy https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 and GBIF's  while preparing your media.
#Log in with your iDigBio account username and password. https://www.idigbio.org/auth/login.php
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not
#Register your collection. http://portal.idigbio.org/register OR
*Each media record should have a unique (within the dataset) identifier in the ''identifier'' field.
#Register your collection at [http://grbio.org/ GRBIO]
*If providing media records with specimen data records, here are the important fields to fill in
:::Repository: http://grbio.org/find-biorepositories OR
** sample of fully-populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations)
:::Institutional Collections: http://grbio.org/find-institutional-collections
***'''id (coreid)''' = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. <pre>UUID GOES HERE</pre><pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
#If you are already on the portal page, the 'Register A Collection' is in the menu under your login name in the upper right of the page.
***'''identifier  (dcterms:identifier or dc:identifier)''' = ID of the media record. It needs to be unique within the Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, so a URL is not a persistent identifier.<pre>UUID GOES HERE</pre> <pre>URL goes here</pre>
***'''type  (dcterms:type)''' = .... <pre>StillImage</pre>
***'''format (dc:format)''' = Media Type / MIME Type (from the http://www.iana.org/assignments/media-types/media-types.xhtml controlled vocabulary if possible) <pre>image/jpeg</pre>
***'''accessURI (ac:accessURI)''' = direct HTTP link to the media file. Note that the media type (format) '''must''' match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG</pre>
***'''providerManagedID (ac:providerManagedID)''' = If you have a UUID (GUID) for your media records, assign it to the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4  (optional)</pre>
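One way to avoid URL-based identifiers is to mint a UUID once per media record and store it with the record so that it never changes. A minimal sketch, assuming media rows are held as dictionaries with an <code>identifier</code> key; the helper name <code>assign_identifiers</code> and the second sample URL are hypothetical:

```python
import uuid


def assign_identifiers(rows):
    """Give each media row a UUID identifier unless it already has one."""
    for row in rows:
        # Existing identifiers are kept, so re-running this never changes them.
        if not row.get("identifier"):
            row["identifier"] = str(uuid.uuid4())
    return rows


rows = [
    {"identifier": "32e5da5d-c747-435c-a368-07d989259bf4",
     "accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG"},
    # Hypothetical second record that has no identifier yet.
    {"identifier": "",
     "accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000002.JPG"},
]
assign_identifiers(rows)
for row in rows:
    print(row["identifier"], row["accessURI"])
```

The important design choice is that the identifier is saved with the record after it is minted; generating a fresh UUID on every export would be no more persistent than a URL.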
Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. If ac:accessURI is not present, dcterms:format and dc:type should not be present either. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI.
 
 
 
Here are further recommended fields to fill in:
{| class="wikitable" border="1"
|-
! scope="col" width="15%" | AC  Term
! scope="col" width="45%" class="sortable"| Sample data
! scope="col" width="45%" class="sortable"| Notes
|-
|valign="top"|ac:associatedSpecimenReference
|valign="top"|0e1e12ed-2261-42db-8719-ee98532dab06
|valign="top"|A reference to a specimen associated with this resource.
|-
|valign="top"|dc:rights or dcterms:rights
|valign="top"|dc:rights -  "CC BY-NC"<br>
dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/
|valign="top"|preferred - dcterms:rights
|-
|valign="top"|ac:licenseLogoURL
|valign="top"|http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png
|valign="top"|
|-
|valign="top"|xmpRights:Owner
|valign="top"|New York Botanical Garden
|valign="top"|A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
|-
|valign="top"|dc:creator
|valign="top"|"New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden"
|valign="top"|The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
|-
|valign="top"|dc:type
|valign="top"| StillImage, Sound, MovingImage
|valign="top"|
|-
|valign="top"|dcterms:title
|valign="top"|herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes
|valign="top"|
|-
|}
*'''Note to aggregators''': In the case where the data are coming from an aggregator, an additional ''recordId'' field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
*'''Terms''': Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one row for each media record. The more detail you provide about the image, the easier it will be to retrieve. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
*'''License''': As with catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is fine with us. ND (no derivatives) is not acceptable and will cause the media to be rejected.
Possible licenses:
* CC0: http://creativecommons.org/publicdomain/zero/1.0/
* CC BY: http://creativecommons.org/licenses/by/4.0/
* CC BY-NC: https://creativecommons.org/licenses/by-nc/4.0/
*The media records represent a one-to-one relationship between the media object (the fit-for-display best quality JPG, in the case of images, for example) and the specimen record. There is no need to include links to any other forms of the media, like an enclosing webpage, or thumbnails. Below is some guidance on handling special cases. If none of these media attachment rules make sense to you, please get in touch with us for further assistance.
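The license rule above (CC0, or any combination of BY, NC, and SA, but never ND) can be sketched as a check on the license URL. This is an illustration only: the function name <code>license_is_acceptable</code> is hypothetical, and it assumes licenses are given as creativecommons.org URLs like the ones listed above.

```python
from urllib.parse import urlparse

ALLOWED_ELEMENTS = {"by", "nc", "sa"}
CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def license_is_acceptable(url):
    """Return True for CC0 or a CC license built only from BY, NC, and SA."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in CC_HOSTS:
        return False
    parts = [p for p in parsed.path.split("/") if p]
    # CC0 lives under /publicdomain/zero/<version>/
    if parts[:2] == ["publicdomain", "zero"]:
        return True
    # Other licenses live under /licenses/<elements>/<version>/
    if len(parts) >= 2 and parts[0] == "licenses":
        elements = set(parts[1].lower().split("-"))
        return bool(elements) and elements <= ALLOWED_ELEMENTS
    return False


for url in [
    "http://creativecommons.org/publicdomain/zero/1.0/",
    "http://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by-nc/4.0/",
    "https://creativecommons.org/licenses/by-nd/4.0/",
]:
    print(url, license_is_acceptable(url))
```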
If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. No eml.xml is required; contact info and the recordset description can be sent by email.
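For those who prefer to script the hand-written meta.xml, here is a minimal sketch that builds one for a specimen core with an Audubon Core extension. The file names (<code>occurrence.csv</code>, <code>multimedia.csv</code>), the chosen columns, and the CSV settings are assumptions for illustration; adjust them to match your own files.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def q(tag):
    """Qualify a tag name with the Darwin Core text namespace."""
    return f"{{{DWC_TEXT_NS}}}{tag}"


def add_table(archive, tag, id_tag, row_type, location, terms):
    """Add a <core> or <extension> element describing one CSV file."""
    table = ET.SubElement(archive, q(tag), {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": row_type,
    })
    files = ET.SubElement(table, q("files"))
    ET.SubElement(files, q("location")).text = location
    # Column 0 holds the record id (core) or the link to the core (extension).
    ET.SubElement(table, q(id_tag), {"index": "0"})
    # Remaining columns are numbered from 1, one unique index per term.
    for index, term in enumerate(terms, start=1):
        ET.SubElement(table, q("field"), {"index": str(index), "term": term})
    return table


def build_meta_xml():
    """Return meta.xml text for a specimen core plus Audubon Core extension."""
    ET.register_namespace("", DWC_TEXT_NS)
    archive = ET.Element(q("archive"))
    add_table(archive, "core", "id",
              "http://rs.tdwg.org/dwc/terms/Occurrence", "occurrence.csv", [
                  "http://rs.tdwg.org/dwc/terms/institutionCode",
                  "http://rs.tdwg.org/dwc/terms/collectionCode",
                  "http://rs.tdwg.org/dwc/terms/catalogNumber",
              ])
    add_table(archive, "extension", "coreid",
              "http://rs.tdwg.org/ac/terms/Multimedia", "multimedia.csv", [
                  "http://purl.org/dc/terms/identifier",
                  "http://purl.org/dc/terms/type",
                  "http://purl.org/dc/terms/format",
                  "http://rs.tdwg.org/ac/terms/accessURI",
              ])
    return ET.tostring(archive, encoding="unicode")


print(build_meta_xml())
```

Numbering the fields with <code>enumerate</code> keeps every index unique, which avoids the kind of repeated index that is easy to introduce when editing the file by hand.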
 
===Best practice for getting Audubon Core images linked to specimen records - special cases===
 
{| class="wikitable unsortable" border="1"
|-
! scope="col" width="25%" class="unsortable" | Relationship
! scope="col" width="25%" class="unsortable" | Supported by
! scope="col" width="25%" class="unsortable" | Core Type
! scope="col" width="25%" class="unsortable" | Extensions
|-
|valign="top"|One-specimen-record-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core
|-
|valign="top"|Many-specimen-records-to-one-media file
|valign="top"|IPT 2.2/Custom DwC-A
|valign="top"|Audubon Core
|valign="top"|Specimen (DwC)
|-
|valign="top"|Many-specimen-records-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core + Relationship
|-
|}
 
Keep in mind that:
* DwC-A is a set of files: a core type + a number of extensions
* All files/tables (core or extension) need to have a unique identifier
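Both points can be checked before a dataset is sent. The sketch below reads a core file and a media extension file, reports repeated identifiers, and reports media rows whose coreid does not match any core record. The column names (<code>id</code>, <code>coreid</code>, <code>identifier</code>) and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice these would be read from the CSV files.
CORE_CSV = """id,catalogNumber
a1,LONN00000001
a2,LONN00000002
"""

MEDIA_CSV = """coreid,identifier
a1,m1
a1,m2
a3,m2
"""


def read_column(csv_text, column):
    """Return all values of one named column from CSV text."""
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]


def duplicates(values):
    """Return the sorted values that occur more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def check_identifiers(core_csv, media_csv):
    """Report duplicate identifiers and media rows that link to no core record."""
    core_ids = read_column(core_csv, "id")
    media_ids = read_column(media_csv, "identifier")
    coreids = read_column(media_csv, "coreid")
    return {
        "duplicate_core_ids": duplicates(core_ids),
        "duplicate_media_ids": duplicates(media_ids),
        "orphan_coreids": sorted(set(coreids) - set(core_ids)),
    }


report = check_identifiers(CORE_CSV, MEDIA_CSV)
print(report)
```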

Latest revision as of 16:10, 2 June 2017

TO DO:


Add links to the term definitions when they are mentioned.

e.g.

dc:identifier for "dc:identifier"


Some DRAFT changes...


   <coreid index="0" />
   <field index="1" term="http://purl.org/dc/terms/identifier"/>
   <field index="2" term="http://purl.org/dc/terms/type"/>
   <field index="3" term="http://purl.org/dc/terms/format"/>
   <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
   <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
   <field index="7" term="http://purl.org/dc/terms/creator"/>
   <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
   <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
   <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
   <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="13" term="http://purl.org/dc/terms/format"/>


Packaging for images / media objects

Consult iDigBio's media policy https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 and GBIF's while preparing your media.

  • Firstly, adding a field in the occurrence file for associatedMedia is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not
  • Each media record should have a unique (within the dataset) identifier in the identifier field.
  • If providing media records with specimen data records, here are the important fields to fill in
    • sample of fully-populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations)
      • id (coreid) = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core.
        UUID GOES HERE
        urn:catalog:institutionCode:collectionCode:catalogNumber
      • identifier (dcterms:identifier or dc:identifier) = ID of the media record. It needs to be unique within the Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, so a URL is not a persistent identifier.
        UUID GOES HERE
        URL goes here
      • type (dcterms:type) = ....
        StillImage
      • format (dc:format) = Media Type / MIME Type (from the http://www.iana.org/assignments/media-types/media-types.xhtml controlled vocabulary if possible)
        image/jpeg
      • accessURI (ac:accessURI) = direct HTTP link to the media file. Note that the media type (format) must match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page.
        http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG
      • providerManagedID (ac:providerManagedID) = If you have a UUID (GUID) for your media records, assign it to the optional ac:providerManagedID field.
        urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. If ac:accessURI is not present, dcterms:format and dc:type should not be present either. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI.
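The consistency rules in the note can be sketched as a small record check. This is a rough illustration under several assumptions: records are dictionaries keyed by <code>accessURI</code>, <code>format</code>, and <code>type</code>; the mapping from media type to dc:type is inferred from the sample values StillImage, Sound, and MovingImage; and the file extension is used only as a stand-in for the Content-Type that the server would actually return.

```python
import posixpath
from urllib.parse import urlparse

# Assumed mapping from the main MIME type to the expected dc:type value.
DC_TYPE_BY_MAIN_TYPE = {"image": "StillImage", "audio": "Sound", "video": "MovingImage"}

# Small illustrative extension table; a real check would inspect the
# Content-Type header of the resource behind accessURI instead.
MIME_BY_EXTENSION = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png",
    ".tif": "image/tiff", ".tiff": "image/tiff",
    ".html": "text/html", ".htm": "text/html",
}


def check_media_record(record):
    """Return a list of problems found in one media record (empty if none)."""
    problems = []
    access_uri = record.get("accessURI")
    fmt = record.get("format")
    dc_type = record.get("type")
    if not access_uri:
        if fmt or dc_type:
            problems.append("format/type present without accessURI")
        return problems
    if fmt:
        expected_type = DC_TYPE_BY_MAIN_TYPE.get(fmt.split("/")[0])
        if dc_type and expected_type and dc_type != expected_type:
            problems.append(f"type {dc_type!r} does not match format {fmt!r}")
        extension = posixpath.splitext(urlparse(access_uri).path)[1].lower()
        guessed = MIME_BY_EXTENSION.get(extension)
        if guessed and guessed != fmt:
            problems.append(f"accessURI looks like {guessed!r}, not {fmt!r}")
    return problems


good = {
    "accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG",
    "format": "image/jpeg",
    "type": "StillImage",
}
print(check_media_record(good))
```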


Here are further recommended fields to fill in:

{| class="wikitable" border="1"
|-
! AC Term
! Sample data
! Notes
|-
| ac:associatedSpecimenReference
| 0e1e12ed-2261-42db-8719-ee98532dab06
| A reference to a specimen associated with this resource.
|-
| dc:rights or dcterms:rights
| dc:rights - "CC BY-NC"<br>dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/
| preferred - dcterms:rights
|-
| ac:licenseLogoURL
| http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png
|
|-
| xmpRights:Owner
| New York Botanical Garden
| A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
|-
| dc:creator
| "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden"
| The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
|-
| dc:type
| StillImage, Sound, MovingImage
|
|-
| dcterms:title
| herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes
|
|}
  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one row for each media record. The more detail you provide about the image, the easier it will be to retrieve. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
  • License: As with catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution regarding the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is fine with us. ND (no derivatives) is not acceptable and will cause the media to be rejected.

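Possible licenses:

  • CC0: http://creativecommons.org/publicdomain/zero/1.0/
  • CC BY: http://creativecommons.org/licenses/by/4.0/
  • CC BY-NC: https://creativecommons.org/licenses/by-nc/4.0/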

If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. No eml.xml is required; contact info and the recordset description can be sent by email.

Best practice for getting Audubon Core images linked to specimen records - special cases

{| class="wikitable unsortable" border="1"
|-
! Relationship
! Supported by
! Core Type
! Extensions
|-
| One-specimen-record-to-many-media files
| IPT 2.1/Custom DwC-A
| Specimen (DwC)
| Audubon Core
|-
| Many-specimen-records-to-one-media file
| IPT 2.2/Custom DwC-A
| Audubon Core
| Specimen (DwC)
|-
| Many-specimen-records-to-many-media files
| IPT 2.1/Custom DwC-A
| Specimen (DwC)
| Audubon Core + Relationship
|}

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier