Talk:Data Ingestion Guidance: Difference between revisions

From iDigBio
--[[User:Dpaul|Dpaul]] 16:49, 9 January 2014 (EST)
#Regarding data@idigbio.org: now that dp has a gator link, does this mean I can be added to the mailing list that "sees" data@idigbio.org?--[[User:Dpaul|Dpaul]]
#Where do users send email if they have a question?
##data@idigbio.org
##the feedback (or both)?--[[User:Dpaul|Dpaul]] ([[User talk:Dpaul|talk]]) 17:48, 9 January 2014 (EST)
#At Morphbank, to keep this straightforward, all email requests for help go to morphbank@scs.fsu.edu
##We do not have 2 separate paths for help / issues requests.
##I am assuming (I think) that clicking to send "feedback" generates a Redmine ticket (efficient and transparent). But, data@idigbio.org is not transparent.--[[User:Dpaul|Dpaul]]
#About this Section: Registering Your Collection in Preparation for Data Ingestion
##I suggest a different order. See next.--[[User:Dpaul|Dpaul]]


When your data are ready for ingestion, please see the next steps.


#Get an iDigBio account for yourself (if you don't have one yet). https://www.idigbio.org/auth/login.php
##These are the only login credentials you will need.
#Log in with your iDigBio account username and password. https://www.idigbio.org/auth/login.php
#Register your collection. http://portal.idigbio.org/register OR
#Register your collection at [http://grbio.org/ GRBIO]
##Repository: http://grbio.org/find-biorepositories OR
##Institutional Collections: http://grbio.org/find-institutional-collections
#If you are already on the portal page, the 'Register A Collection' is in the menu under your login name in the upper right of the page.
----
#About this next section: Data Requirements
#I would avoid the word <strike>ownership</strike>, if at all possible to help the community get around this issue (eventually). This reinforces ideas / misconceptions, and adds to confusion about data, media (and copyright, and intellectual property, etc). Something like this for number 2...
##You have permission to contribute this dataset to iDigBio.
#for number 3. do we need to explain or justify? how about
::Data Format choices
:::[http://code.google.com/p/gbif-ecat/wiki/DwCArchive DarwinCore archive format] OR
:::CSV files mapped to [http://rs.tdwg.org/dwc/terms/index.htm Darwin Core] (and other relevant standards, example [http://terms.tdwg.org/wiki/Audubon_Core_Term_List Audubon Core])
::Data Transfer
:::Darwin Core Archive files harvest via IPT and RSS
:::CSV files via (...)
#for number 4, please add UTF-8 reference. something like:
:UTF-8 encoding preferred (should be required).
::validate (or verify) that "special characters" (diacritics like umlauts, tilde, cedilla) are correct in your dataset.
----
See Morphbank http://www.morphbank.net/About/Manual/imagePhilosophy.php for how we worded the "permissions" issue (revolving around images).
----
#These following links (to me) are not Data Requirements. (at the bottom of the page in review). They are '''Image/Media Issues''' or '''Image/Media Guidance'''
##Additional info about image format is here: https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations
##If you need to learn about acceptable Creative Commons licenses in iDigBio: https://www.idigbio.org/content/idigbio-intellectual-property-policy
#Next, General Information (nothing to do with Data Requirements, etc.)
##If you are contemplating writing a proposal (e.g., to NSF) and want to coordinate your data with iDigBio: https://www.idigbio.org/content/collaborating-idigbio-grant-proposals
##If you are brand new to iDigBio and looking for some entry-level info about the project, try here:

TO DO:

Add links to the term definitions when they are mentioned.

e.g.

[http://purl.org/dc/terms/identifier dc:identifier] for "dc:identifier"

Some DRAFT changes...

<pre>
  <coreid index="0" />
  <field index="1" term="http://purl.org/dc/terms/identifier"/>
  <field index="2" term="http://purl.org/dc/terms/type"/>
  <field index="3" term="http://purl.org/dc/terms/format"/>
  <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
  <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
  <field index="7" term="http://purl.org/dc/terms/creator"/>
  <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
  <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
  <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
  <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="13" term="http://purl.org/dc/terms/format"/>
</pre>
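The draft field list on this page (the coreid/field entries) reuses indexes 6, 7 and 8 and lists two terms twice. Since this is still a draft, it may help to check a fragment mechanically before it is copied into a real meta.xml. The following is only a sketch: the wrapping <code>extension</code> element and the function name <code>find_duplicates</code> are assumptions made so the fragment parses on its own, not part of the draft.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Field list copied from the draft on this page. The <extension> wrapper is
# hypothetical; it only exists so the fragment is a single XML document.
DRAFT = """<extension>
  <coreid index="0" />
  <field index="1" term="http://purl.org/dc/terms/identifier"/>
  <field index="2" term="http://purl.org/dc/terms/type"/>
  <field index="3" term="http://purl.org/dc/terms/format"/>
  <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
  <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
  <field index="7" term="http://purl.org/dc/terms/creator"/>
  <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
  <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
  <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
  <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="13" term="http://purl.org/dc/terms/format"/>
</extension>"""


def find_duplicates(xml_text):
    """Return (duplicated column indexes, duplicated terms) for a field list."""
    root = ET.fromstring(xml_text)
    entries = [(el.get("index"), el.get("term"))
               for el in root if el.tag in ("coreid", "field")]
    index_counts = Counter(index for index, _ in entries)
    term_counts = Counter(term for _, term in entries if term)
    dup_indexes = sorted((i for i, n in index_counts.items() if n > 1), key=int)
    dup_terms = sorted(t for t, n in term_counts.items() if n > 1)
    return dup_indexes, dup_terms


if __name__ == "__main__":
    indexes, terms = find_duplicates(DRAFT)
    print("duplicate indexes:", indexes)
    print("duplicate terms:", terms)
```

Which of the repeated entries should be renumbered or dropped is a decision for whoever finalizes the draft; the check only points at the conflicts.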
==Packaging for images / media objects==
Consult iDigBio's media policy https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 and GBIF's while preparing your media.
*Adding an ''associatedMedia'' field to the occurrence file is not the way to include media with a specimen record. Media that come to us via this method, or embedded in a webpage, will not
*Each media record should have an identifier in the ''identifier'' field that is unique within the dataset.
*If providing media records with specimen data records, these are the important fields to fill in:
** Sample of a fully populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations):
***'''id (coreid)''' = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. <pre>UUID GOES HERE</pre><pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
***'''identifier  (dcterms:identifier or dc:identifier)''' = id of the media record - needs to be unique within Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.<pre>UUID GOES HERE</pre> <pre>URL goes here</pre>
***'''type  (dcterms:type)''' = .... <pre>StillImage</pre>
***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
***'''accessURI (ac:accessURI)''' = direct HTTP link to the media file. The media type (format) '''must''' match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG</pre>
***'''providerManagedID (ac:providerManagedID)''' = if you have a UUID (GUID) for your media records, assign it to the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4  (optional)</pre>
Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI: if ac:accessURI is not present, dcterms:format and dc:type should not be present either.
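The field rules above can be sketched as a small pre-submission check. This is an illustration only, not iDigBio's validation: the function name <code>check_media_record</code>, the dictionary keys, and the extension lookup table are all assumptions, and guessing the media type from the file extension is a stand-in for actually requesting the resource and reading its Content-Type.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical lookup table; extend it for the media you actually serve.
EXTENSION_TO_MEDIA_TYPE = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".mp3": "audio/mpeg",
    ".mp4": "video/mp4",
}


def check_media_record(record):
    """Return a list of problems found in one media record (a dict)."""
    problems = []

    # A URL used as the identifier stops being persistent when media move.
    if urlparse(record.get("identifier", "")).scheme in ("http", "https"):
        problems.append("identifier is a URL and may not be persistent")

    access_uri = record.get("accessURI", "")
    declared = record.get("format", "")
    if not access_uri:
        # Without accessURI, format and type should be absent as well.
        if declared or record.get("type"):
            problems.append("format/type present without accessURI")
        return problems

    # Heuristic: infer the media type from the extension in the URL path.
    extension = splitext(urlparse(access_uri).path)[1].lower()
    expected = EXTENSION_TO_MEDIA_TYPE.get(extension)
    if expected is None:
        problems.append("accessURI does not look like a direct link to a media file")
    elif expected != declared:
        problems.append(f"format {declared!r} does not match accessURI ({expected!r})")
    return problems


if __name__ == "__main__":
    record = {
        "identifier": "urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4",
        "type": "StillImage",
        "format": "image/jpeg",
        "accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG",
    }
    print(check_media_record(record))
```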
 
 
 
Here are further recommended fields to fill in:
{| class="wikitable" border="1"
|-
! scope="col" width="15%" | AC  Term
! scope="col" width="45%" class="sortable"| Sample data
! scope="col" width="45%" class="sortable"| Notes
|-
|valign="top"|ac:associatedSpecimenReference
|valign="top"|0e1e12ed-2261-42db-8719-ee98532dab06
|valign="top"|A reference to a specimen associated with this resource.
|-
|valign="top"|dc:rights or dcterms:rights
|valign="top"|dc:rights -  "CC BY-NC"<br>
dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/
|valign="top"|preferred - dcterms:rights
|-
|valign="top"|ac:licenseLogoURL
|valign="top"|http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png
|valign="top"|
|-
|valign="top"|xmpRights:Owner
|valign="top"|New York Botanical Garden
|valign="top"|A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
|-
|valign="top"|dc:creator
|valign="top"|"New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden"
|valign="top"|The person or organization responsible for creating the media resource. This might be less encompassing than what is in xmpRights:Owner.
|-
|valign="top"|dc:type
|valign="top"| StillImage, Sound, MovingImage
|valign="top"|
|-
|valign="top"|dcterms:title
|valign="top"|herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes
|valign="top"|
|-
|}
*'''Note to aggregators''': In the case where the data are coming from an aggregator, an additional ''recordId'' field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
*'''Terms''': Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media object. The more fully you describe the image, the more retrievable it will be. When only media are given to iDigBio, the best practice is to use the taxonomic and geographic fields to capture as much information as possible.
*'''License''': As with catalog records, media records must be provided freely and with permission, and each record should carry a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution to the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is fine with us; ND (no derivatives) is not acceptable and will cause the media to be rejected.
Possible licenses:
* CC0: http://creativecommons.org/publicdomain/zero/1.0/
* CC BY: http://creativecommons.org/licenses/by/4.0/
* CC BY-NC: https://creativecommons.org/licenses/by-nc/4.0/
*The media records represent a one-to-one relationship between the media object (the fit-for-display best quality JPG, in the case of images, for example) and the specimen record. There is no need to include links to any other forms of the media, like an enclosing webpage, or thumbnails. Below is some guidance on handling special cases. If none of these media attachment rules make sense to you, please get in touch with us for further assistance.
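The license rule above (CC0 or any combination of BY, NC and SA is accepted, ND is rejected) can be checked against the license URL in dcterms:rights. The sketch below is a guess at how such a check might look, not iDigBio's actual ingestion logic; the function name <code>license_is_acceptable</code> is hypothetical, and it assumes licenses are given as creativecommons.org URLs of the form shown in the list above.

```python
from urllib.parse import urlparse

ALLOWED_ELEMENTS = {"by", "nc", "sa"}


def license_is_acceptable(license_url):
    """True for CC0 or a CC license built only from BY, NC and SA."""
    parts = urlparse(license_url)
    if parts.netloc.lower() not in ("creativecommons.org", "www.creativecommons.org"):
        return False
    segments = [s for s in parts.path.lower().split("/") if s]
    # CC0 lives under /publicdomain/zero/<version>/
    if segments[:2] == ["publicdomain", "zero"]:
        return True
    # Other licenses live under /licenses/<elements>/<version>/
    if len(segments) >= 2 and segments[0] == "licenses":
        elements = set(segments[1].split("-"))
        return elements <= ALLOWED_ELEMENTS
    return False


if __name__ == "__main__":
    for url in (
        "http://creativecommons.org/publicdomain/zero/1.0/",
        "http://creativecommons.org/licenses/by/4.0/",
        "https://creativecommons.org/licenses/by-nc/4.0/",
        "https://creativecommons.org/licenses/by-nc-nd/4.0/",
    ):
        print(url, license_is_acceptable(url))
```

A free-text value such as "CC BY-NC" is not a URL and is reported as not acceptable by this sketch, which is one reason the table above prefers dcterms:rights with a license URL.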
If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package the files in a DwC-A-like format. No eml.xml is required; contact info and the recordset description can be sent by email.
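For those building the package by hand, the following sketch writes a minimal meta.xml with a specimen core and an Audubon Core extension, then zips it together with the data files. Everything specific here is an assumption to adapt: the file names <code>occurrence.csv</code> and <code>multimedia.csv</code>, the CSV settings (comma-separated, one header line), the rowType URIs, and the helper names. Check the result against the Darwin Core text guide before sending it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def add_table(archive, tag, id_tag, row_type, filename, terms):
    """Describe one CSV file; column 0 holds the id (core) or coreid (extension)."""
    table = ET.SubElement(archive, tag, {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # written literally as \n in meta.xml
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": row_type,
    })
    files = ET.SubElement(table, "files")
    ET.SubElement(files, "location").text = filename
    ET.SubElement(table, id_tag, {"index": "0"})
    for index, term in enumerate(terms, start=1):
        ET.SubElement(table, "field", {"index": str(index), "term": term})


def build_meta_xml(core_terms, media_terms):
    """Return meta.xml text for a specimen core plus an Audubon Core extension."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS})
    add_table(archive, "core", "id",
              "http://rs.tdwg.org/dwc/terms/Occurrence",
              "occurrence.csv", core_terms)
    add_table(archive, "extension", "coreid",
              "http://rs.tdwg.org/ac/terms/Multimedia",
              "multimedia.csv", media_terms)
    return ET.tostring(archive, encoding="unicode")


def package(files):
    """Zip a {file name: text} mapping in memory and return the bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    meta = build_meta_xml(
        ["http://rs.tdwg.org/dwc/terms/catalogNumber"],
        ["http://purl.org/dc/terms/identifier",
         "http://purl.org/dc/terms/format",
         "http://rs.tdwg.org/ac/terms/accessURI"],
    )
    data = package({
        "meta.xml": meta,
        "occurrence.csv": "id,catalogNumber\n",
        "multimedia.csv": "coreid,identifier,format,accessURI\n",
    })
    with open("recordset.zip", "wb") as handle:
        handle.write(data)
```

Generating the field indexes from the column order, as <code>enumerate</code> does here, avoids numbering the columns by hand.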
 
===Best practice for getting Audubon Core images linked to specimen records - special cases===
 
{| class="wikitable unsortable" border="1"
|-
! scope="col" width="25%" class="unsortable" | Relationship
! scope="col" width="25%" class="unsortable" | Supported by
! scope="col" width="25%" class="unsortable" | Core Type
! scope="col" width="25%" class="unsortable" | Extensions
|-
|valign="top"|One-specimen-record-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core
|-
|valign="top"|Many-specimen-records-to-one-media file
|valign="top"|IPT 2.2/Custom DwC-A
|valign="top"|Audubon Core
|valign="top"|Specimen (DwC)
|-
|valign="top"|Many-specimen-records-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core + Relationship
|-
|}
 
Keep in mind that:
* DwC-A is a set of files: a core type + a number of extensions
* All files/tables (core or extension) need to have a unique identifier
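Both points can be verified before packaging. The sketch below assumes the one-specimen-record-to-many-media layout from the table (specimen core, Audubon Core extension) stored as CSV text with header rows; the column names <code>id</code>, <code>coreid</code> and <code>identifier</code> and the two helper functions are illustrative choices, not fixed by Darwin Core.

```python
import csv
import io
from collections import Counter


def duplicate_values(csv_text, column):
    """Return the values that occur more than once in the given column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphaned_coreids(core_csv, core_id_column, ext_csv, coreid_column):
    """Return extension coreid values that match no row in the core file."""
    core_ids = {row[core_id_column]
                for row in csv.DictReader(io.StringIO(core_csv))}
    ext_ids = {row[coreid_column]
               for row in csv.DictReader(io.StringIO(ext_csv))}
    return sorted(ext_ids - core_ids)


if __name__ == "__main__":
    occurrences = "id,catalogNumber\nocc-1,100\nocc-2,101\n"
    media = (
        "coreid,identifier,accessURI\n"
        "occ-1,m-1,http://example.org/a.jpg\n"
        "occ-1,m-2,http://example.org/b.jpg\n"
    )
    print(duplicate_values(occurrences, "id"))
    print(duplicate_values(media, "identifier"))
    print(orphaned_coreids(occurrences, "id", media, "coreid"))
```

Repeated coreid values in the extension are expected in this layout (several media files for one specimen), so only the media identifier column is checked for uniqueness.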

Latest revision as of 16:10, 2 June 2017
