CYWG iDigBio DwC-A Pull Ingestion


iDigBio DwC-A Pull Ingestion Through RSS feed

iDigBio can ingest a provider's data semi-automatically through a Really Simple Syndication (RSS) feed. The feed carries metadata about each dataset along with links to the dataset files.

Accepted dataset formats are:

  • Comma-separated value file or "CSV",
  • Zipped single file CSV (.csv.zip) or "CSV-ZIP", and
  • DwC-A (Occurrence as a core and Audubon-Core extension are currently handled) or "DWCA".

Once the feed or files are in place, e-mail the link to the RSS feed (or to the individual files) to iDigBio.

To facilitate creation of a custom RSS feed containing the fields "title", "id", "type", "recordtype", "description", "link", "ipt:eml", and "pubDate", iDigBio makes a simple PHP script available on GitHub. It generates output like the following:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
	<channel>
		<title>iDigBio Feeder RSS Feed</title>
		<link>http://feeder.idigbio.org/rss.php</link>
		<description>RSS Feed for iDigBio CSV Datasets.</description>
		<language>en-us</language>
		<item>
			<title>Archbold Biological Station</title>
			<id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
			<type>CSV</type>
			<recordtype>occurrence</recordtype>
			<description>Example of CSV dataset only with specimens.</description>
			<link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
			<ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
			<pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
		</item>
		<item>
			<title>Invertnet Images</title>
			<id>http://feeder.idigbio.org/datasets/idigbio-invertnet</id>
			<type>CSV</type>
			<recordtype>multimedia</recordtype>
			<description>Example of a CSV dataset only with images.</description>
			<link>http://feeder.idigbio.org/datasets/idigbio-invertnet.csv</link>
			<ipt:eml>http://feeder.idigbio.org/eml/idigbio-invertnet.xml</ipt:eml>
			<pubDate>Fri, 18 Apr 2014 10:16:42 -0400</pubDate>
		</item>
		<item>
			<title>ASU-ASUHIC DwC-Archive</title>
			<id>98d9b8ed-08d6-47fc-b324-2853e44d75d1</id>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<image>http://symbiota4.acis.ufl.edu/scan/portal/images/collicons/asu.jpg</image>
			<description>Darwin Core Archive for Arizona State University Hasbrouck Insect Collection</description>
			<link>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.zip</link>
			<ipt:eml>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.eml</ipt:eml>
			<pubDate>Wed, 14 May 2014 09:58:23</pubDate>
		</item>
		<item>
			<title>Test Set ZIP</title>
			<id>http://localhost/datasets/test.csv.zip</id>
			<type>CSV-ZIP</type>
			<description>A Test Dataset</description>
			<link>datasets/test.csv.zip</link>
			<pubDate>Thu, 15 Nov 2012 14:29:45 -0500</pubDate>
		</item>
	</channel>
</rss>
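
As a rough illustration of what such a generator does, the sketch below builds a Feeder-style feed in Python rather than PHP. It is not the iDigBio script: the function name build_feed and the dictionary-based input are hypothetical, and only the element names and the ipt namespace are taken from the sample above.

```python
import xml.etree.ElementTree as ET

IPT_NS = "http://ipt.gbif.org/"
# Ask ElementTree to serialize this namespace with the "ipt" prefix.
ET.register_namespace("ipt", IPT_NS)

# Feeder-style item fields, in the order used in the sample output above.
ITEM_FIELDS = ["title", "id", "type", "recordtype",
               "description", "link", "ipt:eml", "pubDate"]


def build_feed(channel, datasets):
    """Return an RSS 2.0 document (as a string) for a list of dataset dicts.

    `channel` and each entry of `datasets` are plain dicts keyed by field
    name; this input shape is an assumption made for the example.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    chan = ET.SubElement(rss, "channel")
    for name in ("title", "link", "description", "language"):
        ET.SubElement(chan, name).text = channel[name]
    for dataset in datasets:
        item = ET.SubElement(chan, "item")
        for field in ITEM_FIELDS:
            if field not in dataset:
                continue  # fields not supplied are simply left out
            # Map the "ipt:" prefix onto the qualified name ElementTree expects.
            tag = "{%s}eml" % IPT_NS if field == "ipt:eml" else field
            ET.SubElement(item, tag).text = dataset[field]
    return ET.tostring(rss, encoding="unicode")
```

Building the feed through an XML library, rather than by string concatenation, means that characters such as "&" or "<" in titles and descriptions are escaped automatically.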

Choosing how to provide RSS

An arbitrary data provider should adopt one of the existing publishing platforms:

  1. A provider who already has their data in Symbiota, or who would otherwise benefit from joining one of the Symbiota portals, should use Symbiota's RSS publishing mechanism.
  2. Intermediate users with the ability to run a server can run GBIF's IPT software which includes RSS publishing capability.
  3. Novice users, or users who simply have no server access, can serve datasets via the iDigBio RSS Feeder or via an existing IPT installation (e.g., VertNet, or iDigBio). Data mobilizers are available to help bring your data on board.
  4. An expert user with the ability to run a server and an existing DwC-A/CSV generation mechanism, who is comfortable handling both XML and character encoding issues, can produce a custom RSS feed. The iDigBio RSS Feeder codebase is available to use as a template. For example, such a user could take just the rss.php file and replace the CSV config files with database calls.

Requirements

Software providers who wish to build a new data publishing mechanism should follow the publishing feed pattern of one of the four existing well-known feed generators. The feed should comply with either the RSS 2.0 or Atom 1.0 publishing specifications. The common fields for each type (IPT, Symbiota, Feeder, or EasyCapture) are listed below. This ensures that RSS feed consumers can properly parse out the needed fields to access the data files and metadata about the datasets being provided.

Publishing Feed - field requirements at the feed level
Field | Channel Description | Requirement | Example value
title | human-friendly title / name of publishing feed | MUST | Arthropod Easy Capture (AMNH)
description | human-friendly description of publisher or publishing feed | SHOULD | Arthropod Easy Capture rss feed


Publishing Feed - field requirements for each published dataset (each "item" in the feed)
Dataset field | Dataset field description | Requirement | Example value
Title | Human-friendly name of dataset | SHOULD | Fossil collection (F) of the Muséum national d'Histoire naturelle (MNHN - Paris)
Description | Human-friendly description of dataset | SHOULD | This database PALAEO concerns the fossil collections of the Muséum national d'Histoire naturelle of Paris. It comprises information on all groups of animals, plants and microfossils, and on several preparations for anatomy / histology studies, and includes also ichnology records. These historical collections gathered from the beginning of the 18th century with the birth of palaeontology...
Dataset ID | Unique identifier for the dataset in the feed | MUST | b275a4c1-9859-4f3c-8ead-d86dde820fbc
Data File Link | Link to DwC archive | MUST | http://collections.mnhn.fr/ipt/archive.do?r=mnhn-f
Metadata Link | Link to .eml | MUST | http://collections.mnhn.fr/ipt/eml.do?r=mnhn-f
Publication Date | Dataset publication date in RSS-compatible format | MUST | Thu, 04 Jun 2015 17:59:33 +0200
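
The "RSS-compatible format" for the publication date is the RFC 822 style shown in the example value. A minimal sketch of producing it with the Python standard library follows; the helper name rss_pub_date is hypothetical, and rejecting datetimes without a timezone is a choice made for this example rather than a stated iDigBio rule.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def rss_pub_date(moment):
    """Format a timezone-aware datetime as an RFC 822 style pubDate string."""
    if moment.tzinfo is None:
        # A naive datetime would be rendered with a "-0000" offset, which
        # hides the real timezone, so this sketch refuses it outright.
        raise ValueError("pubDate needs a timezone-aware datetime")
    return format_datetime(moment)


# The example value from the table, rebuilt from its components.
paris_summer = timezone(timedelta(hours=2))
print(rss_pub_date(datetime(2015, 6, 4, 17, 59, 33, tzinfo=paris_summer)))
# prints: Thu, 04 Jun 2015 17:59:33 +0200
```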


Publishing feed XML field names by publisher type
Feed Type | IPT | Symbiota | Feeder | EasyCapture
Title | title | title | title | title
Description | description | description | description | description
Dataset ID | guid | guid | id | guid
Data File Link | ipt:dwca | link | link | link
Metadata Link | ipt:eml | emllink | ipt:eml | emllink
Publication Date | pubDate | pubDate | pubDate | pubDate
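
A feed consumer can turn the table above into a lookup keyed by feed type. The following is a sketch only, assuming the consumer already knows which publisher type it is reading; the names FIELD_TAGS, COMMON_TAGS, and read_items are hypothetical and do not describe iDigBio's own ingestion code.

```python
import xml.etree.ElementTree as ET

IPT = "{http://ipt.gbif.org/}"  # ElementTree's notation for the ipt: prefix

# Element names that differ by publisher type, copied from the table above.
FIELD_TAGS = {
    "ipt": {"dataset_id": "guid", "data_link": IPT + "dwca", "metadata_link": IPT + "eml"},
    "symbiota": {"dataset_id": "guid", "data_link": "link", "metadata_link": "emllink"},
    "feeder": {"dataset_id": "id", "data_link": "link", "metadata_link": IPT + "eml"},
    "easycapture": {"dataset_id": "guid", "data_link": "link", "metadata_link": "emllink"},
}
# Element names shared by all four publisher types.
COMMON_TAGS = {"title": "title", "description": "description", "pub_date": "pubDate"}


def read_items(feed_xml, feed_type):
    """Return one dict per <item>, using the element names for feed_type."""
    tags = {**COMMON_TAGS, **FIELD_TAGS[feed_type]}
    items = []
    for item in ET.fromstring(feed_xml).iter("item"):
        record = {}
        for key, tag in tags.items():
            # findtext returns the first match only, so an item with two
            # <guid> elements yields the first one. Values are stripped
            # because some feeds put them on their own indented lines.
            text = item.findtext(tag)
            record[key] = text.strip() if text else None
        items.append(record)
    return items
```

Missing or empty elements come back as None, which leaves it to the caller to decide whether a MUST field is absent.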



The publishing feed must include all of the following pieces of information for each dataset:

  1. a guid for the dataset feed entry (the feed URL plus a unique item identifier is sufficient)
  2. a link to the Darwin Core Archive dataset file which contains the actual occurrence records and any DwC extensions
  3. a link to the EML file which contains metadata about the dataset contents and the source collection
  4. a publication date for the most recent date of the dataset update
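
Before e-mailing a feed link, a provider might check each item against this list. The sketch below does so with Python's standard library; the function name missing_fields is hypothetical, and the accepted element names are taken from the field-name table above. It checks only that a value is present, not that the URL resolves or that the date is well formed.

```python
import xml.etree.ElementTree as ET

IPT = "{http://ipt.gbif.org/}"  # ElementTree's notation for the ipt: prefix

# For each required piece of information, the element names that can carry it.
# IPT-specific names come first: in an IPT feed, <link> points at the resource
# page, while <ipt:dwca> points at the archive itself.
REQUIRED = {
    "dataset identifier": ("guid", "id"),
    "data file link": (IPT + "dwca", "link"),
    "EML metadata link": (IPT + "eml", "emllink"),
    "publication date": ("pubDate",),
}


def missing_fields(item):
    """Return the labels of required fields that an <item> element lacks."""
    missing = []
    for label, tags in REQUIRED.items():
        # A field counts as present if any candidate element has non-blank text.
        if not any((item.findtext(tag) or "").strip() for tag in tags):
            missing.append(label)
    return missing


def check_feed(feed_xml):
    """Map each item's title to its missing fields, omitting complete items."""
    report = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        problems = missing_fields(item)
        if problems:
            title = (item.findtext("title") or "(untitled)").strip()
            report[title] = problems
    return report
```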

Examples of RSS feeds from various systems

RSS generated by iDigBio Feeder (http://feeder.idigbio.org/rss.php)

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:ipt="http://ipt.gbif.org/" version="2.0">
  <channel>
    <title>iDigBio Feeder RSS Feed</title>
    <link>http://feeder.idigbio.org/rss.php</link>
    <description>RSS Feed for iDigBio CSV Datasets.</description>
    <language>en-us</language>
    <item>
      <title>Archbold Biological Station</title>
      <id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <description/>
      <link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
      <pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
    </item>
    <item>
      <title>Carnegie Museum of Natural History Vertebrate Paleontology</title>
      <id>http://feeder.idigbio.org/datasets/Carnegie_VertPaleo</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <description/>
      <link>http://feeder.idigbio.org/datasets/Carnegie_VertPaleo.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/Carnegie_VertPaleo.xml</ipt:eml>
      <pubDate>Tue, 01 Jul 2014 10:34:40 -0400</pubDate>
    </item>
  </channel>
</rss>

RSS generated by IPT (http://hymfiles.biosci.ohio-state.edu:8080/ipt/rss.do)

<?xml version="1.0"?>
<rss version="2.0" 
	xmlns:foaf="http://xmlns.com/foaf/0.1/" 
	xmlns:ipt="http://ipt.gbif.org/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <title>xBioD IPT in the Museum of Biological Diversity at the Ohio State University</title>
    <link>http://xbiod.osu.edu/ipt</link>
    <description>Resource metadata of xBioD IPT in the Museum of Biological Diversity at the Ohio State University</description>
    <language>en-us</language>
    <!-- RFC-822 date-time  / Wed, 02 Oct 2010 13:00:00 GMT -->
      <pubDate>Mon, 16 Dec 2013 09:51:00 -0500</pubDate>
      <lastBuildDate>Fri, 05 Jun 2015 17:07:07 -0400</lastBuildDate>
    <generator>GBIF IPT 2.1.1-r4640</generator>
      <webMaster>cora.1@osu.edu () ()</webMaster>
    <docs>http://cyber.law.harvard.edu/rss/rss.html</docs>
    <ttl>15</ttl>
      <geo:Point>
        <geo:lat>39.9971388</geo:lat>
        <geo:long>-83.0439822</geo:long>
      </geo:Point>
      <item>
        <title>C.A. Triplehorn Insect Collection (OSUC), Ohio State University - Version 82</title>
        <link>http://xbiod.osu.edu/ipt/resource.do?r=osuc</link>
        <description>Vouchered occurrence records for insects from the C.A. Triplehorn Insect Collection at the Ohio State University. &lt;a href="http://xbiod.osu.edu/ipt/eml.do?r=osuc"&gt;EML&lt;/a&gt;</description>
        <author>cora.1@osu.edu</author>
          <ipt:eml>http://xbiod.osu.edu/ipt/eml.do?r=osuc</ipt:eml>
	        <dc:publisher>Norman Johnson Ohio State University&lt;johnson.2@osu.edu&gt;</dc:publisher>
	        <dc:creator>Norman Johnson Ohio State University&lt;johnson.2@osu.edu&gt;</dc:creator>
            <ipt:dwca>http://xbiod.osu.edu/ipt/archive.do?r=osuc</ipt:dwca>
          <pubDate>Fri, 05 Jun 2015 17:10:38 -0400</pubDate>
          <guid isPermaLink="false">84ab7b76-f762-11e1-a439-00145eb45e9a/v82</guid>
      </item>
  </channel>
</rss>

RSS generated by Symbiota (http://portal.neherbaria.org/portal/webservices/dwc/rss.xml)

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>CNH portal Darwin Core Archive rss feed</title>
		<link>http://portal.neherbaria.org/portal/</link>
		<description>CNH portal Darwin Core Archive rss feed</description>
		<language>en-us</language>
		<item collid="27">
			<title>Harvard University-A DwC-Archive</title>
			<image>http://www.huh.harvard.edu/images/huh_logo_bw_100.png</image>
			<description>Darwin Core Archive for Herbarium of the Arnold Arboretum (Harvard University Herbaria)</description>
			<guid>http://portal.neherbaria.org/portal/collections/misc/collprofiles.php?collid=27</guid>
			<guid>80b71fde-2241-4777-bfd3-3bdd075b8ba5</guid>
			<emllink>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.eml</emllink>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<link>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.zip</link>
			<pubDate>Thu, 17 Apr 2014 11:49:03</pubDate>
		</item>
	</channel>
</rss>

RSS generated by Arthropod EasyCapture (http://www.amnh.begoniasociety.org/dwc/rss.xml)

<rss version="2.0">
	<channel>
		<title>Arthropod Easy Capture (AMNH)</title>
		<link>https://research.amnh.org/pbi/locality/</link>
		<description>Arthropod Easy Capture rss feed</description>
		<language>en-us</language>		
		<item ProjUID="2">
			<title>
				Plants, herbivores, and parasitoids: A model system for the study of tri-trophic associations project
			</title>
			<description>
				Tri-Trophic Thematic Collection Network, 2014 (and updates). Version: 18 Mar 2015. http://tcn.amnh.org/. National Science Foundation grant(s) EF#1115081, EF#1115103, EF#1115080, EF#1115144, EF#1115191, EF#1115104, EF#1115115
			</description>
			<guid>
				urn:uuid:f0cec69a-853c-11e4-8259-0026552be7ea
			</guid>
			<emllink>
				http://www.amnh.begoniasociety.org/dwc/AEC-TTD-TCN_DwC-A20150318.eml
			</emllink>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<link>
				http://www.amnh.begoniasociety.org/dwc/AEC-TTD-TCN_DwC-A20150318.zip
			</link>
			<pubDate>Wed, 18 Mar 2015 14:49:42</pubDate>
		</item>
				
		<item ProjUID="3">
			<title>
				Collaborative databasing of North American bee collections within a global informatics network project
			</title>
			<description>
				Digital Bee Collections Network, 2014 (and updates). Version: 18 Mar 2015. National Science Foundation grant DBI 0956388
			</description>
			<guid>
				urn:uuid:13674fa4-8611-11e4-8259-0026552be7ea
			</guid>
			<emllink>
				http://www.amnh.begoniasociety.org/dwc/AEC-DBCNet_DwC-A20150318.eml
			</emllink>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<link>
				http://www.amnh.begoniasociety.org/dwc/AEC-DBCNet_DwC-A20150318.zip
			</link>
			<pubDate>Wed, 18 Mar 2015 14:50:54</pubDate>
		</item>
		
	</channel>
</rss>

What happens when your RSS is ready?

Once the links are received, an iDigBio IT staff member checks that the URLs work as expected and adds them to the dataset manager. Each week, the dataset manager downloads the datasets and hashes both the datasets and every individual record, pre-validating that IDs are unique within each dataset (trivial collisions). If the dataset is new, all records in the dataset are staged for ingestion. If the dataset is an update, the true difference between the current and new datasets is computed from the hashes, the changes are staged for ingestion and committed to the iDigBio specimen API, and Elasticsearch is reindexed.
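
The uniqueness check and hash-based difference described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset manager's code: the choice of SHA-256 over a sorted JSON serialization, the "id" field name, and the function names are all hypothetical, and how removed records are treated is not specified here.

```python
import hashlib
import json
from collections import Counter


def record_hash(record):
    """Hash one record (a dict) in a way that ignores key order."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def duplicate_ids(records, id_field="id"):
    """Return IDs that occur more than once within a single dataset."""
    counts = Counter(record[id_field] for record in records)
    return sorted(rid for rid, n in counts.items() if n > 1)


def diff_datasets(current, new, id_field="id"):
    """Compare two versions of a dataset by per-record hash."""
    old_hashes = {r[id_field]: record_hash(r) for r in current}
    new_hashes = {r[id_field]: record_hash(r) for r in new}
    shared = old_hashes.keys() & new_hashes.keys()
    return {
        "added": sorted(new_hashes.keys() - old_hashes.keys()),
        "changed": sorted(i for i in shared if old_hashes[i] != new_hashes[i]),
        "removed": sorted(old_hashes.keys() - new_hashes.keys()),
    }
```

Comparing hashes rather than whole records means an unchanged record costs one string comparison per update, and only the "added" and "changed" IDs need to be staged.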


All records in a dataset are expected to have a Globally Unique IDentifier (GUID). Consult our Data Ingestion Guidance for more information on the data requirements. For additional terms not covered by DwC, we recommend consulting the MISC WG about whether an alternative term already exists or a new one needs to be created. A list of terms currently in use can be found at:


Go back to CYWG.