CYWG iDigBio DwC-A Pull Ingestion

From iDigBio
Revision as of 20:41, 20 May 2014 by Ammatsun (talk | contribs)


iDigBio DwC-A Pull Ingestion Through RSS feed

Batches of provider data are ingested into the iDigBio Data Portal v1 API on a pull basis, from either:

  • a single dataset export in CSV format, or
  • a Really Simple Syndication (RSS) feed that links to Darwin Core Archives (DwC-A) or Comma-Separated Values (CSV) files.

Accepted formats are:

  • a comma-separated values file, or "CSV",
  • a zipped single-file CSV (.csv.zip), or "CSV-ZIP", and
  • a DwC-A, or "DWCA" (currently only the Occurrence core and the Audubon Core extension are handled).
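
As a rough illustration of how the "CSV" and "CSV-ZIP" type labels might be handled by a consumer, the Python sketch below reads rows from either form. The function name `read_rows` is hypothetical and is not part of any iDigBio code; UTF-8 encoding is assumed, and DwC-A handling is deliberately left out.

```python
import csv
import io
import zipfile


def read_rows(data: bytes, feed_type: str):
    """Return the rows of a downloaded dataset as a list of dicts.

    `feed_type` is the value of the feed's <type> field ("CSV" or "CSV-ZIP").
    This is an illustrative sketch, not iDigBio's actual ingestion code.
    """
    if feed_type == "CSV":
        text = data.decode("utf-8")
    elif feed_type == "CSV-ZIP":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            # A CSV-ZIP is expected to hold exactly one CSV file.
            if len(names) != 1:
                raise ValueError("expected a single file in the zip")
            text = archive.read(names[0]).decode("utf-8")
    else:
        raise ValueError(f"type not handled by this sketch: {feed_type}")
    # newline="" lets the csv module handle line endings itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

A DwC-A is also a zip file, but it bundles several files, so it needs more than this single-file logic.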

E-mail the link to the RSS feed or file to iDigBio.

To help providers create a custom RSS feed containing the fields "title", "id", "type", "recordtype", "description", "link", "ipt:eml", and "pubDate", iDigBio provides a simple PHP script on GitHub. It generates output like the following:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
	<channel>
		<title>iDigBio Feeder RSS Feed</title>
		<link>http://feeder.idigbio.org/rss.php</link>
		<description>RSS Feed for iDigBio CSV Datasets.</description>
		<language>en-us</language>
		<item>
			<title>Archbold Biological Station</title>
			<id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
			<type>CSV</type>
			<recordtype>occurrence</recordtype>
			<description>Example of CSV dataset only with specimens.</description>
			<link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
			<ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
			<pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
		</item>
		<item>
			<title>Invertnet Images</title>
			<id>http://feeder.idigbio.org/datasets/idigbio-invertnet</id>
			<type>CSV</type>
			<recordtype>multimedia</recordtype>
			<description>Example of a CSV dataset only with images.</description>
			<link>http://feeder.idigbio.org/datasets/idigbio-invertnet.csv</link>
			<ipt:eml>http://feeder.idigbio.org/eml/idigbio-invertnet.xml</ipt:eml>
			<pubDate>Fri, 18 Apr 2014 10:16:42 -0400</pubDate>
		</item>
		<item>
			<title>ASU-ASUHIC DwC-Archive</title>
			<id>98d9b8ed-08d6-47fc-b324-2853e44d75d1</id>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<image>http://symbiota4.acis.ufl.edu/scan/portal/images/collicons/asu.jpg</image>
			<description>Darwin Core Archive for Arizona State University Hasbrouck Insect Collection</description>
			<link>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.zip</link>
			<ipt:eml>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.eml</ipt:eml>
			<pubDate>Wed, 14 May 2014 09:58:23</pubDate>
		</item>
		<item>
			<title>Test Set ZIP</title>
			<id>http://localhost/datasets/test.csv.zip</id>
			<type>CSV-ZIP</type>
			<description>A Test Dataset</description>
			<link>datasets/test.csv.zip</link>
			<pubDate>Thu, 15 Nov 2012 14:29:45 -0500</pubDate>
		</item>
	</channel>
</rss>
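
To make the feed structure concrete, here is a minimal Python sketch of how a consumer might read the fields listed above from such a feed using only the standard library. The function name `parse_feed` and the shortened `SAMPLE` feed are illustrative assumptions; this is not the code iDigBio runs.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:ipt declaration in the feed.
IPT_NS = {"ipt": "http://ipt.gbif.org/"}

# Shortened copy of the feed shown above (two of its items).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
  <channel>
    <title>iDigBio Feeder RSS Feed</title>
    <item>
      <title>Archbold Biological Station</title>
      <id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
      <pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
    </item>
    <item>
      <title>Test Set ZIP</title>
      <id>http://localhost/datasets/test.csv.zip</id>
      <type>CSV-ZIP</type>
      <link>datasets/test.csv.zip</link>
      <pubDate>Thu, 15 Nov 2012 14:29:45 -0500</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one dict per <item>; fields absent from an item are None."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iterfind("./channel/item"):
        items.append({
            "title": item.findtext("title"),
            "id": item.findtext("id"),
            "type": item.findtext("type"),
            "recordtype": item.findtext("recordtype"),
            "link": item.findtext("link"),
            "eml": item.findtext("ipt:eml", namespaces=IPT_NS),
            "pubDate": item.findtext("pubDate"),
        })
    return items


for entry in parse_feed(SAMPLE):
    print(entry["type"], entry["link"])
```

Note that the last item in the feed has no "recordtype" or "ipt:eml" field, so a consumer should tolerate missing values rather than assume every field is present.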

Choosing how to provide RSS

Most data providers should adopt one of the existing publishing platforms, chosen as follows:

  1. A provider who already has their data in Symbiota, or who would otherwise benefit from joining one of the Symbiota portals, should use Symbiota's publishing mechanism.
  2. An expert user with the ability to run a server and an existing DwC-A/CSV generation mechanism, who is comfortable handling both XML and character encoding issues, can adopt/adapt the feeder codebase for their use. For example, this expert could utilize just the rss.php file and replace the CSV config files with database calls.
  3. Intermediate users with the ability to run a server can run IPT.
  4. Novice users, or users who simply have no server access, can serve datasets via our feeder or via an existing IPT (e.g., VertNet). Data mobilizers are available to help bring your data on board.

Requirements

Software providers (Specify, EMu, AECD, SilverBiology) who wish to build a new data publishing mechanism should generally follow the guidance for group 2 above (expert users). At this point, whether the formats are exactly equivalent is immaterial; what matters is that the end result is consistent with our other providers. The only requirement is an RSS (or Atom) feed that contains:

  1. a GUID for the recordset (feed URL + unique item identifier is sufficient),
  2. a link to the recordset file, preferably a Darwin Core Archive, but at minimum a UTF-8 encoded CSV,
  3. a link to metadata about the provider, preferably an EML file, but any metadata format (EML, ABCD, XML, JSON, YAML) will do in a pinch. The link can be implied if necessary (a GUID is provided, and that GUID plus another path are sufficient to retrieve the metadata), and
  4. an update date that can be relied upon.
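
The four requirements above can be sketched as a small check that a provider might run against their own feed items before sending the link. The function `check_item` and the dictionary keys `guid`, `link`, `metadata`, and `updated` are hypothetical names chosen for this illustration. The check treats an RFC 822 date as an acceptable update date, which is an assumption based on the `pubDate` values in the example feeds.

```python
from email.utils import parsedate_to_datetime


def check_item(item):
    """Return a list of problems found in one feed item (empty if none).

    `item` is a plain dict with the hypothetical keys
    guid, link, metadata, and updated.
    """
    problems = []
    if not item.get("guid"):
        problems.append("missing GUID for the recordset")
    if not item.get("link"):
        problems.append("missing link to the recordset file")
    # The metadata link may be implied by the GUID, so it is only
    # reported when there is no GUID to derive it from.
    if not item.get("metadata") and not item.get("guid"):
        problems.append("no metadata link and no GUID to derive one from")
    updated = item.get("updated")
    if not updated:
        problems.append("missing update date")
    else:
        try:
            parsedate_to_datetime(updated)
        except (TypeError, ValueError):
            problems.append("update date is not a parseable RFC 822 date")
    return problems
```

An item that passes returns an empty list. A date that parses is not necessarily one "that can be relied upon"; that still depends on the provider updating it whenever the dataset changes.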

Examples from other systems

An example of an RSS feed generated by IPT (http://hymfiles.biosci.ohio-state.edu:8080/ipt/rss.do):

<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:ipt="http://ipt.gbif.org/">
	<channel>
		<title>xBioD IPT in the Museum of Biological Diversity at the Ohio State University</title>
		<link>http://hymfiles.biosci.ohio-state.edu:8080/ipt</link>
		<description>Resource metadata of xBioD IPT in the Museum of Biological Diversity at the Ohio State University</description>
		<language>en-us</language>
		<!-- RFC-822 date-time  / Wed, 02 Oct 2010 13:00:00 GMT -->
		<pubDate>Mon, 16 Dec 2013 09:51:00 -0500</pubDate>
		<lastBuildDate>Thu, 15 May 2014 13:15:00 -0400</lastBuildDate>
		<generator>GBIF IPT 2.0.5-r4398-security-update-1</generator>
		<webMaster>cora.1@osu.edu () ()</webMaster>
		<docs>http://cyber.law.harvard.edu/rss/rss.html</docs>
		<ttl>15</ttl>
		<geo:Point>
			<geo:lat>39.9971388</geo:lat>
			<geo:long>-83.0439822</geo:long>
		</geo:Point>
	</channel>
</rss>

An example of a custom RSS feed from Symbiota (http://portal.neherbaria.org/portal/webservices/dwc/rss.xml):

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>CNH portal Darwin Core Archive rss feed</title>
		<link>http://portal.neherbaria.org/portal/</link>
		<description>CNH portal Darwin Core Archive rss feed</description>
		<language>en-us</language>
		<item collid="27">
			<title>Harvard University-A DwC-Archive</title>
			<image>http://www.huh.harvard.edu/images/huh_logo_bw_100.png</image>
			<description>Darwin Core Archive for Herbarium of the Arnold Arboretum (Harvard University Herbaria)</description>
			<guid>http://portal.neherbaria.org/portal/collections/misc/collprofiles.php?collid=27</guid>
			<guid>80b71fde-2241-4777-bfd3-3bdd075b8ba5</guid>
			<emllink>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.eml</emllink>
			<type>DWCA</type>
			<recordType>DWCA</recordType>
			<link>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.zip</link>
			<pubDate>Thu, 17 Apr 2014 11:49:03</pubDate>
		</item>
	</channel>
</rss>
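
The Symbiota feed names some fields differently from the iDigBio feeder output: `guid` rather than `id`, `emllink` rather than `ipt:eml`, and `recordType` rather than `recordtype`. The Python sketch below shows one way a consumer could tolerate both spellings. The helper names `first_text` and `normalize_item` are hypothetical, and taking the first of several `guid` elements is an arbitrary choice made for this illustration.

```python
import xml.etree.ElementTree as ET

IPT_NS = {"ipt": "http://ipt.gbif.org/"}


def first_text(item, paths):
    """Return the text of the first matching, non-empty child element."""
    for path in paths:
        text = item.findtext(path, namespaces=IPT_NS)
        if text:
            return text.strip()
    return None


def normalize_item(item):
    """Map an <item> element from either feed style to common keys."""
    return {
        "id": first_text(item, ["id", "guid"]),
        "eml": first_text(item, ["ipt:eml", "emllink"]),
        "recordtype": first_text(item, ["recordtype", "recordType"]),
        "type": first_text(item, ["type"]),
        "link": first_text(item, ["link"]),
        "pubDate": first_text(item, ["pubDate"]),
    }


# Shortened copy of the Symbiota item shown above.
SYMBIOTA_ITEM = """<item collid="27">
  <title>Harvard University-A DwC-Archive</title>
  <guid>http://portal.neherbaria.org/portal/collections/misc/collprofiles.php?collid=27</guid>
  <guid>80b71fde-2241-4777-bfd3-3bdd075b8ba5</guid>
  <emllink>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.eml</emllink>
  <type>DWCA</type>
  <recordType>DWCA</recordType>
  <link>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.zip</link>
  <pubDate>Thu, 17 Apr 2014 11:49:03</pubDate>
</item>"""

normalized = normalize_item(ET.fromstring(SYMBIOTA_ITEM))
```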

What happens when your RSS is ready?

Once the links are received, an iDigBio IT staff member checks that the URLs work as expected and adds them to the dataset manager. On a weekly basis, the dataset manager downloads and hashes the datasets and all individual records, and pre-validates that IDs are unique within each dataset (trivial collisions). If the dataset is new, all of its records are staged for ingestion. If the dataset is an update, the true difference between the current and new datasets is computed using the hashes, the changes are staged for ingestion and committed to the iDigBio specimen API, and Elasticsearch is reindexed.
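
The hash-based comparison described above can be illustrated with a short Python sketch. This is not the dataset manager's actual code: the function names, the use of SHA-256 over a JSON serialization, and the `id` field name are all assumptions made for the example.

```python
import hashlib
import json


def record_hash(record):
    """Hash one record; sorted keys make the hash independent of key order."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def index_by_id(records, id_field="id"):
    """Map record ID to record hash, rejecting duplicate IDs in a dataset."""
    index = {}
    for record in records:
        rid = record[id_field]
        if rid in index:
            raise ValueError(f"duplicate record ID within dataset: {rid}")
        index[rid] = record_hash(record)
    return index


def diff_datasets(current, new):
    """Return the IDs that were added, removed, or changed in `new`."""
    old_idx = index_by_id(current)
    new_idx = index_by_id(new)
    shared = old_idx.keys() & new_idx.keys()
    return {
        "added": sorted(new_idx.keys() - old_idx.keys()),
        "removed": sorted(old_idx.keys() - new_idx.keys()),
        "changed": sorted(r for r in shared if old_idx[r] != new_idx[r]),
    }
```

Under these assumptions, only the records listed under "added" and "changed" would need to be staged when a dataset is updated, which is why unchanged records cost nothing beyond the hash comparison.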


All records in a dataset are expected to have a Globally Unique IDentifier (GUID). Consult our Data Ingestion Guidance for more information on the data requirements. For additional terms not covered by DwC, we recommend consulting the MISC WG about whether an alternative term exists or a new one needs to be created. A laundry list of terms currently in use can be found at:


Go back to CYWG.