It is straightforward to set up a feed between KE EMu and an IPT instance from which iDigBio can harvest. Perhaps the simplest approach is to use the scheduled operations facility in EMu to write a template that generates an output file (e.g., CSV or tab-delimited text) containing Darwin Core metadata to be ingested by the IPT. This output file can be produced automatically via operations at whatever frequency is desired. Some mechanism can then be used to move the output file to a location where the IPT can read it, either manually through the IPT UI or through a batch process. At Yale, we automate the entire workflow using cron so that 10 IPT resources are reinstantiated from EMu every day. The IPT uses MySQL as its metadata source and lives on a server separate from EMu. The output files from EMu are text files, which are copied via scp from the EMu server to the IPT server and used as input for daily MySQL table refreshes (truncate table xxx; load data local infile 'yyy' into table xxx;). In turn, the IPT is set to publish its 10 resources automatically on a daily basis.
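As a rough sketch of how such a cron-driven refresh can be wired together, the script below copies a nightly EMu export to the IPT server and reloads the MySQL table that backs one IPT resource. The host name, file paths, database, and table names are placeholders rather than the actual Yale configuration, and the script assumes MySQL credentials are supplied via a client configuration file and that LOAD DATA LOCAL INFILE is permitted on both client and server.

<pre>
#!/bin/sh
# Sketch of a daily EMu -> IPT refresh. All names below are placeholders,
# not the actual Yale setup.

EMU_HOST="emu.example.edu"                 # server where EMu writes its export
EXPORT_FILE="/exports/mammals_dwc.txt"     # Darwin Core text export from EMu
LOCAL_FILE="/data/ipt/mammals_dwc.txt"     # staging copy on the IPT server
DB="ipt_source"                            # MySQL database the IPT reads from
TABLE="mammals_dwc"                        # one table per IPT resource

# 1. Copy the latest export from the EMu server to the IPT server.
scp "${EMU_HOST}:${EXPORT_FILE}" "${LOCAL_FILE}" || exit 1

# 2. Refresh the backing table (credentials assumed to come from ~/.my.cnf).
mysql --local-infile=1 "${DB}" <<SQL
TRUNCATE TABLE ${TABLE};
LOAD DATA LOCAL INFILE '${LOCAL_FILE}' INTO TABLE ${TABLE};
SQL

# 3. The IPT is configured separately to republish each resource daily,
#    so no further step is needed here.
</pre>

A script like this can be run from a daily cron entry on the IPT server, once per resource or in a loop over the full list of resources.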
==Concern about duplicate record ingestion==

'''Definition of a duplicate record in iDigBio:''' Duplicate records are two or more records in iDigBio that provide information on a single physical specimen. These records come to iDigBio from different sources. An example would be a record coming directly from the source where the physical specimen is preserved, and a copy of the same information coming from an intermediary such as an aggregator.

'''iDigBio's expectation from providers:''' To facilitate detection of duplicates, iDigBio expects providers to use identical unique identifiers in the occurrenceID field. The institution holding the specimen should assign and preserve this identifier.

'''Detecting duplicates:''' Duplicates can be detected reliably ''only'' if the expectation above is met. Until consistent identifiers are present in the aggregated data, and until the community formulates viable use cases for the desired handling of duplicate records in the portal, iDigBio does not attempt to flag these records.
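As a purely hypothetical illustration of why consistent identifiers matter, a query along the following lines could surface occurrenceID values that arrive from more than one record set. The database, table, and column names are invented for this example and do not reflect iDigBio's actual schema or current practice.

<pre>
#!/bin/sh
# Hypothetical duplicate check: which occurrenceID values appear in records
# from more than one source? Table and column names are placeholders.
mysql --batch aggregated <<SQL
SELECT occurrenceid, COUNT(DISTINCT recordset_id) AS n_sources
FROM occurrence_records
GROUP BY occurrenceid
HAVING n_sources > 1;
SQL
</pre>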
==Error handling==
When data are received from the provider during the '''mobilizing''' process step, they are evaluated for fitness.