In this upcoming episode of Darwin Core Hour (DCH), we head down under to join Arthur D. Chapman and Lee Belbin for a conversation about Darwin Core and Data Quality. We'll hear about collaborative efforts across the aggregator community (specifically GBIF, VertNet, ALA, and iDigBio) to harmonize data quality algorithms for downstream use of the data by researchers, developers, and others. These efforts are the result of work by the Biodiversity Information Standards (TDWG) Data Quality Interest Group (DQIG) to develop a suite of common, shared tests and data format expectations for use by aggregators.

Read more about these Tests and Assertions on the DQIG Wiki: https://github.com/tdwg/bdq/wiki/Task-Group-2-(Tests-and-Assertions)-of-the-'Data-Quality'-Interest-Group-seek-your-comments

Scientists, aggregators, custodians and curators of biodiversity data have varying understandings of, and requirements for, ‘quality’ when it comes to biodiversity-related data. Some users are only interested in having accurate names of the organisms, others are focused on location, others on the date of collection, and others on any of the remaining 150+ Darwin Core terms. Aggregators are primarily interested in delivering any data they can find, knowing that a subset will be useful to some users. They are also keen to present data in as comprehensive a form as possible, with documented quality, so that users can determine if the data they seek is fit for their use. Data custodians and curators want the data that they are responsible for to be as good as it can be for as many purposes as possible. To date, testing for these data quality requirements has been highly inconsistent and largely haphazard, and the documentation and annotations that accompany those data have been similarly inconsistent.

Task Group 2 of the TDWG Data Quality Interest Group has been working over the past two years to develop a set of core tests that can be consistently applied by all users, aggregators and data custodians. Once set this task, the group quickly realised that it was virtually impossible to run consistent tests for all the fields in the databases, or even for all the fields (terms) documented in the Darwin Core Standard. For this reason, the focus was placed on a subset of Darwin Core terms: those that represent the what, where and when of the data. A bite of the core, so to speak. There was also a recognition that such core tests could be implemented by most. Tests for all of these terms were gathered from the key groups known to be conducting data quality tests, and a consistent set of tests was aligned in a template based on a set of principles. A second step has been to develop a consistent set of annotations, or assertions, about the data so that reporting on quality can be done in a consistent and stable manner. This will help users to know that the data they obtain from one source has been tested and documented in the same way as data from another source.

Generic code is now being developed for each of the 110 or so data quality tests, along with test data sets for ensuring consistent implementations. Institutions can take the core set of tests (or a subset of them) and use the generic code and test data set to implement a consistent in-house data quality ‘test and assertion’ regime. Different Database Management Systems use different software and structures, and thus local implementations may initially produce different results. By running the tests against the reference test data set, users will be able to determine whether their implementation produces the expected results and, if it does not, modify the implementation accordingly.
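To make the idea concrete, here is a minimal Python sketch of what one ‘test and assertion’ and a check against a reference data set might look like. It is only an illustration, not the Task Group's generic code: the test name `latitude_in_range`, the status labels, the `Assertion` structure and the four reference cases are all assumptions made up for this example. Only the term name `decimalLatitude` comes from Darwin Core.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A record is a flat mapping of Darwin Core term names to raw string values.
Record = Dict[str, Optional[str]]


@dataclass(frozen=True)
class Assertion:
    """Result of one test on one record (hypothetical structure)."""
    test: str
    status: str
    comment: str


def validate_latitude_in_range(record: Record) -> Assertion:
    """Hypothetical test: is decimalLatitude a number between -90 and 90?"""
    name = "latitude_in_range"  # illustrative name, not an official test label
    raw = record.get("decimalLatitude")
    if raw is None or raw.strip() == "":
        return Assertion(name, "PREREQUISITES_NOT_MET", "decimalLatitude is empty")
    try:
        latitude = float(raw)
    except ValueError:
        return Assertion(name, "PREREQUISITES_NOT_MET", "decimalLatitude is not a number")
    if -90.0 <= latitude <= 90.0:
        return Assertion(name, "COMPLIANT", "decimalLatitude is within range")
    return Assertion(name, "NOT_COMPLIANT", "decimalLatitude is outside -90..90")


# Made-up reference cases: an input record paired with the expected status.
REFERENCE_CASES: List[Tuple[Record, str]] = [
    ({"decimalLatitude": "-35.28"}, "COMPLIANT"),
    ({"decimalLatitude": "95.0"}, "NOT_COMPLIANT"),
    ({"decimalLatitude": ""}, "PREREQUISITES_NOT_MET"),
    ({"decimalLatitude": "south"}, "PREREQUISITES_NOT_MET"),
]


def check_implementation(
    test: Callable[[Record], Assertion],
    cases: List[Tuple[Record, str]],
) -> List[str]:
    """Run a local implementation over reference cases and list any mismatches."""
    mismatches = []
    for record, expected in cases:
        actual = test(record).status
        if actual != expected:
            mismatches.append(f"{record!r}: expected {expected}, got {actual}")
    return mismatches


if __name__ == "__main__":
    problems = check_implementation(validate_latitude_in_range, REFERENCE_CASES)
    print("implementation is consistent" if not problems else "\n".join(problems))
```

An institution would replace the body of the test with its own database-specific logic and keep the reference cases unchanged, so that an empty mismatch list indicates the local implementation agrees with the shared expectations.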

Several of the large data aggregators, including GBIF, the Atlas of Living Australia and iDigBio, have agreed to implement the core set of tests and assertions and are in the process of implementing them now.

BDQ GitHub: https://github.com/tdwg/bdq
Data Quality Tests: http://bit.ly/2uY3FoA
Principles: http://bit.ly/2vqoGsK

Darwin Core has become a broadly used standard for biodiversity data sharing since its adoption as a standard by Biodiversity Information Standards (TDWG) in 2009. Despite, or because of, its popularity, people trying to use the standard continue to have questions about how to use Darwin Core and its associated extensions. This webinar series looks at open questions related to Darwin Core. Though the topic is broad, individual chapters in the series will each focus on a specific topic at an adequate level of depth. We encourage people to bring or submit questions and to have open discussions in each webinar.

Tuesday, September 05, 2017
