It Takes Years for a Good Wine to Mature: Task Group 2 - data quality tests and assertions

Lee Belbin; Arthur Chapman; Paul J. Morris; John Wieczorek

doi:10.3897/biss.6.91078

Biodiversity Information Science and Standards : Conference Abstract

Lee Belbin^‡, Arthur Chapman^§, Paul J. Morris^|, John Richard Wieczorek^¶

‡ Blatant Fabrications Pty Ltd, Hobart, Australia

§ Australian Biodiversity Information Service, Ballan, Australia

| Harvard University, Boston, United States of America

¶ University of California, Berkeley, United States of America

Corresponding author: Lee Belbin (leebelbin@gmail.com)

Received: 31 Jul 2022 | Published: 01 Aug 2022

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Citation: Belbin L, Chapman A, Morris PJ, Wieczorek JR (2022) It Takes Years for a Good Wine to Mature: Task Group 2 - data quality tests and assertions. Biodiversity Information Science and Standards 6: e91078. https://doi.org/10.3897/biss.6.91078

Abstract

Data Quality Task Group 2 was established to create a suite of core tests and associated assertions about the 'quality' of biodiversity informatics data (Chapman et al. 2020). The group has been active since January 2017, about four years longer than its four main members would have anticipated. We all thought “How hard could it be?” The answer was “Harder than we thought!” We have invested the equivalent of well over two years of full-time work in this project. Several times over the past five years we thought we were 95% done, but we were wrong. Were we dumb? I doubt it! The authors (other than the lead author) are highly experienced in biodiversity data quality, Darwin Core and data testing. Nor were we lazy.

Why has it gone so slowly? It is mostly due to the complexity of the task and the inability to meet face-to-face. Zoom just doesn’t cut it for this type of work. We achieved the most at our one face-to-face meeting in Gainesville (Florida) in 2018. Our advances over the past year have come from rounds of feedback between the test specifications, the test implementations, the development of data for validating the tests, and the comparison of implementation results with the expectations of the validation data. We hope there are useful lessons in this for similar projects.

We now have a solid base from which future evolution, such as tests for specific environments, will be relatively easy. The major components of this project are the 99 tests themselves, the parameters for these tests (see https://github.com/tdwg/bdq/issues/122), a vocabulary of the terms used in the framework, and test data for validating implementations of the tests.

We remain focused on what we call core tests: those that provide power in evaluating ‘fitness for use’, are widely applicable and are relatively easy to implement. Each test description we have settled on comprises:

A human readable label (split into a test class, a target Darwin Core term and an ‘action’);
A Globally Unique Identifier for the test (a GUID);
A simple English description;
Test class from the Fitness-for-Use Framework (Data Quality Task Group 1): Validation, Amendment, Measure or Issue;
Resource Type (all of the Core tests operate on a single record);
Information Elements (specified as the applicable Darwin Core Class and as a list of specific Darwin Core terms required as inputs for the test);
Specification (an explanation of how the test works from an implementation perspective);
Data quality dimension (from the Fitness-for-Use Framework);
Warning type (ambiguous, amended, incomplete, invalid, issue, report, unlikely);
Parameters (options that allow implementations to behave differently in clearly defined ways such as the use of a national species list);
Source Authority (external references required by the test);
An example;
Source (the origin of the test);
References;
Link to reference implementations;
Link to source code; and
Notes (explanations of subtle or not so subtle aspects of the test).
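As a rough illustration of how the descriptive elements listed above might be held in code, the following Python sketch models a subset of them as a record. This is an assumption-laden sketch, not the TG2 data model: the class and field names (CoreTestDescriptor, target_term and so on) are hypothetical, the GUID is a placeholder, the example values are invented for illustration, and the bibliographic fields (example, source, references, links) are omitted for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum

# Warning types as enumerated in the list above.
WARNING_TYPES = frozenset(
    {"ambiguous", "amended", "incomplete", "invalid", "issue", "report", "unlikely"}
)


class CoreTestClass(Enum):
    """The four test classes named in the Fitness-for-Use Framework."""
    VALIDATION = "Validation"
    AMENDMENT = "Amendment"
    MEASURE = "Measure"
    ISSUE = "Issue"


@dataclass(frozen=True)
class CoreTestDescriptor:
    """Hypothetical container for one test description (subset of fields)."""
    test_class: CoreTestClass
    target_term: str            # target Darwin Core term
    action: str                 # the 'action' part of the label
    guid: str                   # globally unique identifier for the test
    description: str            # simple English description
    information_elements: tuple[str, ...]  # Darwin Core terms used as inputs
    specification: str
    dimension: str              # data quality dimension
    warning_type: str
    resource_type: str = "single record"   # all core tests operate on one record
    parameters: dict[str, str] = field(default_factory=dict)
    source_authority: str = ""
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject warning types outside the enumerated set.
        if self.warning_type not in WARNING_TYPES:
            raise ValueError(f"unknown warning type: {self.warning_type!r}")

    @property
    def label(self) -> str:
        # Assumed label layout: test class, target term and action joined by '_'.
        return "_".join(
            [self.test_class.name, self.target_term.upper(), self.action.upper()]
        )


# Illustrative values only; none of these are taken from the TG2 specifications.
example = CoreTestDescriptor(
    test_class=CoreTestClass.VALIDATION,
    target_term="countryCode",
    action="standard",
    guid="00000000-0000-0000-0000-000000000000",  # placeholder, not a real test GUID
    description="Is the country code a recognised code?",
    information_elements=("dwc:countryCode",),
    specification="(illustrative) compare the value against a source authority",
    dimension="Conformance",
    warning_type="invalid",
)
print(example.label)  # VALIDATION_COUNTRYCODE_STANDARD
```

Deriving the label from its three parts, rather than storing it as free text, is one way to keep labels consistent across a suite of tests; whether the task group generates labels this way is not stated here.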

The composition of the core tests has been stable for over a year. We have generated most of the test data using the template: the applicable test, a unique identifier, input data, expected output data, the response status (e.g., “internal prerequisites not met”), the response result (e.g., “not compliant”), and an optional comment.
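The validation-data template lends itself to a simple comparison loop: run an implementation on the input data and compare its response with the expected status and result. The Python sketch below shows one way this could look. It is a minimal sketch under stated assumptions: the names ValidationCase, Response and check_case are hypothetical, the toy test is not one of the 99 core tests, and the strings "run has result" and "compliant" are assumed values alongside the two examples quoted above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Mapping


@dataclass(frozen=True)
class ValidationCase:
    """One row of validation data, following the template described above."""
    test_label: str                      # the applicable test
    case_id: str                         # unique identifier of this row
    input_data: Mapping[str, str]        # Darwin Core terms and values
    expected_status: str                 # e.g. "internal prerequisites not met"
    expected_result: str                 # e.g. "not compliant"
    expected_output: Mapping[str, str] = field(default_factory=dict)
    comment: str = ""                    # optional


@dataclass(frozen=True)
class Response:
    """What a test implementation returns (hypothetical shape)."""
    status: str
    result: str


def check_case(
    case: ValidationCase,
    implementation: Callable[[Mapping[str, str]], Response],
) -> bool:
    """Return True when the implementation matches the expected status and result."""
    actual = implementation(case.input_data)
    return (actual.status, actual.result) == (case.expected_status, case.expected_result)


def toy_countrycode_notempty(record: Mapping[str, str]) -> Response:
    """Toy test for illustration only; not a TG2 specification."""
    if "dwc:countryCode" not in record:
        return Response("internal prerequisites not met", "")
    if record["dwc:countryCode"].strip():
        return Response("run has result", "compliant")      # assumed strings
    return Response("run has result", "not compliant")


blank_case = ValidationCase(
    test_label="TOY_COUNTRYCODE_NOTEMPTY",   # hypothetical label
    case_id="case-001",                      # hypothetical identifier
    input_data={"dwc:countryCode": ""},
    expected_status="run has result",
    expected_result="not compliant",
    comment="blank value should fail",
)
print(check_case(blank_case, toy_countrycode_notempty))  # True
```

Running every case against every implementation and collecting the mismatches gives the kind of feedback between specifications, implementations and validation data that the abstract describes.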

What remains to be done? We need to complete the test data, produce normative and non-normative documentation, and transform our work into a TDWG Technical Specification. While TG2 is over 95% complete, we would still welcome contributions from anyone interested in learning about biodiversity data quality.

Keywords

specifications, vocabulary, biodiversity data, validation, amendment, report

Presenting author

Lee Belbin

Presented at

TDWG 2022

Acknowledgements

We acknowledge the significant contributions of Paula Zermoglio and Alex Thompson as original TG2 team members. We also value the comments of Deborah Paul and Allan Koch Veiga on our GitHub issues.

References

Chapman A, Belbin L, Zermoglio P, Wieczorek J, Morris P, Nicholls M, Rees ER, Veiga A, Thompson A, Saraiva A, James S, Gendreau C, Benson A, Schigel D (2020) Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. Biodiversity Information Science and Standards. https://doi.org/10.3897/biss.4.50889
