Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Lee Belbin (leebelbin@gmail.com)
Received: 25 Sep 2020 | Published: 01 Oct 2020
© 2020 Lee Belbin, Arthur Chapman, John Wieczorek, Paul J. Morris, Paula Zermoglio
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Belbin L, Chapman A, Wieczorek J, Morris PJ, Zermoglio PF (2020) Task Group 2 – Data Quality Tests and Assertions. Biodiversity Information Science and Standards 4: e58982. https://doi.org/10.3897/biss.4.58982
Motivation
Other than data availability, ‘data quality’ is probably the most significant issue for users of biodiversity data, especially the research community. The Data Quality Tests and Assertions Task Group (TG-2) of the Biodiversity Information Standards (TDWG) Biodiversity Data Quality Interest Group is reviewing practical aspects of ‘data quality’, with the goal of providing current best practice at the key interface between data users and data providers: tests and assertions. If an internationally agreed standard suite of core tests and resulting assertions were used by all data providers and aggregators, and ideally by data collectors, then greater and more appropriate use could be made of biodiversity data. By adopting this suite of core tests, data providers, and particularly aggregators such as the Global Biodiversity Information Facility (GBIF) and its nodes, would gain credibility with user communities and could provide more effective information for evaluating ‘fitness for use’.
Goals, Outputs and Outcomes
Strategy
The tests and rules that generate assertions at the record level are more fundamental than the tools or workflows that will be based on them. The priority is to create a fully documented suite of core tests that defines a framework for ready extension across terms and domains.
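To make the idea of a record-level test and its resulting assertion concrete, the following is a minimal sketch in Python. It is not taken from the TG-2 test suite: the function name, the `Assertion` structure and the three status labels are hypothetical and chosen only for illustration. The only element borrowed from Darwin Core is the term dwc:decimalLatitude, whose valid range is -90 to 90.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Assertion:
    """Outcome of one test applied to one record (hypothetical structure)."""
    test: str     # name of the test that produced the assertion
    term: str     # Darwin Core term that was examined
    status: str   # illustrative labels: "compliant", "not compliant", "not evaluated"
    comment: str  # human-readable explanation of the outcome


def validate_decimal_latitude_in_range(
    record: Mapping[str, Optional[str]],
) -> Assertion:
    """Check that dwc:decimalLatitude, if present, is a number in [-90, 90]."""
    test = "validate_decimal_latitude_in_range"  # hypothetical test name
    term = "dwc:decimalLatitude"
    raw = record.get(term)

    # An empty value cannot be judged, so the test reports that it did not run.
    if raw is None or raw.strip() == "":
        return Assertion(test, term, "not evaluated", "value is empty")

    try:
        value = float(raw)
    except ValueError:
        return Assertion(test, term, "not compliant", f"{raw!r} is not a number")

    if -90.0 <= value <= 90.0:
        return Assertion(test, term, "compliant", "value is within -90 to 90")
    return Assertion(test, term, "not compliant", f"{value} is outside -90 to 90")


if __name__ == "__main__":
    example_record = {"dwc:decimalLatitude": "-35.3"}
    print(validate_decimal_latitude_in_range(example_record))
```

Keeping each test as a small function from a record to an assertion means that tools and workflows can be layered on top later without changing the tests themselves, which is the ordering of priorities described above.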
Status 2019-2020
The core tests have proven to be far more complex than any of the team had anticipated. Several times over the past three years, we believed we had finalized the tests, only to find new issues that required a fresh understanding and subsequent edits. The most recent example is the dropping of the two tests related to dwc:identificationQualifier.
This decision resulted from a review of dwc:identificationQualifier values in GBIF records and an evaluation of expected values based on the Darwin Core definition of the term. Aside from there being many distinct values, the term expects the qualifier to be expressed in relation to a given taxonomic name, and the rules of open nomenclature are too unevenly adopted across data records for dwc:identificationQualifier to be reliably parsed and detected, so the tests could not be effective.
A similar situation occurs for dwc:scientificName, where we have resorted to the term “polynomial” to refer to the non-authorship part of the name.
What has occurred during the past year?
We will provide details of the challenges, the breakdown of the tests and the advances of the project.
Keywords: Darwin Core, fitness-for-use, parameters, code
Presenting author: Lee Belbin
Presented at: TDWG 2020