Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Felipe Simoes (simoes@plazi.org)
Received: 17 Sep 2021 | Published: 20 Sep 2021
© 2021 Felipe Simoes, Donat Agosti, Marcus Guidoti
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Simoes F, Agosti D, Guidoti M (2021) Delivering Fit-for-Use Data: Quality control. Biodiversity Information Science and Standards 5: e75432. https://doi.org/10.3897/biss.5.75432
Automatic data mining is not an easy task and its success in the biodiversity world is deeply tied to the standardization and consistency of scientific journals' layout structure. The various formatting styles found in the over 500 million pages of published biodiversity information (
However, in the era of big data, the liberation of all the different facts contained in biodiversity literature is of crucial importance. Plazi tackles this daunting task by providing workflows and technology to automatically process biodiversity publications and annotate the information therein, all within the principles of FAIR (findable, accessible, interoperable, and reusable) data usage (
To cope with this task without compromising data quality, Plazi has established a quality control process based on logical rules that check the components of the extracted document and raise errors at four levels of severity. These errors also feed a data transit control mechanism, “the gatekeeper”, which blocks certain data transits, whether to create deposits (e.g., BLR) or to allow reuse of data (e.g., GBIF), when specific errors are present. Finally, a set of automatic notifications was added to the plazi/community GitHub repository, providing a channel through which external users can report data issues directly to a dedicated team of data miners, who in turn fix these issues in a timely manner, improving data quality on demand.
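The rule-and-gatekeeper idea described above can be illustrated with a minimal Python sketch. It is not Plazi's implementation: the severity names, the rule functions, the error codes, the document fields, and the mapping of blocking errors to BLR and GBIF are all hypothetical, chosen only to show how logical rules that raise errors at four severity levels could drive a check on data transits.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List, Optional, Set


class Severity(IntEnum):
    # Four levels, as in the text; these names are placeholders.
    MINOR = 1
    MEDIUM = 2
    CRITICAL = 3
    BLOCKER = 4


@dataclass(frozen=True)
class QcError:
    code: str
    severity: Severity
    message: str


# A rule inspects one extracted document and returns an error or None.
Rule = Callable[[dict], Optional[QcError]]


def missing_title(doc: dict) -> Optional[QcError]:
    # Hypothetical rule: the document metadata must carry a title.
    if not doc.get("title"):
        return QcError("missing_title", Severity.CRITICAL, "document has no title")
    return None


def treatment_without_taxon(doc: dict) -> Optional[QcError]:
    # Hypothetical rule: every treatment must name its taxon.
    for treatment in doc.get("treatments", []):
        if not treatment.get("taxon_name"):
            return QcError(
                "treatment_without_taxon",
                Severity.BLOCKER,
                "a treatment lacks a taxon name",
            )
    return None


RULES: List[Rule] = [missing_title, treatment_without_taxon]

# Hypothetical gatekeeper table: error codes that block each data transit.
BLOCKING_CODES: Dict[str, Set[str]] = {
    "BLR": {"missing_title"},
    "GBIF": {"treatment_without_taxon"},
}


def run_rules(doc: dict, rules: List[Rule]) -> List[QcError]:
    """Apply every rule and collect the errors that were raised."""
    errors = []
    for rule in rules:
        error = rule(doc)
        if error is not None:
            errors.append(error)
    return errors


def allowed_transits(errors: List[QcError]) -> Dict[str, bool]:
    """Return, per transit, whether none of its blocking errors is present."""
    raised = {error.code for error in errors}
    return {
        transit: not (raised & blocking)
        for transit, blocking in BLOCKING_CODES.items()
    }
```

In this sketch, a document with a title but a treatment lacking a taxon name would still be allowed to transit to BLR while being held back from GBIF, which mirrors the idea that only specific errors block specific transits.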
In this talk, we aim to explain Plazi’s internal quality control process and phases, the data transits that are potentially affected, as well as statistics on the most common issues raised by this automated endeavor and how we use the generated data to continuously improve this important step in Plazi's workflow.
Keywords: annotations, biodiversity data, FAIRification, TreatmentBank
Presenting author: Felipe Simoes
Presented at: TDWG 2021
The BiCIKL (Biodiversity Community Integrated Knowledge Library — https://bicikl-project.eu/) project receives funding from the European Union's Horizon 2020 Research and Innovation Action under grant agreement No 101007492