Proceedings of TDWG : Conference Abstract
Corresponding author: Saniya Sahdev (saniyasahdev@ufl.edu)
Received: 15 Aug 2017 | Published: 15 Aug 2017
© 2017 Saniya Sahdev, Deborah Paul, Matthew Collins, Jose Fortes
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Sahdev S, Paul D, Collins M, Fortes J (2017) Automated Generation of Lists of Unique Values from iDigBio Data Fields to Facilitate Data Quality Improvements. Proceedings of TDWG 1: e20306. https://doi.org/10.3897/tdwgproceedings.1.20306
iDigBio currently has over 100 million records with up to 260 fields per record [
The Darwin Core Hour webinar initiative [
One place to start improving data quality is with the 23 fields in the DwC standard for which a controlled vocabulary is recommended. A call went out to large aggregators to share comma-separated values (CSV) files listing the distinct values found in each of these 23 fields, along with a count for each value. The responses from iDigBio, the Global Biodiversity Information Facility (GBIF), and VertNet are stored in the TDWG Darwin Core Q&A GitHub repository [
Based on this community need for more insight into controlled vocabulary data, and on experience with iDigBio’s existing data cleaning approaches, we have constructed an automated process to generate lists of unique values in iDigBio fields. We used the data available from dumps of the entire iDigBio data set, which are written out weekly and stored on the GUODA (Global Unified Open Data Access) infrastructure [
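The core of such a process, listing each distinct value in a field together with its count, can be sketched in a few lines. The following is a minimal illustration only, not the pipeline described in this abstract: it assumes the records are available as CSV rows keyed by Darwin Core field names, and the function names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def distinct_value_counts(rows, field):
    """Count each distinct value of `field`, exactly as it appears in the data."""
    counts = Counter()
    for row in rows:
        value = row.get(field)
        if value is None or value == "":
            continue  # skip records where the field is missing or empty
        counts[value] += 1
    return counts


def write_counts_csv(counts, out):
    """Write value/count pairs to `out`, most frequent value first."""
    writer = csv.writer(out)
    writer.writerow(["value", "count"])
    for value, count in counts.most_common():
        writer.writerow([value, count])


# Hypothetical sample records (invented for illustration, not iDigBio data).
sample = io.StringIO(
    "basisOfRecord,country\n"
    "PreservedSpecimen,USA\n"
    "preservedspecimen,U.S.A.\n"
    "PreservedSpecimen,USA\n"
    "FossilSpecimen,\n"
)
counts = distinct_value_counts(csv.DictReader(sample), "basisOfRecord")

report = io.StringIO()
write_counts_csv(counts, report)
print(report.getvalue())
```

Values are deliberately left unnormalized here, so that spelling and capitalization variants remain visible as separate rows in the output.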
Dynamically generating these distinct-value data is a first step toward understanding the vocabularies currently in use by data providers. With summarization and clustering algorithms, the data in each field can be readily visualized and analyzed. With these data, anyone can see patterns beyond typos and counts, and metrics can be put in place. As discipline-specific communities are able to easily see what is in a given field, they can work together to synthesize recommended vocabularies to improve future data. As the data are improved, both the number of distinct clusters and the number of values within a given cluster would be expected to decrease. Without automated tools of this kind that build data products from aggregated data, many data quality issues would be much harder to tackle.
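To make the cluster-based metrics concrete, the sketch below groups variant spellings with a simple key-collision ("fingerprint") method and reports the number of clusters and the number of values in each. The abstract does not state which clustering algorithms were used, so this technique is a stand-in chosen for illustration; the function names and the sample counts are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: no accents, no punctuation, lowercase, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(sorted(set(text.split())))


def cluster_values(counts):
    """Group distinct values whose fingerprints collide; keep each value's count."""
    clusters = defaultdict(dict)
    for value, count in counts.items():
        clusters[fingerprint(value)][value] = count
    return dict(clusters)


# Hypothetical distinct-value counts for one field (invented for illustration).
counts = {
    "PreservedSpecimen": 120,
    "preservedspecimen": 7,
    "Preserved Specimen": 3,
    "preserved specimen.": 1,
    "FossilSpecimen": 40,
}
clusters = cluster_values(counts)

# Two simple metrics: how many clusters exist, and how many variants each holds.
print("clusters:", len(clusters))
for key, members in clusters.items():
    print(key, "->", len(members), "variants,", sum(members.values()), "records")
```

A key-collision method only merges values that differ in case, punctuation, accents, or token order. Variants that differ in spacing or spelling, such as "PreservedSpecimen" and "Preserved Specimen" above, stay in separate clusters and would need an edit-distance or similar approach to be merged.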
Keywords: Biodiversity, Data Quality, Data Cleaning, Darwin Core, Cloud Computing, Bio Collections Infrastructure
Matthew Collins