Biodiversity Information Science and Standards : Conference Abstract
Corresponding author: John Waller (jwaller@gbif.org)
Received: 29 Apr 2019 | Published: 13 Jun 2019
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation: Waller J (2019) Data Location Quality at GBIF. Biodiversity Information Science and Standards 3: e35829. https://doi.org/10.3897/biss.3.35829
I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues, such as gridded datasets (Fig.
A gridded dataset animation illustrating the expectation of users versus the reality of the underlying occurrence data.
GBIF is the largest open-data portal for biodiversity data, built on a large network of individual datasets (> 40k) from various sources and publishers. Because these datasets vary both internally and from one dataset to the next, they pose a challenge for users who want to use data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis.
Data quality at GBIF will always be a moving target (
One reason is that many GBIF datasets are gridded. Gridded datasets have low spatial resolution because their records are sampled at equally spaced points. This can be a data quality issue because a user might assume that an occurrence was recorded exactly at its stated coordinates. Country centroids are another reason why a species occurrence record might lie far from where the species occurs naturally. These are records where the dataset publisher has entered the latitude-longitude center of a country instead of leaving the field blank, and GBIF does not yet flag them. I will discuss the challenges of detecting these issues and the current solutions (such as the CoordinateCleaner R package).
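To make the idea of a gridded dataset concrete, the following is a minimal Python sketch of one possible heuristic: if most gaps between neighbouring unique coordinate values are identical, the points probably sit on a regular grid. This is not the method used by GBIF or by CoordinateCleaner; the function names `dominant_gap` and `looks_gridded`, the rounding precision, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def dominant_gap(values, decimals=4):
    """Return (most common gap, share of gaps equal to it) for one axis.

    `values` holds the latitudes or the longitudes of one dataset.
    Coordinates are rounded first so that floating-point noise does not
    hide an otherwise regular spacing (the precision is an assumption).
    """
    unique = sorted({round(v, decimals) for v in values})
    if len(unique) < 3:
        return None, 0.0  # too few distinct values to judge spacing
    gaps = [round(b - a, decimals) for a, b in zip(unique, unique[1:])]
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)


def looks_gridded(lats, lons, min_share=0.5):
    """Heuristic: gridded if both axes are dominated by a single gap.

    `min_share` is an illustrative threshold, not a published value.
    """
    _, lat_share = dominant_gap(lats)
    _, lon_share = dominant_gap(lons)
    return lat_share >= min_share and lon_share >= min_share


if __name__ == "__main__":
    # Hypothetical coordinates spaced 0.1 degrees apart on both axes.
    lats = [50.0, 50.1, 50.2, 50.3, 50.1, 50.2]
    lons = [4.0, 4.1, 4.2, 4.3, 4.0, 4.1]
    print(dominant_gap(lats), looks_gridded(lats, lons))
```

A dataset that passes such a check would still need manual review, since regular spacing can also arise from coordinates that were merely rounded.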
I will touch on how existing Darwin Core Archive (DwC-A) terms such as coordinateUncertaintyInMeters and footprintWKT are being used to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
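As a rough illustration of how a data user might act on these two terms, the Python sketch below sorts occurrence records into coarse resolution classes. It assumes each record is a plain dictionary keyed by Darwin Core term names; the function name `resolution_class`, the class labels, and the 1000 m threshold are hypothetical choices, not GBIF behaviour.

```python
def resolution_class(record, max_uncertainty_m=1000.0):
    """Classify one occurrence record by its stated coordinate resolution.

    Returns one of:
      "low"     - coordinateUncertaintyInMeters exceeds the threshold
      "area"    - a footprintWKT is given, so the location is a shape
      "unknown" - neither term carries usable information
      "ok"      - uncertainty is stated and within the threshold
    The threshold is an illustrative assumption.
    """
    raw = record.get("coordinateUncertaintyInMeters")
    try:
        uncertainty = float(raw)
    except (TypeError, ValueError):
        uncertainty = None  # missing or non-numeric value

    if uncertainty is not None and uncertainty > max_uncertainty_m:
        return "low"
    if record.get("footprintWKT"):
        return "area"
    if uncertainty is None:
        return "unknown"
    return "ok"


if __name__ == "__main__":
    # Hypothetical records for demonstration only.
    records = [
        {"coordinateUncertaintyInMeters": "5000"},
        {"footprintWKT": "POLYGON((4 50, 5 50, 5 51, 4 51, 4 50))"},
        {"coordinateUncertaintyInMeters": ""},
        {"coordinateUncertaintyInMeters": 30},
    ]
    for rec in records:
        print(resolution_class(rec))
```

Treating a missing uncertainty as "unknown" rather than "ok" is a deliberate choice here, because an empty field says nothing about how precise the coordinates are.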
GBIF, Data Quality, Gridded Datasets, Country Centroids, CoordinateCleaner, coordinateUncertaintyInMeters, footprintWKT
John Waller