Corresponding author: John Thomas Waller (
Geographic outliers at
DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether a point is an outlier. Because occurrence data can be very patchy, non-clustering distance-based methods will often fail Fig.
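To make the two parameters concrete, here is a minimal brute-force sketch of DBSCAN in Python. It is not the Spark job described in this abstract. The function names, the great-circle distance, the coordinates, and the parameter values (`eps_km=100`, `min_pts=3`) are all assumptions chosen for illustration.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force scan; the result includes point i itself.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical occurrence coordinates (lat, lon): two patches and one stray point.
points = [
    (35.68, 139.76), (35.70, 139.70), (35.65, 139.80), (35.60, 139.65), (35.72, 139.72),
    (55.68, 12.57), (55.70, 12.50), (55.65, 12.60), (55.62, 12.55),
    (0.0, 0.0),
]
labels = dbscan(points, eps_km=100, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
```

With these made-up inputs the two patches form separate clusters and only the stray point is labeled noise, even though the patches are thousands of kilometres apart. A single global distance threshold would struggle with that kind of patchiness.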
Advantages of DBSCAN:
Simple
Easy to understand
Only two parameters to set
Scales well
No additional data sources needed
Users would understand how their data was changed
Drawbacks:
Only uses distance
Must choose parameter settings
Sensitive to sparse global sampling
Does not include any other relevant environmental information
Can only flag outliers outside of a point blob
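The sensitivity to parameter settings and to sparse sampling can be illustrated with a small sketch. It relies on the fact that DBSCAN's noise points are exactly those that are neither core points nor within the distance threshold of a core point, so no cluster expansion is needed to list them. The helper `noise_indices`, the planar coordinates (in arbitrary km units), and the three `eps` values are hypothetical and chosen only for illustration.

```python
import math

def noise_indices(points, eps, min_pts):
    """Indices DBSCAN would label as noise: not core and not within eps of a core point."""
    def near(i):
        # Neighbors within eps, including point i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(near(i)) >= min_pts}
    return [i for i in range(len(points))
            if i not in core and not any(j in core for j in near(i))]

# Hypothetical planar coordinates (km): a dense blob, a sparsely
# sampled but genuine group, and one stray point.
points = [
    (0, 0), (1, 0), (0, 1), (1, 1),    # densely sampled area
    (50, 50), (58, 50), (50, 58),      # sparsely sampled area
    (200, 200),                        # stray point
]

too_small = noise_indices(points, eps=2, min_pts=3)    # sparse group flagged too
moderate = noise_indices(points, eps=10, min_pts=3)    # only the stray point flagged
too_large = noise_indices(points, eps=300, min_pts=3)  # nothing flagged
```

In this toy setup a small distance flags the whole sparsely sampled group as outliers, while a very large distance merges everything into one cluster and hides the stray point. Neither result reflects data quality; both follow from the parameter choice.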
Outlier detection and error
Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (
There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done somewhat easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
John Thomas Waller
TDWG 2020
This example shows that DBSCAN is able to cluster effectively while flagging points with low additional support in Japan (