Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Ni Yan (ni.yan@naturalis.nl)
Received: 11 Sep 2024 | Published: 12 Sep 2024
© 2024 Laurens Hogeweg, Ni Yan, Django Brunink, Khadija Ezzaki-Chokri, Wilfred Gerritsen, Rita Pucci, Burooj Ghani, Dan Stowell, Vincent Kalkman
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hogeweg L, Yan N, Brunink D, Ezzaki-Chokri K, Gerritsen W, Pucci R, Ghani B, Stowell D, Kalkman VJ (2024) AI Species Identification Using Image and Sound Recognition for Citizen Science, Collection Management and Biomonitoring: From Training Pipeline to Large-Scale Models. Biodiversity Information Science and Standards 8: e136839. https://doi.org/10.3897/biss.8.136839
Biodiversity data are being generated at an unprecedented rate by field monitoring sensors (e.g., wildlife and insect cameras, sound recorders, radars), citizen science observations, digitised museum collections, and biodiversity and environmental research. Deep neural networks make it possible to identify species automatically in multimedia data (e.g., image, sound, radar, DNA) with increasing accuracy and efficiency, a task that taxonomic experts could not perform at the rate and scale at which these data are generated. Artificial intelligence (AI) models can thus help interpret biodiversity data and automate tasks such as species identification.
At Naturalis Biodiversity Center, we developed several AI species identification models using image or sound recognition for citizen science, collection management and biomonitoring purposes. We present here a pipeline for training large-scale AI species identification models combining multiple sources of image training data that cover the most commonly encountered macro-organisms in Europe.
The training pipeline is shown in Fig.
As shown in Fig.
Measured on the same held-out test data, which were not used for training the models, the 2023 large-scale multi-source model (MSM), fine-tuned and customised for Observation.org, showed a significant performance improvement over the 2021 model trained only on Observation.org data. As shown in Fig.
Measured performance improvement of MSM (2023) vs Observation model (2021) using the same test data. Accuracy is measured as the percentage of observations in which the first prediction is correct. Average recall is measured as the average recognition rate per taxon. Higher values indicate better recognition of rarer taxa.
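The two metrics defined in this caption can be illustrated with a minimal sketch. The function names and the toy labels below are assumptions made for illustration and are not taken from the authors' evaluation code; the sketch only follows the definitions given above (accuracy as the share of observations whose first prediction is correct, average recall as the mean recognition rate per taxon).

```python
from collections import defaultdict


def top1_accuracy(true_labels, predicted_labels):
    """Share of observations whose first prediction equals the true taxon."""
    if not true_labels:
        raise ValueError("at least one observation is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def average_recall(true_labels, predicted_labels):
    """Mean of the per-taxon recall, so every taxon weighs the same."""
    if not true_labels:
        raise ValueError("at least one observation is required")
    totals = defaultdict(int)  # observations per true taxon
    hits = defaultdict(int)    # correctly identified observations per true taxon
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[t] / totals[t] for t in totals) / len(totals)


# Hypothetical, imbalanced toy data: 8 observations of a common taxon, 2 of a rare one.
true_labels = ["common"] * 8 + ["rare"] * 2
predicted_labels = ["common"] * 8 + ["common", "rare"]

print(top1_accuracy(true_labels, predicted_labels))   # 0.9
print(average_recall(true_labels, predicted_labels))  # 0.75
```

In this toy case one of the two rare observations is missed. Accuracy stays at 0.9 because the common taxon dominates, while average recall drops to 0.75, which is why the latter is more sensitive to how well rarer taxa are recognised.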
Fig.: Strong class imbalance in the data and its effect on accuracy and average recall in the 2023 MSM for Observation.org mollusks (grey area: 95% of observations).
Fig.: Measured performance improvement of the MSM (2023) vs the Artsdatabanken model (2022) for Norwegian arthropods (grey area: 95% of observations).
The large-scale species identification model, with its 39 specialised models, has been deployed as an auto-scaling web service used by seven biodiversity portals in Europe (as of 2024). It performed about 65 million identifications in the past 12 months (Aug 2023–Aug 2024). Citizen scientists and the interested public can use it to identify European flora and fauna through a web interface and/or interactive mobile apps, which speeds up the collection of citizen science data.
Advanced features for this large-scale species identification model are under continuous development. In the 2023 model, we implemented explicit probability calibration of AI identifications, which enables automatic validation. Auto-validation suggests which AI identifications carry a low enough risk to be accepted without expert review. Advanced features to be implemented in the 2024 model include prediction probabilities at all taxonomic levels (the 2023 model provides them at species level only) and life-stage models for other species groups. Planned advanced features for 2025 include context-aware identification (using location, time and neighbouring species to improve identification), rejecting invalid and unusable input such as selfies, poor-quality media and unknown taxa (
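One way to picture how probabilities at several taxonomic levels could interact with auto-validation is sketched below. This is an illustration under stated assumptions, not the authors' method: the abstract does not say how higher-rank probabilities are computed or which threshold is used. The sketch assumes that species probabilities are calibrated and mutually exclusive, so that summing them within a genus or family gives the probability of that higher taxon. The toy taxonomy, the function names `roll_up` and `most_specific_confident`, and the threshold values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: species -> (genus, family).
TAXONOMY = {
    "Aeshna cyanea": ("Aeshna", "Aeshnidae"),
    "Aeshna mixta": ("Aeshna", "Aeshnidae"),
    "Anax imperator": ("Anax", "Aeshnidae"),
    "Libellula depressa": ("Libellula", "Libellulidae"),
}


def roll_up(species_probs, taxonomy):
    """Sum species probabilities into genus- and family-level probabilities."""
    genus_probs = defaultdict(float)
    family_probs = defaultdict(float)
    for species, prob in species_probs.items():
        genus, family = taxonomy[species]
        genus_probs[genus] += prob
        family_probs[family] += prob
    return {
        "species": dict(species_probs),
        "genus": dict(genus_probs),
        "family": dict(family_probs),
    }


def most_specific_confident(levels, threshold):
    """Return (rank, taxon, probability) for the finest rank whose best
    candidate reaches the threshold, or None if no rank qualifies."""
    for rank in ("species", "genus", "family"):
        taxon, prob = max(levels[rank].items(), key=lambda item: item[1])
        if prob >= threshold:
            return rank, taxon, prob
    return None


# Hypothetical calibrated species-level output for one observation.
species_probs = {
    "Aeshna cyanea": 0.5,
    "Aeshna mixta": 0.25,
    "Anax imperator": 0.125,
    "Libellula depressa": 0.125,
}

levels = roll_up(species_probs, TAXONOMY)
print(most_specific_confident(levels, threshold=0.7))   # ('genus', 'Aeshna', 0.75)
print(most_specific_confident(levels, threshold=0.95))  # None
```

With a threshold of 0.7 the observation would not be accepted at species level (0.5) but could be accepted at genus level (0.75). With a stricter threshold of 0.95 nothing qualifies, and the observation would be left for expert review.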
We have developed this large-scale multi-source model using citizen science observation data from several European biodiversity portals. This AI training pipeline can be applied to develop other large-scale, multi-source algorithms for biodiversity monitoring with sensor input (e.g., insect cameras), digitised museum collection identification as part of the digitisation and collection management workflow, and sound recognition models for citizen science and biomonitoring.
Keywords: artificial intelligence, machine learning, deep learning, hierarchical model, accuracy, recall, class imbalance
Presenting author: Ni Yan
Presented at: SPNHC-TDWG 2024
Some of our AI work is partially funded by the European Commission through the projects MAMBO, GUARDEN and TETTRIs. Views and opinions expressed are those of the author(s) only.
Naturalis Biodiversity Center, Leiden, The Netherlands