Biodiversity Information Science and Standards: Conference Abstract
Corresponding authors: Auguste Gardette (auguste.gardette@ird.fr), Youcef Sklab (youcef.sklab@ird.fr), Eugeni Belda (eugeni.belda@ird.fr)
Received: 28 Aug 2024 | Published: 28 Aug 2024
© 2024 Auguste Gardette, Youcef Sklab, Eugeni Belda, Eric Chenin, Jean-Daniel Zucker
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Gardette A, Sklab Y, Belda E, Chenin E, Zucker J-D (2024) METAPLANTCODE: Harmonizing Plant Metabarcoding Pipelines in Europe. Biodiversity Information Science and Standards 8: e135729. https://doi.org/10.3897/biss.8.135729
The METAPLANTCODE project is dedicated to advancing and optimizing pan-European case studies on metabarcoding. The project's objectives include providing best practice recommendations, optimizing analysis pipelines for species identification, and creating user-friendly reference databases. To accomplish these objectives, METAPLANTCODE will identify and address gaps in current methodologies, publish best practice documents on FAIR (Findable, Accessible, Interoperable, Reusable) data publishing for plant metabarcode data to GBIF (Global Biodiversity Information Facility) and the INSDC (International Nucleotide Sequence Database Collaboration), and implement ELIXIR-compatible multimodal deep learning (DL) models in novel tools for standalone metabarcoding analyses using various data sources.
A significant focus of the project is enhancing species identification accuracy through GBIF records and metadata. This involves mapping regional, national, and international botanical taxonomic checklists, red lists, and floras to the Catalogue of Life (COL) via the COL ChecklistBank. Additionally, taxonomic and floristic literature will be semantically enriched with new entity recognition and relationship extraction modules, supporting the enhanced identification of species through domain-specific descriptive and phenotypic features. An interface will link taxonomic names to treatments, identify homonyms and synonyms, and facilitate the conversion and annotation of floras, red lists, and ecological treatments. All METAPLANTCODE products will adhere to FAIR standards by the project's end.
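As a minimal illustration of the name-linking idea described above, the following sketch indexes a toy checklist so that a synonym resolves to its accepted name and a name pointing to more than one accepted taxon (as a homonym can) is flagged as ambiguous. It does not call the COL ChecklistBank; the function names, the checklist layout, and any taxon names passed in are hypothetical and chosen only for illustration.

```python
def normalise(name):
    """Collapse whitespace and lowercase a name so that spelling variants match."""
    return " ".join(name.split()).lower()


def build_name_index(checklist):
    """Map each normalised name to the set of accepted names it may refer to.

    `checklist` is assumed to be a dict: accepted name -> list of synonyms.
    """
    index = {}
    for accepted, synonyms in checklist.items():
        for name in [accepted, *synonyms]:
            index.setdefault(normalise(name), set()).add(accepted)
    return index


def resolve(name, index):
    """Return a (status, accepted_names) pair for a queried name.

    status is "resolved" for exactly one match, "ambiguous" for several
    (for example a homonym), and "unmatched" when the name is unknown.
    """
    matches = index.get(normalise(name), set())
    if not matches:
        return ("unmatched", [])
    if len(matches) == 1:
        return ("resolved", sorted(matches))
    return ("ambiguous", sorted(matches))
```

In practice the ambiguous cases are the ones that would need extra context, such as authorship or the source checklist, before a name can be attached to a single taxon.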
The project emphasizes knowledge transfer from the outset, engaging with associated partners and stakeholders. Key stakeholders will be identified, priorities set, and communication channels established, monitored, and adjusted as necessary. Efforts to enhance stakeholder engagement, training, and outreach will ensure that plant metabarcoding becomes a routine standard for biodiversity monitoring in Europe and beyond.
Deep Learning for Plant Metabarcoding
Within the METAPLANTCODE project, our team is tasked with improving taxonomic precision by integrating deep learning on metabarcoding data and metadata. Previous studies have demonstrated the applicability of deep learning to non-plant barcoding data and its computational efficiency compared to traditional bioinformatics approaches (
Deep Learning Models for Metabarcoding Data
Our approach involves evaluating the efficacy of several deep learning models—such as Convolutional Neural Networks (CNN)(
Performance comparison of machine learning and deep learning models on a dataset of 156 plant species using 16 barcodes per species from PLANiTS database (
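The abstract does not state how barcode sequences are encoded before they reach the models. As a hedged sketch of two common choices, the code below one-hot encodes a DNA sequence into a fixed-length matrix (a typical input for a convolutional network) and counts overlapping k-mers (a typical feature set for classical machine learning baselines). The function names, the A/C/G/T channel order, and the handling of ambiguous bases are assumptions of this example, not details taken from the project.

```python
from collections import Counter

# Channel order for the one-hot encoding (an assumption of this sketch).
ALPHABET = "ACGT"


def one_hot_encode(sequence, length):
    """Encode a DNA sequence as a `length` x 4 matrix of floats.

    Sequences longer than `length` are truncated and shorter ones are padded.
    Ambiguous bases (such as N) and padding positions become all-zero rows.
    """
    rows = []
    for base in sequence.upper()[:length]:
        rows.append([1.0 if base == letter else 0.0 for letter in ALPHABET])
    while len(rows) < length:
        rows.append([0.0] * len(ALPHABET))
    return rows


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping windows that contain a non-ACGT character."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts
```

Padding to a fixed length keeps every input the same shape, which batched training generally requires; the k-mer representation discards position information but is independent of sequence length.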
Multimodal Refinement of Predictions
In the subsequent phase, we aim to refine genetic sequence classifications by employing a multimodal strategy. This approach will integrate genetic information with traditional botanical knowledge. We will utilize biological interaction lists (e.g., species-species, species-habitat) provided by the METAPLANTCODE project to train a large language model (LLM) on relevant scientific literature. This LLM, specifically tailored for plant biodiversity, will incorporate metadata associated with genetic samples (including location, temporality, and climatic conditions). By merging embeddings of both metadata and genetic data, we aim to enhance the accuracy of taxonomic predictions (Fig.
Proposed multimodal integration framework for enhancing taxonomic predictions in plant metabarcoding.
Upper Left: The genomics module represents models that generate initial taxonomic predictions based on sequencing data.
Upper Right: A graph represents biodiversity knowledge as relationships between species and habitats, derived from biological interaction lists (e.g., species-species, species-habitat).
Center: The framework's core is a multimodal approach using a Large Language Model (LLM) trained on plant biodiversity literature. This LLM integrates metadata (e.g., location, temporality, climate) with genetic data, combining their embeddings to improve taxonomic predictions.
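One simple way to combine the two kinds of embeddings is late fusion by concatenation followed by a linear layer and a softmax over candidate taxa. The sketch below illustrates that idea only; the abstract does not specify the fusion mechanism, so the concatenation step, the single linear layer, and all names and parameters here are assumptions of this example.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities, shifting by the maximum for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(seq_embedding, meta_embedding, weights, biases):
    """Concatenate a sequence embedding and a metadata embedding, then classify.

    `weights` holds one row per candidate taxon; each row must have
    len(seq_embedding) + len(meta_embedding) entries. `biases` holds one
    value per candidate taxon. Returns one probability per taxon.
    """
    fused = list(seq_embedding) + list(meta_embedding)
    scores = []
    for row, bias in zip(weights, biases):
        if len(row) != len(fused):
            raise ValueError("weight row length must match fused embedding length")
        scores.append(sum(w * x for w, x in zip(row, fused)) + bias)
    return softmax(scores)
```

With this design, the metadata can shift probability between taxa that the sequence embedding alone cannot separate, for instance when two candidates have near-identical barcodes but different known ranges. In a trained system the weights would be learned jointly rather than set by hand.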
Conclusion
Through this research, we aim to develop an effective method for integrating genetic data with textual information from various sources. We anticipate that this approach will not only enhance plant metabarcoding but also apply to barcoding of other groups, such as bacteria, fish, and fungi. We also expect the methodology to find broader applications in genomic research across diverse biological disciplines.
Keywords: AI, multi-modality, biodiversity, barcoding, botany, herbarium
Presenting author: Auguste Gardette
Presented at: SPNHC-TDWG 2024
Biodiversa+ is a European Biodiversity Partnership supporting excellent research on biodiversity with an impact for society and policy.