Biodiversity Information Science and Standards : Conference Abstract
Relation Extraction From Unstructured Species Descriptions Using TaxoNERD and LLaMA 2 7B
Fabricio De Jesus Rios Montero‡, Ervin Rodríguez§, Maria Mora Cross§
‡ Computer science, Heredia, Costa Rica
§ Computer science, Alajuela, Costa Rica
Open Access

Abstract

Ontologies are essential tools for organizing information on taxonomy, ecology, and inter-species relationships, helping to standardize ecological data and facilitate integration of large datasets. Combining ontologies with advanced Natural Language Processing (NLP) techniques, such as Named Entity Recognition (NER) and Relation Extraction (RE), has greatly improved the discovery of insights from unstructured scientific texts, particularly in biodiversity (Gabud et al. 2023, Abdelmageed et al. 2022, Hearst 1992).

This study combines ontologies and NLP to analyze complex trophic interactions among animal species (Gabud et al. 2023), using a dataset of species descriptions in English and Spanish (National Biodiversity Institute of Costa Rica (INBio) 2015). We applied TaxoNERD to identify taxonomic entities (Le Guillarme and Thuiller 2021) and fine-tuned LLaMA 2 7B (Large Language Model Meta AI) to extract feeding interactions and predator-prey relationships (CheeKean 2023). We chose LLaMA 2 for its effectiveness in handling complex language patterns and its adaptability to diverse scientific domains.
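The two-stage flow described above can be pictured with a minimal Python sketch. It is not the authors' implementation. The taxa are hard-coded stand-ins for what a NER step such as TaxoNERD would return, the example sentence is illustrative and not taken from the INBio dataset, and the prompt template, the `predator -> prey` output format, and the names `Relation`, `build_prompt`, and `parse_relations` are all hypothetical. The model call itself is omitted; the sketch only shows how recognized taxa could be passed into a prompt and how a generated answer could be parsed into structured records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One extracted feeding interaction: predator eats prey."""
    predator: str
    prey: str


def build_prompt(description: str, taxa: list[str]) -> str:
    """Assemble a hypothetical instruction prompt for a fine-tuned LLM."""
    taxa_list = ", ".join(taxa)
    return (
        "Extract feeding relationships from the species description.\n"
        f"Taxa mentioned: {taxa_list}\n"
        f"Description: {description}\n"
        "Answer with one 'predator -> prey' pair per line."
    )


def parse_relations(generated: str) -> list[Relation]:
    """Parse 'predator -> prey' lines and skip anything malformed."""
    relations = []
    for line in generated.splitlines():
        if "->" not in line:
            continue
        predator, prey = (part.strip() for part in line.split("->", 1))
        if predator and prey:
            relations.append(Relation(predator, prey))
    return relations


# Assumed inputs: in a real pipeline `taxa` would come from the NER step
# and `generated` from the fine-tuned model. Both are hard-coded here.
description = "Panthera onca preys on Tayassu pecari."
taxa = ["Panthera onca", "Tayassu pecari"]
prompt = build_prompt(description, taxa)
generated = "Panthera onca -> Tayassu pecari\nnot a relation line"
relations = parse_relations(generated)
```

Keeping the parsing step separate from the generation step means malformed model output can be dropped or logged without affecting the structured records that are kept.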

Our results (Fig. 1) showed a recall of 0.73 and a precision of 0.68, indicating that the model recovers most feeding relationships. However, the lower precision suggests that it still captures some unrelated interactions, so reducing false positives is the main area for improvement (Touvron et al. 2023). Previous studies also emphasize the need to refine relation extraction models further (Mora-Cross et al. 2023). The resulting structured dataset offers insights into species’ diets and ecological roles, contributing to biodiversity research and conservation efforts (Mora-Cross et al. 2023, Touvron et al. 2023).
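To make the link between precision and false positives concrete, here is a minimal sketch that scores predicted triples against a gold standard by exact match. This is an assumption made for illustration only: the figure for this abstract reports BERTScore distributions, not exact-match counts, and the triples and the function name `precision_recall_f1` are invented for the example.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of relation triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy triples (predator, relation, prey); not taken from the study's data.
gold = {
    ("Panthera onca", "eats", "Tayassu pecari"),
    ("Harpia harpyja", "eats", "Bradypus variegatus"),
}
predicted = {
    ("Panthera onca", "eats", "Tayassu pecari"),
    ("Harpia harpyja", "eats", "Bradypus variegatus"),
    ("Bradypus variegatus", "eats", "Harpia harpyja"),  # false positive
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case every gold relation is found, so recall is 1.0, but the single spurious triple pulls precision down to 2/3.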

Figure 1.

Distribution of BERTScore metrics (F1, Precision, Recall), with most scores between 0.6 and 0.9. BERT stands for Bidirectional Encoder Representations from Transformers. Outliers in Precision suggest areas for improving accuracy and reducing false positives.
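For readers unfamiliar with the metric, BERTScore compares a candidate text with a reference text by greedy matching of token embeddings. Recall averages, over the reference tokens, the best cosine similarity to any candidate token, and precision does the same over the candidate tokens. The sketch below is a toy version under stated assumptions: it uses hand-made 2-D vectors instead of contextual BERT embeddings, omits the optional IDF weighting and baseline rescaling, and the function names `cosine` and `greedy_bertscore` are hypothetical. It is not the evaluation code behind Fig. 1.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_bertscore(candidate, reference):
    """Return (precision, recall, f1) from greedy token matching."""
    # Precision: match each candidate token to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: match each reference token to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "embeddings": the candidate covers both reference tokens but adds
# one extra token that resembles nothing in the reference.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

precision, recall, f1 = greedy_bertscore(candidate, reference)
```

The extra candidate token has no good match in the reference, so it lowers precision while leaving recall untouched, which is the same pattern the caption attributes to false positives.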

Moreover, this research highlights the potential of integrating AI-driven tools with ontological frameworks to manage and analyze biodiversity data at scale (Abdelmageed et al. 2022). By transforming unstructured text into structured data, we make ecological information more accessible, supporting better decision-making in conservation strategies (Abdelmageed et al. 2022, Hearst 1992). This approach scales well with the growing volume of biodiversity data, offering a more efficient and accurate method for analyzing species interactions, which are crucial for ecosystem management and endangered species protection (Gabud et al. 2023).

Keywords

biodiversity, ontologies, Named Entity Recognition (NER), Relation Extraction (RE), LLaMA2-7b, feeding relationships

Presenting author

Fabricio Ríos Montero

Presented at

SPNHC-TDWG 2024

Acknowledgements

This work was made possible through the support provided by the Instituto Tecnológico de Costa Rica (ITCR), the International Development Research Centre (IDRC) through the Central American Higher University Council (CSUCA), and the Costa Rican Innovation and Research Promoter of the Ministry of Science, Innovation, Technology, and Telecommunications (MICITT) of Costa Rica.

Conflicts of interest

The authors have declared that no competing interests exist.

References
