Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Vamsi Krishna Kommineni (vamsi.krishna.kommineni@uni-jena.de)
Received: 10 Sep 2024 | Published: 10 Sep 2024
© 2024 Vamsi Krishna Kommineni, Waqas Ahmed, Birgitta Koenig-Ries, Sheeba Samuel
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kommineni VK, Ahmed W, Koenig-Ries B, Samuel S (2024) Automating Information Retrieval from Biodiversity Literature Using Large Language Models: A Case Study. Biodiversity Information Science and Standards 8: e136735. https://doi.org/10.3897/biss.8.136735
Recently, Large Language Models (LLMs) have transformed information retrieval, becoming widely adopted across various domains due to their ability to process extensive textual data and generate diverse insights. Biodiversity literature, with its broad range of topics, is no exception to this trend (
In our previous work (
To evaluate our pipeline, we compared the expert-assisted manual approach with the LLM-assisted automatic approach. We measured their consistency using the inter-annotator agreement (IAA) and quantified it with the Cohen Kappa score (
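To make the agreement measure concrete, the following is a minimal sketch of how a Cohen Kappa score between two annotators could be computed. It is an illustration only, not the code used in this study: the function name `cohen_kappa` and the example labels are hypothetical, and the categorical "yes"/"no" answers are an assumption about how the manual and LLM-assisted annotations might be encoded.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies,
    # summed over all categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical category throughout;
        # agreement is complete, so report 1.0 instead of dividing by zero.
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations for four questions about one publication.
manual_labels = ["yes", "yes", "no", "no"]   # expert-assisted (assumed encoding)
llm_labels = ["yes", "no", "no", "no"]       # LLM-assisted (assumed encoding)
kappa = cohen_kappa(manual_labels, llm_labels)  # 0.5 for this toy input
```

A value of 1 indicates complete agreement, while 0 indicates agreement no better than chance.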
Future work will involve several key improvements to our LLM-assisted information retrieval pipeline:
Incorporating multimodal data (e.g., figures, tables, code, etc.) as input to the LLM, alongside text, to enhance the accuracy and comprehensiveness of the information retrieved from publications.
Optimizing the retrieval component of the RAG framework with advanced techniques such as semantic search, hybrid search, or relevance feedback to improve the quality of the outputs.
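As an illustration of the hybrid search idea in the list above, the sketch below merges a keyword-based ranking and a semantic (embedding-based) ranking of text chunks using reciprocal rank fusion. This is one common fusion technique chosen here as an assumption. It is not described in this abstract as part of the pipeline, and the function name, the chunk identifiers, and the constant `k = 60` are hypothetical choices for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); chunks are returned by descending total score.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical retrieval results for one query (best match first).
keyword_ranking = ["chunk_1", "chunk_2", "chunk_3"]    # e.g. from a lexical search
semantic_ranking = ["chunk_3", "chunk_1", "chunk_4"]   # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
# Chunks ranked well by both retrievers (chunk_1, chunk_3) move to the top.
```

The fused list could then be truncated to the top few chunks before they are passed to the LLM as context. Rank-based fusion avoids having to put lexical and embedding scores on a common scale.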
Leveraging LLMs to automate information retrieval from biodiversity publications signifies a notable advancement in the scalable and efficient analysis of biodiversity literature. Initial results show promise, yet there is substantial potential for enhancement through the integration of multimodal data, optimized retrieval mechanisms, and comprehensive evaluation. By addressing these areas, we aim to improve the accuracy and utility of our pipeline, ultimately enabling broader and more in-depth analysis of biodiversity literature.
Large Language Models (LLMs), information retrieval, deep learning, Retrieval Augmented Generation (RAG), biodiversity
Vamsi Krishna Kommineni
SPNHC-TDWG 2024
We acknowledge computing time on the HPC cluster Draco provided by the IT centre of the Thuringian universities.
Supported by the German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, funded by the German Research Foundation (FZT 118, 202548816) and Carl Zeiss Foundation for the project “A Virtual Werkstatt for Digitization in the Sciences (K3)” within the scope of the program line “Breakthroughs: Exploring Intelligent Systems for Digitization-Explore the Basics, Use Applications.”
Friedrich Schiller University Jena, Jena, Germany