Biodiversity Information Science and Standards
Conference Abstract
Corresponding author: Michael J Elliott (mielliott@ufl.edu)
Received: 25 Nov 2024 | Published: 26 Nov 2024
© 2024 Michael Elliott, Manuel Luciano, Jose Fortes
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Elliott M, Luciano M, Fortes J (2024) Integrating Large Language Models and the iDigBio Portal for Conversational Data Exploration and Retrieval. Biodiversity Information Science and Standards 8: e142696. https://doi.org/10.3897/biss.8.142696
The advent of cloud-based large language model (LLM) services such as ChatGPT (Generative Pre-Trained Transformer) has given rise to a wide array of novel artificial intelligence (AI) applications. In particular, LLMs have been used to power AI assistants that serve as intermediaries between human users and online web services, namely, web-based application programming interfaces (web APIs). These AI assistants allow users to make requests in natural language to initiate complex processes, ranging from searching a database to making a reservation.
We are exploring the development of AI assistants that can intelligently search for and process species occurrence data served by the Integrated Digitized Biocollections (iDigBio) Portal. Though the portal already provides a human-friendly search interface, it is tailored for a very particular use case: finding and inspecting records that match the user’s search parameters. However, the underlying iDigBio APIs that power the search interface offer direct access to biodiversity data and metadata that can support a wider range of applications. An LLM-powered AI assistant with access to such APIs has the potential to redefine how researchers discover and interact with scientific data by 1) allowing users to interact with scientific databases using natural language, 2) serving as a single unifying interface for many different use cases, and 3) enhancing the user’s experience with AI insights that are backed by citable, curated data.
Fig. An example conversation demonstrating our prototype chatbot's ability to perform record searches, count records, visualize species occurrences on a map, and initiate download requests.
Because the chatbot is intended for use by researchers, transparency is critical. When responding to user requests, LLMs often draw on their own internalized knowledge, which may be unreliable and is difficult to verify, or make up information entirely. It must therefore be abundantly clear how the chatbot forms its responses, in particular how the LLM interprets user requests and how it queries external APIs, so that users can independently assess the correctness of the chatbot's actions and link its conclusions back to data. The approach we adopted for the design of our prototype is illustrated in Fig.
Gray text (left) in the chatbot’s responses can be expanded to reveal information about actions it initiates. In this example, this includes a generated query to the iDigBio Summary API (middle) and a link to view the record counts returned by the API (right).
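As a rough illustration of what an inspectable, generated query might look like, the sketch below builds a record-count request URL that a user could open and check for themselves. It is not the prototype's code. It assumes the iDigBio search API's count endpoint at `https://search.idigbio.org/v2/summary/count/records` and its `rq` (record query) parameter carrying a JSON object; the function name `build_count_url` and the example field values are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the iDigBio record-count summary API (not stated in the abstract).
SUMMARY_COUNT_URL = "https://search.idigbio.org/v2/summary/count/records"


def build_count_url(record_query: dict) -> str:
    """Return a URL a user can open to see the record count for `record_query`."""
    # Assumption: the record query is passed as a JSON string in the `rq` parameter.
    rq = json.dumps(record_query, sort_keys=True)
    return f"{SUMMARY_COUNT_URL}?{urlencode({'rq': rq})}"


# Hypothetical query: records of the genus Acer collected in Canada.
url = build_count_url({"genus": "acer", "country": "canada"})
print(url)
```

Showing the URL alongside the chatbot's answer lets a user rerun the same request independently, without relying on the LLM's own summary of the result.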
The chatbot's behavior is tightly controlled by the rigid use of specialized AI agents with expert-defined validators (Fig.
Our prototype chatbot makes use of LLM-powered agents paired with expert-defined validators. The OpenAI logo indicates GPT-4-powered processes. The iDigBio logo indicates processes that call the iDigBio APIs.
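The pairing of an LLM-powered agent with an expert-defined validator can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's implementation: the field whitelist `ALLOWED_FIELDS`, the functions `validate_query` and `run_agent`, and the retry-with-feedback loop are all hypothetical, and `fake_llm` stands in for a real LLM call.

```python
import json
from typing import Callable

# Hypothetical whitelist of query fields that a domain expert might permit.
ALLOWED_FIELDS = {"genus", "family", "country", "stateprovince", "scientificname"}


def validate_query(raw: str) -> dict:
    """Parse LLM output and reject anything outside the expert-defined rules."""
    query = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(query, dict) or not query:
        raise ValueError("query must be a non-empty JSON object")
    unknown = set(query) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    for field, value in query.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field!r} must be a non-empty string")
    return query


def run_agent(generate: Callable[[str, str], str], request: str,
              max_attempts: int = 3) -> dict:
    """Call the generator until its output passes validation or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(request, feedback)
        try:
            return validate_query(raw)
        except ValueError as err:
            feedback = str(err)  # passed back to the generator on the next attempt
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {feedback}")


def fake_llm(request: str, feedback: str) -> str:
    """Stand-in for an LLM call: the first reply is invalid, the second is valid."""
    if not feedback:
        return '{"genus": "acer", "color": "red"}'
    return '{"genus": "acer"}'


result = run_agent(fake_llm, "How many maple records are there?")
print(result)
```

In this sketch the validator, not the LLM, has the final say: a generated query reaches the API only after it passes every check, so each API call can be traced back to a well-formed, reviewable request.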
As a prototype, the chatbot is currently limited to the few illustrative use cases we have outlined. However, as the system is incrementally refined and expanded, we envision the single chatbot interface being of interest to both the general public and researchers. For the general public, it may be a useful tool for learning about biodiversity in their local community and around the world. Researchers, meanwhile, may find the chatbot useful for quickly navigating and exploring iDigBio's hosted data and APIs. The prototype is hosted online at chat.acis.ufl.edu, with source code on GitHub.
Keywords: species occurrence records, LLM, Artificial Intelligence (AI), chatbot
Presenting author: Michael Elliott
Presented at: SPNHC-TDWG 2024
The research reported in this work was funded in part by grants from the National Science Foundation (DBI 2027654) and the AT&T Foundation.