Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Kristen "Kit" Lewers (krle4401@colorado.edu)
Received: 29 Nov 2024 | Published: 29 Nov 2024
© 2024 Kristen "Kit" Lewers
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Lewers K (2024) Comparative Methods for Building Chatbots: Open Source, Hybrid, and Fully Integrated Large Language Models. Biodiversity Information Science and Standards 8: e143032. https://doi.org/10.3897/biss.8.143032
In biodiversity informatics, accessible and comprehensible standards and vocabularies are pivotal for effective data management, research, policy, regulation, and education, among other activities. Biodiversity Information Standards (TDWG) provides a suite of standards crucial for the interoperability and consistency of biodiversity data; these standards are applied to petabytes of data aggregated at GBIF (Global Biodiversity Information Facility). Among these, Darwin Core (DwC;
This project introduces an approach to reducing the complexity of navigating TDWG standards. It aims to create specialized conversational interfaces using four methods: a fully open-source solution without large language models (LLMs), a fully open-source solution using LLMs, a hybrid approach leveraging OpenAI's API, and a fully integrated solution using GPT (Generative Pre-trained Transformer) models. These interfaces are designed to make it easier to query and interpret the nuanced aspects of biodiversity standards. This could be especially helpful for individuals who are new to biodiversity standards and are not sure where to start. The implementation would allow individuals to engage with the standards on their own time and on their own terms, even when members of the organization are unavailable. The urgency and importance of this project are underscored by the accelerating pace of biodiversity loss and the critical role of data standards in supporting research efforts. By enhancing the accessibility of TDWG standards, this project directly contributes to improving data management practices, thereby supporting the broader objectives of biodiversity informatics.
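As an illustration of what the first of these methods (open source, no LLM) could look like, the following is a minimal sketch of a retrieval-style responder that matches a question to the most similar term description using TF-IDF and cosine similarity. It is not the implementation used in this project. The term descriptions are short paraphrases written for illustration rather than the official Darwin Core definitions, and the names `TERMS`, `build_index`, and `answer`, as well as the similarity threshold, are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: paraphrased descriptions of a few Darwin Core
# terms, written for illustration only (not the official TDWG definitions).
TERMS = {
    "scientificName": "full scientific name of the organism with authorship if known",
    "eventDate": "date or time interval during which the collecting event occurred",
    "recordedBy": "people or organizations who recorded the occurrence",
    "decimalLatitude": "geographic latitude of the location in decimal degrees",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and TF-IDF vectors per term."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for name, toks in tokens.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, idf, vectors, threshold=0.1):
    """Return the best-matching term name, or None if nothing is similar enough.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    counts = Counter(t for t in tokenize(question) if t in idf)
    query = {t: tf * idf[t] for t, tf in counts.items()}
    best = max(vectors, key=lambda name: cosine(query, vectors[name]))
    return best if cosine(query, vectors[best]) >= threshold else None


IDF, VECTORS = build_index(TERMS)

if __name__ == "__main__":
    for q in ("Which term holds the latitude in decimal degrees?",
              "Who recorded this occurrence?"):
        term = answer(q, IDF, VECTORS)
        print(q, "->", term, "-", TERMS.get(term, "no match"))
```

A keyword-matching responder of this kind is inexpensive and transparent, but it cannot paraphrase or combine information across terms, which is one reason to compare it against the LLM-based methods.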
The methodology begins with a comprehensive data collection phase, targeting both the structured documentation of TDWG standards and the community-generated content on GitHub*
For this project, the choice of models was driven by both budget constraints and the need for accurate, detailed responses. For example, the older OpenAI davinci-002 model, despite being affordable and a GPT product, yielded unsatisfactory results, highlighting the trade-off between model capability and cost. The comparative analysis of the four methods is based on criteria such as performance, cost, ease of implementation, flexibility, and scalability; testing and iteration are still ongoing.
Developing a specialized chat model for biodiversity informatics standards is a complex, multi-step process that involves careful data collection, preparation, and iterative model training. Each method brings its own set of challenges and benefits, and the choice of method can significantly impact the chatbot's effectiveness and user satisfaction. Despite the complexities, the proofs of concept have so far demonstrated promising results and will continue to be refined with the goal of enhancing the tool’s accuracy and user-friendliness. The next steps in the project are user testing and feedback collection with participants who have varying levels of experience with TDWG standards. This project represents a confluence of artificial intelligence and community-sourced expertise aimed at bridging gaps in the field of biodiversity informatics. By making the TDWG standards more accessible and understandable, this initiative aims to enhance support for biodiversity informatics workflows, improve data management practices, and foster a deeper engagement with biodiversity data standards.
Keywords: NLP, machine learning, GPT, biodiversity informatics, Darwin Core, TDWG, biodiversity informatics workflows, user engagement, standards accessibility
Kristen "Kit" Lewers
SPNHC-TDWG 2024