Biodiversity Information Science and Standards
Conference Abstract
Corresponding author: De-Kai Kao (block58697@gmail.com)
Received: 19 Nov 2024 | Published: 19 Nov 2024
© 2024 De-Kai Kao, Chih-Kai Yang, Chien-Hsing Chen
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kao D-K, Yang C-K, Chen C-H (2024) Enhancing Plant Species Retrieval in Flora Through Language Model Integration. Biodiversity Information Science and Standards 8: e142132. https://doi.org/10.3897/biss.8.142132
Traditionally, textual data storage and retrieval systems were designed primarily for human reading and relied mainly on paper records. As information technology advanced, computerized searches became common. However, Boolean logic-based retrieval systems often struggle to handle the diversity and richness of data. These systems rely on strict matching rules, which can return either too few or too many results. For example, when searching plant species descriptions, a query like "circle" AND "ellipse" may exclude relevant records that describe these traits using slightly different terms (e.g., "round" or "oval"). Conversely, a broader query like "oblong" may return an overwhelming number of irrelevant results. This rigidity limits the system's ability to adapt to the nuanced and varied ways users describe data. With the advent of advanced semantic models such as SBERT (Sentence-Bidirectional Encoder Representations from Transformers) (
In plant taxonomy, records in Floras, such as the Flora of Taiwan or the Flora of China, play a crucial role in understanding plant diversity in specific regions. These records provide critical information on plant growth environments, morphological characteristics, and economic value.
Our research aims to enhance the efficiency of retrieving plant data using language models. Specifically, we transform textual descriptions from Flora and user queries into vector representations (Fig.
Cosine similarity and aggregated scoring for Flora trait queries.
The figure shows the cosine similarity scores computed between user queries and Flora traits. In the middle section, each row represents a specific trait of a plant species, while columns correspond to the user's query traits (Trait 1, Trait 2, Trait 3). Each cosine similarity score measures how closely a trait from the user's query aligns with a trait in the Flora dataset.
Red numbers highlight the highest similarity score for each species across all its traits, representing the trait most relevant to the user's query. At the bottom, the aggregated scores show the average of these highest scores, providing an overall similarity score for each species and ranking their relevance to the user's query.
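The scoring described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that, for each query trait, the highest cosine similarity over a species' traits is kept, and that these per-query maxima are then averaged into one score per species. The function names, species names, and three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def species_score(query_vectors, trait_vectors):
    """Average, over query traits, of the best-matching species trait."""
    best_per_query = [
        max(cosine_similarity(q, t) for t in trait_vectors)
        for q in query_vectors
    ]
    return sum(best_per_query) / len(best_per_query)


def rank_species(query_vectors, flora):
    """Return (species, score) pairs sorted from most to least similar."""
    scores = {
        name: species_score(query_vectors, traits)
        for name, traits in flora.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed): two query traits and two species, 3-dimensional vectors.
query_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
flora = {
    "species_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "species_b": [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}

ranking = rank_species(query_vectors, flora)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, species_a matches both query traits exactly and ranks first, while species_b only partially matches each query trait and ranks second.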
Embedding lookup retrieved from the language model.
This workflow illustrates the process of transforming textual data into vector representations using a pre-trained language model. The leftmost column contains the original textual inputs, including both species descriptions and user queries. These texts are associated with unique IDs (middle column) for reference. The retrieved vector representations (rightmost column) are numerical embeddings generated by the language model. Each row represents a unique vector, which captures the semantic meaning of the corresponding text. The numbers within each vector represent the values of individual dimensions in the vector space. These values are used for calculating the cosine similarity between the vectors.
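The lookup structure in this caption (text, unique ID, vector) can be sketched as follows. This is an illustration under stated assumptions only: `toy_embed` is a hypothetical stand-in that hashes words into a small count vector so the example runs without external dependencies. It does not capture semantic meaning the way a pre-trained language model does, and in practice it would be replaced by the model's own encoding step. The ID format and the example texts are also invented for illustration.

```python
import zlib

DIM = 8  # toy dimensionality (assumed); real sentence embeddings are far larger


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for a language model: hashed bag-of-words counts."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic, so the same token always lands in the same slot
        slot = zlib.crc32(token.encode("utf-8")) % dim
        vector[slot] += 1.0
    return vector


def build_lookup(texts, embed=toy_embed):
    """Assign each text a unique ID and store its vector under that ID."""
    lookup = {}
    for index, text in enumerate(texts):
        text_id = f"T{index:04d}"  # assumed ID scheme
        lookup[text_id] = {"text": text, "vector": embed(text)}
    return lookup


# Invented example inputs: one species description and one user query.
texts = [
    "leaves oblong, margin entire",
    "leaf round with smooth edge",
]

lookup = build_lookup(texts)
for text_id, entry in lookup.items():
    print(text_id, entry["text"], entry["vector"])
```

Keeping the vectors keyed by ID means each text is embedded once and the stored vectors can be reused for every subsequent cosine similarity calculation.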
Our results demonstrate the potential of language models to facilitate biodiversity research and data management, especially in retrieving plant taxonomy information. Our approach provides a novel tool for future biodiversity data analysis and retrieval, thereby contributing to the progress of biodiversity conservation.
Keywords: species identification, cosine similarity, semantic retrieval
De-Kai Kao
SPNHC-TDWG 2024