Biodiversity Information Science and Standards :
Conference Abstract
Corresponding author: Adeline Kerner (kerner@mnhn.fr)
Received: 17 Aug 2024 | Published: 19 Aug 2024
© 2024 Adeline Kerner, Elie Saliba, Nicolas Bailly, Thierry Bourgoin, Régine Vignes Lebbe
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kerner A, Saliba EM, Bailly N, Bourgoin T, Vignes Lebbe R (2024) Automatically Generated Texts for Fauna and Flora from Structured Data Based on a Controlled Vocabulary. Biodiversity Information Science and Standards 8: e134931. https://doi.org/10.3897/biss.8.134931
Information systems like Xper3 and Fulgoromorpha Lists On the Web (FLOW) play a crucial role in managing and using biological data. These platforms store extensive collections of normalized data and structured taxonomic descriptions. By using controlled terminologies, they standardize the vocabulary, significantly enhancing the processes of identification, description, and comparison of various taxa. The massive assemblages of data hosted in these repositories could be reused to generate texts in natural languages automatically. The most immediate goal is to produce, from these information systems, more accessible and user-friendly displays in the form of taxon summary pages.
This automated production of textual outputs is a valuable addition because the texts can be regenerated continuously as the databases evolve. Can structured data be reused to provide better species pages and to keep them up to date as the data evolve? Will AI assist in this process, or will specific computational tools be needed? To address these questions, several possible approaches have been identified.
The most basic level of this process involves converting a single line from a taxon-by-character matrix into text that resembles natural language, similar to the descriptions found in botanical or zoological guides. The primary goal is to move beyond the rigid format of matrix lines or lists of characteristics (such as characters and states) of a species, and instead generate a coherent, easy-to-read paragraph intended for human eyes.
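As a minimal sketch of this most basic level, the Python snippet below turns one matrix line into a sentence-like description. It is an illustration under stated assumptions, not the method used by Xper3 or FLOW: the taxon name, the characters, the states, and the functions `join_states` and `describe_row` are all hypothetical.

```python
# Hypothetical matrix row: each character maps to its scored states.
# The taxon, characters, and states are invented for illustration only.
row = {
    "taxon": "Example species",
    "characters": {
        "body length": ["3-5 mm"],
        "wing colour": ["brown", "grey"],
        "antenna shape": ["filiform"],
    },
}


def join_states(states):
    """Join states as 'a', 'a or b', or 'a, b or c'."""
    if len(states) <= 1:
        return "".join(states)
    return ", ".join(states[:-1]) + " or " + states[-1]


def describe_row(row):
    """Render one matrix row as a single sentence-like description."""
    clauses = [
        f"{character} {join_states(states)}"
        for character, states in row["characters"].items()
        if states  # skip characters with no scored state
    ]
    return f"{row['taxon']}: " + "; ".join(clauses) + "."


print(describe_row(row))
```

Such a direct rendering stays faithful to the data but still reads like a list, which is why the approaches described next add either a user-defined outline or a language model.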
To achieve this objective, a first solution is provided by Descrxp, a tool currently being developed in conjunction with Xper. The user specifies the desired key outline for the output, and Descrxp fills this canvas using the database contents. While this approach is highly reliable and adaptable to various contexts, its drawback is that it is labor-intensive and requires significant human input and oversight (Fig.
On the left, a key outline defining how to describe the male genitalia (case of an Xper3 database on phlebotomine sandflies, Diptera). On the right, three text descriptions generated from this template, based on data from three different species.
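The outline-filling principle can be sketched with Python's standard `string.Template`. This is not Descrxp's actual syntax or implementation, which the abstract does not detail: the outline text, the placeholder names, the species records, and the function `fill_outline` are assumptions made for illustration.

```python
from string import Template

# Hypothetical outline written once by the user; each $placeholder
# names a character of the knowledge base (names are invented).
outline = Template(
    "Male genitalia: style with $style_spines spines; "
    "paramere $paramere_shape."
)

# Invented records standing in for database contents.
species_data = {
    "Species A": {"style_spines": "four", "paramere_shape": "simple"},
    "Species B": {"style_spines": "five", "paramere_shape": "bifurcated"},
}


def fill_outline(outline, records):
    """Fill the same outline with the data of each species.

    Template.substitute raises KeyError if a character is missing,
    so gaps in the data are reported instead of silently skipped.
    """
    return {
        name: outline.substitute(values)
        for name, values in records.items()
    }


for name, text in fill_outline(outline, species_data).items():
    print(f"{name}. {text}")
```

The sketch also shows where the human effort goes: each outline has to be written and maintained by hand for every group of characters to be described.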
The second solution uses AI. Data on both well-known and more obscure taxa have been tested with ChatGPT 3.5 or 4.0. ChatGPT succeeds at generating natural language descriptions for well-known taxa (Fig.
On data related to common species (e.g., in Pinus), ChatGPT produces satisfactory texts (input on the left, output on the right).
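One way to hand structured data to a language model is to serialize it into the prompt together with explicit instructions. The sketch below only builds such a prompt and does not call any model. The prompt wording, the example taxon and characters, and the function `build_prompt` are assumptions; the abstract does not specify the prompts actually used.

```python
import json


def build_prompt(taxon, characters):
    """Serialize one taxon description into a prompt string."""
    payload = json.dumps(
        {"taxon": taxon, "characters": characters},
        indent=2,
        sort_keys=True,  # stable key order, so prompts are reproducible
    )
    return (
        "Write a short natural-language description of the taxon below.\n"
        "Use only the characters and states provided; do not add facts.\n\n"
        f"{payload}"
    )


# Invented input, for illustration only.
prompt = build_prompt(
    "Example pine",
    {"needles per fascicle": ["2"], "cone shape": ["ovoid"]},
)
print(prompt)
```

Restricting the model to the supplied states is a design choice intended to limit invented content; whether a given model respects that instruction still has to be checked against the source data.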
However, when it comes to more obscure taxa, less represented on the internet, the results can be more inconsistent and unpredictable (Fig.
Example with a database on Archaeocyatha. (A) Genus sheet from Xper3. (B) Description generated by ChatGPT 4.0 based on a knowledge base whose data are structured with several levels of dependencies. (C) Description generated by ChatGPT 4.0 based on a knowledge base whose data have a minimal hierarchical structure.
However, if more careful attention is given to the output of ChatGPT (Fig.
On the left: same description generated with ChatGPT 4.0 as Fig.
To go further, taxa could be compared with one another to produce a text that highlights the remarkable features of a taxon. Moreover, several lines of the initial matrix could be synthesized to extract generalized descriptions of groups of taxa, such as the description of a genus based on its included species.
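Both ideas can be sketched on a small taxon-by-character matrix. The matrix contents and the functions `distinctive_features` and `summarize_group` are hypothetical, and the rules used here (states found in no other taxon, and the union of states across species) are simple assumptions rather than the project's chosen method.

```python
# Invented matrix: taxon -> character -> set of scored states.
matrix = {
    "Species A": {"wall": {"porous"}, "septa": {"present"}},
    "Species B": {"wall": {"porous"}, "septa": {"absent"}},
    "Species C": {"wall": {"porous"}, "septa": {"absent"}},
}


def distinctive_features(matrix, taxon):
    """Return the states of `taxon` that occur in no other taxon."""
    result = {}
    for character, states in matrix[taxon].items():
        others = set()
        for name, row in matrix.items():
            if name != taxon:
                others |= row.get(character, set())
        unique = states - others
        if unique:
            result[character] = unique
    return result


def summarize_group(matrix):
    """Collect, per character, every state seen across the rows."""
    summary = {}
    for row in matrix.values():
        for character, states in row.items():
            summary.setdefault(character, set()).update(states)
    return summary


print(distinctive_features(matrix, "Species A"))
print(summarize_group(matrix))
```

The output of either function could then be passed to an outline or a language model to be turned into prose.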
Using data from FLOW, FishBase, and Xper3, the French ACDC (Counterfactual Learning for Controlled Data-to-text) project is creating a variety of text outputs. These range from "fill-in-the-blank" canvases to fully AI-generated content, using data matrices derived from these pilot databases.
Keywords: AI, data-to-text, Xper3
Presenting author: Adeline Kerner
Presented at: SPNHC-TDWG 2024
Funding: ACDC (Counterfactual Learning for Controlled Data-to-text) project, ANR-21-CE23-0007