Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Peter Cornwell (peter.cornwell@data-futures.org)
Received: 13 Sep 2023 | Published: 14 Sep 2023
© 2023 Peter Cornwell
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Cornwell P (2023) Progress with Repository-based Annotation Infrastructure for Biodiversity Applications. Biodiversity Information Science and Standards 7: e112707. https://doi.org/10.3897/biss.7.112707
Rapid development of text analysis technologies since the 1980s has led not only to the widespread use of text 'mining', but also to now-pervasive artificial intelligence (AI) applications based on large language models. However, building new, concise data resources from historic as well as contemporary scientific literature, resources that can be used efficiently at scale by automated processes and that have long-term value for the research community, has proved more elusive.
Efforts at codifying analyses, such as the Text Encoding Initiative (TEI), date from the early 1990s and were initially driven by the social sciences and humanities (SSH) and linguistics communities, and extended with multiple XML-based tagging schemes, including in biodiversity (
This continual evolution has made it particularly challenging to preserve investment in annotation, and especially the connections between annotations and their context in source literature. Infrastructure that entered service during the intervening years does not yet support the W3C Web Annotation Data Model (WADM), and has only recently started to address the parallel emergence of page imagery-based standards such as the International Image Interoperability Framework (IIIF). Notably, IIIF instruments such as Mirador-2, which has been widely used for manual creation and editing of annotations in SSH, continue to employ the now-deprecated Open Annotation Data Model (OADM). Although multiple efforts now address combining IIIF and TEI text coordinate systems, the two are currently fundamentally incompatible.
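The gap between the two coordinate systems can be illustrated with a minimal sketch. It assumes WADM JSON-LD serialisation and uses hypothetical URIs and helper names that do not come from the abstract. One annotation is anchored to a pixel rectangle on an IIIF canvas, the other to a character range in a text document. The same tag value appears in both, but neither selector can be derived from the other without an external alignment between page image and text.

```python
WADM_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def image_region_annotation(canvas_uri, x, y, w, h, value):
    """Build a WADM annotation targeting a pixel rectangle on an IIIF canvas."""
    return {
        "@context": WADM_CONTEXT,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": value},
        "target": {
            "source": canvas_uri,
            "selector": {
                # Media-fragment selector: coordinates are image pixels.
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


def text_span_annotation(document_uri, start, end, value):
    """Build a WADM annotation targeting a character range in a text document."""
    return {
        "@context": WADM_CONTEXT,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": value},
        "target": {
            "source": document_uri,
            "selector": {
                # Text selector: coordinates are character offsets.
                "type": "TextPositionSelector",
                "start": start,
                "end": end,
            },
        },
    }


# Hypothetical URIs, for illustration only.
on_image = image_region_annotation(
    "https://example.org/iiif/canvas/p12", 120, 340, 600, 80, "Madagascar"
)
in_text = text_span_annotation(
    "https://example.org/tei/report.xml", 1510, 1520, "Madagascar"
)
```

Both structures are valid targets within a single data model, which is why a WADM-native repository can hold them side by side even though the underlying coordinate systems do not map onto each other.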
However, emerging repository technologies enable preservation of annotation investment to be accomplished comprehensively for the first time. Native IIIF support enables interactive previewing of annotations within repository graphical user interfaces, and dynamic serialisation technologies provide compatibility with existing XML-based infrastructures. Repository access controls can permit experts to trace annotation sources in original texts even if the literature is not publicly accessible, e.g., due to copyright restriction. This is of paramount importance for two reasons. First, surrounding context can be crucial to qualify formal terms that have been annotated, such as collecting country. Second, contemporary automated text mining, which is essential for operation at the scale of known biodiversity literature, is not 100% accurate, and manual checking of uncertainties is currently essential. Ongoing improvement of language analysis tools through AI integration offers significant future gains from reprocessing literature and updating annotation data resources. Nevertheless, without effective preservation of digitised literature as well as annotations, this enrichment will not be possible, and today's investments in gathering together and analysing scientific literature will be devalued or lost.
We report new functionality included in the InvenioRDM*
Moreover, an annotation service based on the WADM-native Mirador-3 FOSS IIIF viewer has now been developed and will enter service with ZenodoRDM. This enables editing of biodiversity annotations from within the repository interface, as well as automated updating of taxonomic information products provided to other major infrastructures such as GBIF.
Two aspects of this ZenodoRDM annotation service are presented:
Workflows for editing existing biodiversity annotations, as well as creating new ones, need to be tailored to specific tasks (e.g., unifying geographic collecting location definitions in historic reports) using controlled vocabularies and configurable dialogs for contributors. Selectively populating workflows with annotations according to a task definition is also important, to avoid cluttering the editing GUI with non-essential information. On completion of a task, updated annotations are integrated into a new annotation collection before repository records are updated.
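The selection and collection steps described here might be sketched as follows. This is an illustration under stated assumptions, not the ZenodoRDM implementation: the task definition format, the tag names, and the function names are all hypothetical, and annotations are assumed to be WADM JSON-LD objects whose bodies include a `TextualBody` with `purpose` set to `tagging`.

```python
import copy


def tags_of(annotation):
    """Return the set of tag values carried by an annotation's tagging bodies."""
    bodies = annotation.get("body", [])
    if isinstance(bodies, dict):  # WADM allows a single body or a list
        bodies = [bodies]
    return {
        b["value"]
        for b in bodies
        if isinstance(b, dict) and b.get("purpose") == "tagging" and "value" in b
    }


def select_for_task(annotations, task):
    """Keep only annotations relevant to the task (hypothetical task format)."""
    wanted = set(task["tags"])
    return [a for a in annotations if tags_of(a) & wanted]


def build_collection(collection_id, label, annotations):
    """Wrap annotations in a new WADM AnnotationCollection with one embedded page."""
    items = copy.deepcopy(annotations)  # leave the caller's objects untouched
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": collection_id,
        "type": "AnnotationCollection",
        "label": label,
        "total": len(items),
        "first": {
            "id": f"{collection_id}/page1",
            "type": "AnnotationPage",
            "items": items,
        },
    }


def make_annotation(tag, value):
    """Small helper for the example data below (hypothetical tag names)."""
    return {
        "type": "Annotation",
        "body": [
            {"type": "TextualBody", "purpose": "tagging", "value": tag},
            {"type": "TextualBody", "purpose": "describing", "value": value},
        ],
    }


all_annotations = [
    make_annotation("collectingCountry", "Madagascar"),
    make_annotation("taxonName", "Propithecus verreauxi"),
    make_annotation("collectingLocality", "Fort Dauphin"),
]
task = {"name": "unify-locations", "tags": ["collectingCountry", "collectingLocality"]}

editable = select_for_task(all_annotations, task)
result = build_collection(
    "https://example.org/collections/unify-locations-1", task["name"], editable
)
```

Only the two location annotations reach the editing step; the taxon name annotation stays out of the GUI, and the edited set is published as a fresh collection rather than overwriting the source annotations.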
Current work on annotation workflows for SSH applications is also reported. The ZenodoRDM biodiversity annotation service implements a generic repository micro-service API, and the implementation of similar services for other repository software platforms is discussed.
Keywords: biodiversity literature, IIIF, InvenioRDM, WADM, Zenodo
Presenting author: Peter Cornwell
Presented at: TDWG 2023