Biodiversity Information Science and Standards :
Conference Abstract
|
Corresponding author: Mariya Dimitrova (m.dimitrova@pensoft.net)
Received: 28 Sep 2020 | Published: 28 Sep 2020
© 2020 Mariya Dimitrova, Jorrit Poelen, Georgi Zhelezov, Teodor Georgiev, Donat Agosti, Lyubomir Penev
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Dimitrova M, Poelen J, Zhelezov G, Georgiev T, Agosti D, Penev L (2020) Semantic Publishing Enables Text Mining of Biotic Interactions. Biodiversity Information Science and Standards 4: e59036. https://doi.org/10.3897/biss.4.59036
|
Introduction
Scholarly literature is the primary source for biodiversity knowledge based on observations, field work, analysis and taxonomic classification. Publishing such literature in semantically enhanced formats (e.g., through Extensible Markup Language (XML) tagging) helps to make this knowledge easily accessible and available to humans and actionable by computers. A recent collaboration between Pensoft Publishers and Global Biotic Interactions (GloBI) (
Collaborative workflow for extraction of biotic interactions from tables in scholarly articles.
Methods
Biotic interactions were extracted from scholarly literature tables published in several biodiversity journals from Pensoft. Semantically enhanced publications were processed to extract the tables from the article XMLs. There were 6993 tables from 21 different journals. Using the Pensoft Annotator, a text-to-ontology mapping tool, we were able to detect tables that could contain biotic interactions. The Pensoft Annotator was used together with a modified subset of the OBO Foundry Relation Ontology (RO), concentrating on the term labeled ‘biotically interacts with’ and all its children. The contents and captions of all tables were run through the Pensoft Annotator, which returned the matching ontology terms and their position in the text.
The resulting subset of tables was then processed by GloBI, which parsed the tables to extract the taxonomic names participating in each interaction. The GloBI workflow also generated table citations by SPARQL queries to the OpenBiodiv triple store where all table and article metadata are stored (
Results
Annotation of biotic interactions via the Pensoft Annotator helped to identify 233 tables possibly containing biotic interactions out of the 6993 tables that were processed. Semantic annotation of taxonomic names within tables allowed GloBI to index the species including their complete taxonomic hierarchies. Currently, GloBI has indexed 2378 interactions, extracted from a subset of 46 of the 233 tables. Interactions extracted via this workflow are available on a special webpage on GloBI's website. Records of the communication behind this collaborative work between GloBI and Pensoft are publically available.
Discussion & Conclusion
One of the limitations of the workflow was the inability to detect the directionality of the interactions. In other words, the tables do not contain information about the subject and object of a given interaction. For instance, in a host-parasite interaction, we can not automatically detect which species is the host and which is the parasite. We plan to address this issue by performing semantic analysis (e.g., part-of-speech tagging) of the table captions to determine the exact subjects and objects in the interactions. In addition, complicated table structures impeded both the processing of tables by the Pensoft Annotator and their parsing by GloBI’s algorithms. We recognise the importance of adopting common formats for sharing interaction data, a practice that would greatly improve the post-publication indexing of tables by GloBI. An example of a standardised table structure is the standard table template for primary biodiversity data, introduced by Pensoft (
species interactions, semantics, publishing, FAIR data, GloBI, text mining
Mariya Dimitrova
TDWG 2020
This research has received partial funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 764840.