Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Franck Michel (franck.michel@inria.fr)
Received: 18 Aug 2022 | Published: 23 Aug 2022
© 2022 Franck Michel, Anne Toulet, Anna Bobasheva, Marie-Claude Deboin, Sébastien Dupré, Aline Menin, Marco Winckler, Andon Tchechmedjiev
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Michel F, Toulet A, Bobasheva A, Deboin M-C, Dupré S, Menin A, Winckler M, Tchechmedjiev A (2022) Semantic Indexing of Open Scientific Literature to Help Users Discover and Navigate through Publications Networks. Biodiversity Information Science and Standards 6: e93640. https://doi.org/10.3897/biss.6.93640
In recent years, several developments have drastically transformed the way researchers and scientific and technical information (STI) services interact with scientific literature. The volume and pace of publications are skyrocketing, whether in journals and conferences or through preprint repositories (e.g., arxiv.org), making it increasingly difficult to keep up with, find and make sense of relevant articles. Furthermore, the specialization of research communities makes it difficult to discover cross-disciplinary knowledge, which is essential to meet the growing demand of funding agencies for interdisciplinary projects. Scientific open archives are central in this landscape; however, the keyword-based search services that they usually provide fail to grasp the semantic relationships between articles. New tools are therefore needed to help users find their way through this mass of knowledge.
In this talk, we wish to present the methods, tools and services implemented in the ISSA*
The construction of the semantic index involves several artificial intelligence techniques: natural language processing, knowledge engineering and Semantic Web technologies. These techniques are used to process the publications' metadata and text in order to automatically extract thematic descriptors and named entities. The descriptors and named entities are then linked to knowledge bases such as Wikidata, DBpedia and GeoNames, or to domain-specific terminological resources suited to the archive's domain. The semantic index, linked with these third-party resources, serves as a keystone for the development of rich search and visualization tools aimed at researchers and/or STI professionals.
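As a rough illustration of what linking an article to knowledge-base entities could produce, the sketch below emits N-Triples-style statements. It is not the ISSA pipeline: the dictionary lookup stands in for a real entity-linking step, the `example.org` URIs and the names `ENTITY_URIS`, `Triple` and `index_article` are hypothetical, and the choice of the Dublin Core `dcterms:subject` property is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List

# Dublin Core property used here to relate an article to a topic (assumed choice).
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Hypothetical lookup table standing in for an entity-linking step.
# In a real index the URIs would come from Wikidata, DBpedia, GeoNames
# or a domain thesaurus; these example.org URIs are placeholders.
ENTITY_URIS = {
    "climate change": "http://example.org/kb/climate-change",
    "food security": "http://example.org/kb/food-security",
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize the triple as one N-Triples line (URIs only)."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


def index_article(article_uri: str, text: str) -> List[Triple]:
    """Return one triple per known label found in the article text."""
    lowered = text.lower()
    return [
        Triple(article_uri, DCT_SUBJECT, uri)
        for label, uri in sorted(ENTITY_URIS.items())
        if label in lowered
    ]


if __name__ == "__main__":
    for triple in index_article(
        "http://example.org/article/1",
        "Climate change threatens food security.",
    ):
        print(triple.to_ntriples())
```

Plain substring matching is used only to keep the example self-contained; a production pipeline would rely on dedicated named-entity recognition and disambiguation tools.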
We demonstrated the effectiveness of this solution on the use case of Agritrop, an institutional archive specialized in the fields of agronomy, biodiversity and sustainable development, which holds 110,000+ resources including 12,000 open access articles. In this context, the Agrovoc multilingual thesaurus was used as a domain-specific reference vocabulary. Fig.
Association rule stating that articles mentioning concepts COVID-19 and food security (a) also frequently mention the pandemics concept (b).
Exploration of the relationship between concepts health and climate change or any of their sub-concepts.
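An association rule such as the one in the caption above (COVID-19 and food security implying pandemics) is commonly characterized by its support and confidence. The following sketch computes both over a toy set of articles; the data, the `ARTICLES` constant and the `rule_metrics` function are invented for illustration and do not reflect the Agritrop results or the way ISSA mines its rules.

```python
from typing import FrozenSet, Iterable, Tuple

# Toy data: each article is represented by the set of concepts it mentions.
ARTICLES = [
    frozenset({"COVID-19", "food security", "pandemics"}),
    frozenset({"COVID-19", "food security", "pandemics"}),
    frozenset({"COVID-19", "food security"}),
    frozenset({"climate change", "health"}),
]


def rule_metrics(
    articles: Iterable[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    articles = list(articles)
    # Articles that mention every concept of the antecedent.
    with_antecedent = [a for a in articles if antecedent <= a]
    # Among those, articles that also mention every concept of the consequent.
    with_both = [a for a in with_antecedent if consequent <= a]
    support = len(with_both) / len(articles) if articles else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    support, confidence = rule_metrics(
        ARTICLES,
        frozenset({"COVID-19", "food security"}),
        frozenset({"pandemics"}),
    )
    print(f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy set, three articles match the antecedent and two of them also mention pandemics, so the rule has a support of 0.5 and a confidence of 2/3.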
Designed as a generic, transferable solution, the pipeline and visualization tools delivered by ISSA could be readily adapted to open archives of biodiversity literature. Typically, terminological references such as Darwin Core Terms, Access to Biological Collection Data (ABCD), open Digital Specimens (openDS) and the Audubon Core Metadata Schema, as well as various taxonomic registries, could be considered for describing an article's metadata or for linking thematic descriptors and named entities. From there, the proposed visualization techniques could be reconfigured to explore the articles of a biodiversity open archive and answer various competency questions, for instance: What are the articles that mention a taxon or any of its child taxa? What are the museums/institutions that are most frequently mentioned together with certain taxonomic groups? What are the research topics that frequently co-occur with climate change, and how do these topics evolve through the years? What public policies frequently occur in articles that mention endangered species? Furthermore, the pipeline could be extended with existing third-party tools to carry out, e.g., the extraction of relationships between entities or the reconciliation of authors' names.
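The first competency question (articles that mention a taxon or any of its child taxa) amounts to a traversal of the taxonomic hierarchy followed by a filter on article annotations. Below is a minimal in-memory sketch, assuming a parent-to-children mapping and a per-article set of mentioned taxa. The article identifiers, constants and function names are hypothetical; against an RDF knowledge graph, the same question would more naturally be expressed as a SPARQL query using a property path.

```python
from typing import Dict, List, Set

# Toy taxonomic hierarchy: parent taxon -> direct child taxa.
CHILDREN: Dict[str, List[str]] = {
    "Felidae": ["Panthera", "Felis"],
    "Panthera": ["Panthera leo", "Panthera tigris"],
    "Felis": ["Felis catus"],
}

# Hypothetical annotations: article identifier -> taxa it mentions.
MENTIONS: Dict[str, Set[str]] = {
    "article-1": {"Panthera leo"},
    "article-2": {"Felis catus"},
    "article-3": {"Canis lupus"},
}


def descendants(taxon: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the taxon itself plus all taxa below it in the hierarchy."""
    found = {taxon}
    stack = [taxon]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in found:  # guards against cycles in noisy data
                found.add(child)
                stack.append(child)
    return found


def articles_mentioning(
    taxon: str,
    children: Dict[str, List[str]],
    mentions: Dict[str, Set[str]],
) -> List[str]:
    """Return sorted ids of articles mentioning the taxon or a child taxon."""
    wanted = descendants(taxon, children)
    return sorted(art for art, taxa in mentions.items() if wanted & taxa)


if __name__ == "__main__":
    print(articles_mentioning("Panthera", CHILDREN, MENTIONS))
    print(articles_mentioning("Felidae", CHILDREN, MENTIONS))
```

Querying at a higher rank simply widens the set of matching taxa, which is why the query on Felidae also retrieves the article that mentions only Felis catus.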
Keywords: data indexing, knowledge graph, data visualization, scientific archive
Presenting author: Franck Michel
Presented at: TDWG 2022
FAIR: Findable, Accessible, Interoperable, Reusable
* ISSA stands for Semantic Indexing of a Scientific Archive and Associated Services