NEARSIDE: Structured kNowledge Extraction frAmework from SpecIes DEscriptions

Maya Sahraoui (1) (sahraoui@isir.upmc.fr), Marc Pignal (2) (https://orcid.org/0000-0002-6772-9299), Régine Vignes Lebbe (3), Vincent Guigue (1)

(1) ISIR, Paris, France
(2) MNHN, Paris, France
(3) Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France

Conference Abstract. Session: SYM04 - Sharing and visualizing species data and information
Biodiversity Information Science and Standards (BISS) 6: e94297 (2022). ISSN 2535-0897. Pensoft Publishers. DOI: 10.3897/biss.6.94297

Copyright: Maya Sahraoui, Marc Pignal, Régine Vignes Lebbe, Vincent Guigue. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Species descriptions are stored as text in corpora such as floras and faunas, but this large body of information cannot be used directly by algorithms, nor can it be linked to other data sources. Building knowledge bases of structured data can benefit from collaborative, easy-to-use platforms like Xper3 (Vignes-Lebbe et al. 2017, Kerner and Vignes 2019, Saucède et al. 2021), but doing so by hand is very time-consuming. It is therefore necessary to make the information contained in species descriptions measurable and usable by computational methods.
One of the most used data structures on the web and by the deep learning community is the triplet structure. Each piece of information is represented by a set of 3 elements (subject, predicate, object). One of the first steps towards species information accessibility is developing a text-to-triplet model, also known as text-to-graph, for monograph descriptions.
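As a minimal illustration of the (subject, predicate, object) structure described above, the sketch below stores a few morphological facts as triples. The predicate names and the example facts are hypothetical and chosen only for illustration; they are not taken from the NEARSIDE dataset.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One piece of information: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts that a description such as
# "Leaves ovate, glabrous; petals white." could yield.
triples = [
    Triple("leaf", "has_shape", "ovate"),
    Triple("leaf", "has_surface", "glabrous"),
    Triple("petal", "has_colour", "white"),
]


def objects_of(items: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in items if t.subject == subject and t.predicate == predicate]


print(objects_of(triples, "leaf", "has_shape"))  # prints ['ovate']
```

Because every fact has the same three-slot shape, triples from different corpora can be merged into a single graph and queried uniformly.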
In this work, we developed NEARSIDE, a text-to-graph model adapted to biology corpora to create normalized morphological characteristic knowledge bases for species descriptions.
In Natural Language Processing, deep learning models have proven effective at extracting knowledge from open-domain corpora (Lample et al. 2016, Sutskever et al. 2014), especially since the emergence of attention-based models (Devlin et al. 2019a, Devlin et al. 2019b). Several works have also addressed biomedical corpora (Fries et al. 2017, Cho and Lee 2019). In our case, we propose a model adapted to floras.
Fully supervised deep learning models require a large amount of annotated data for training. However, annotating text for the text-to-triplet task requires costly human intervention. Distant supervision is a technique that can reduce this cost: it uses a small annotated glossary to project classes at the word level onto new, longer and more complex texts (see Fig. 1).
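The glossary projection described above can be sketched as a simple dictionary lookup over tokens. This is only an illustrative assumption about how distant annotation might be implemented: the glossary entries, the class names and the tokenization rule are hypothetical and do not come from the NEARSIDE dataset.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical glossary mapping terms to classes (illustrative names only).
GLOSSARY: Dict[str, str] = {
    "leaf": "ORGAN",
    "leaves": "ORGAN",
    "petal": "ORGAN",
    "petals": "ORGAN",
    "ovate": "SHAPE",
    "glabrous": "SURFACE",
    "white": "COLOUR",
}


def distant_annotate(text: str, glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Label each token with its glossary class, or "O" if it is not in the glossary."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [(tok, glossary.get(tok.lower(), "O")) for tok in tokens]


labels = distant_annotate("Leaves ovate, glabrous; petals white.", GLOSSARY)
for token, label in labels:
    print(token, label)
```

Labels produced this way are noisy: words missing from the glossary are tagged "O" even when they are relevant, which is one reason a trained model is still needed on top of the projected annotations.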
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that consists of extracting and classifying words of interest from a text (Sutskever et al. 2014, Devlin et al. 2019b, Lample et al. 2016), while triplet extraction can be compared to the Relation Extraction (RE) task, which consists of extracting both the words and the semantic relations between pairs of words. Distantly supervised NER has been studied more often in the literature than distantly supervised RE (Liang et al. 2020, Meng et al. 2021), because NER is a subtask of RE and distant annotations are less expensive to generate for NER (see Fig. 2).
Our first contribution is a distantly annotated species description dataset for Named Entity Recognition, with a well-balanced test set that allows us to bypass several biases that can be induced by distant annotation and that are often observed in NER datasets (Taillé et al. 2021). In this dataset, each word of interest is classified into one of 15 classes, each class being a specific kind of organ or descriptor.
Our second contribution is a distantly supervised model trained on our dataset. Since fauna and flora corpora are particularly long and use a very specific technical vocabulary, we develop a context-oriented model adapted to this data by pretraining the language model. The encoder of our model thus provides contextualized vectors for each extracted word, which can be used to measure description similarities between different species. Our model reaches 96% accuracy in named entity classification on the test set.
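One possible way to compare two descriptions from per-word vectors is to average the vectors of each description and take the cosine of the two averages. The sketch below assumes this approach; mean pooling, cosine similarity and the toy two-dimensional vectors are illustrative assumptions, not the method or the encoder outputs reported in this abstract.

```python
import math
from typing import List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length word vectors into one description vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-ins for contextualized word vectors of three descriptions.
species_a = [[1.0, 0.0], [0.0, 1.0]]
species_b = [[1.0, 1.0]]
species_c = [[1.0, -1.0]]

sim_ab = cosine(mean_pool(species_a), mean_pool(species_b))
sim_ac = cosine(mean_pool(species_a), mean_pool(species_c))
print(sim_ab, sim_ac)
```

In this toy example the pooled vectors of the first two descriptions point in the same direction, so their similarity is high, while the third is orthogonal to the first.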
Our third contribution is the triplet construction module that can directly be applied to our model's outputs. This module is based on class dependency rules that are inspired by Xper3’s data representation format (see Fig. 3).
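To make the idea of class dependency rules concrete, the sketch below attaches each descriptor to the most recently mentioned organ and derives the predicate from the descriptor's class. The rule table, class names and predicate names are hypothetical; the actual rules of the module and the Xper3 format are not specified in this abstract.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule table: descriptor class -> predicate linking it to an organ.
DESCRIPTOR_PREDICATE: Dict[str, str] = {
    "SHAPE": "has_shape",
    "SURFACE": "has_surface",
    "COLOUR": "has_colour",
}
ORGAN_CLASSES = {"ORGAN"}


def build_triples(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Turn (word, class) pairs into (organ, predicate, descriptor) triples."""
    triples: List[Tuple[str, str, str]] = []
    current_organ: Optional[str] = None
    for word, label in tagged:
        if label in ORGAN_CLASSES:
            # A new organ becomes the subject of the descriptors that follow.
            current_organ = word.lower()
        elif label in DESCRIPTOR_PREDICATE and current_organ is not None:
            triples.append((current_organ, DESCRIPTOR_PREDICATE[label], word.lower()))
    return triples


# Hypothetical NER output for "Leaves ovate, glabrous; petals white."
tagged = [
    ("Leaves", "ORGAN"), ("ovate", "SHAPE"), (",", "O"),
    ("glabrous", "SURFACE"), (";", "O"),
    ("petals", "ORGAN"), ("white", "COLOUR"), (".", "O"),
]
result = build_triples(tagged)
for triple in result:
    print(triple)
```

A rule-based step like this needs no additional training data, since it only consumes the classes predicted by the NER model.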
Finally, NEARSIDE is an end-to-end framework for extracting structured knowledge from unstructured species description corpora, and it can be applied to several data sources. It thus makes species descriptions from different corpora easy to link, compare and measure.
Keywords: Natural Language Processing, Artificial Intelligence, species identification, biodiversity
Presenting author: Maya Sahraoui
References
Cho H, Lee H (2019) Biomedical named entity recognition using deep neural networks with contextual information. BMC Bioinformatics 20 (1). https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3321-4 DOI: 10.1186/s12859-019-3321-4
Devlin J, Chang M-W, Lee K, Toutanova K (2019a) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. http://arxiv.org/abs/1810.04805
Devlin J, Chang M-W, Lee K, Toutanova K (2019b) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. http://arxiv.org/abs/1810.04805
Fries J, Wu S, Ratner A, Ré C (2017) SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data. arXiv:1704.06360 [cs]. http://arxiv.org/abs/1704.06360
Kerner A, Vignes R (2019) Multi-context Knowledge Base using Calculated Descriptors from Xper3: the Archaeocyaths Knowledge Base example. Biodiversity Information Science and Standards 3. https://hal.archives-ouvertes.fr/hal-02557316 DOI: 10.3897/biss.3.37083
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural Architectures for Named Entity Recognition. arXiv. Comment: Proceedings of NAACL 2016. http://arxiv.org/abs/1603.01360
Liang C, Yu Y, Jiang H, Er S, Wang R, Zhao T, Zhang C (2020) In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://dl.acm.org/doi/10.1145/3394486.3403149 ISBN: 978-1-4503-7998-4. DOI: 10.1145/3394486.3403149
Meng Y, Zhang Y, Huang J, Wang X, Zhang Y, Ji H, Han J (2021) Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training. arXiv. Comment: EMNLP 2021. (Code: https://github.com/yumeng5/RoSTER). http://arxiv.org/abs/2109.05003
Saucède T, Eléaume M, Jossart Q, Moreau C, Downey R, Bax N, Sands C, Mercado B, Gallut C, Vignes-Lebbe R (2021) Taxonomy 2.0: computer-aided identification tools to assist Antarctic biologists in the field and in the laboratory. Antarctic Science 33 (1): 39-51. https://www.cambridge.org/core/journals/antarctic-science/article/abs/taxonomy-20-computeraided-identification-tools-to-assist-antarctic-biologists-in-the-field-and-in-the-laboratory/519BC16BF54C6900ABCB849A2379DAE1 DOI: 10.1017/S0954102020000462
Sutskever I, Vinyals O, Le QV (2014) Sequence to Sequence Learning with Neural Networks. arXiv. Comment: 9 pages. http://arxiv.org/abs/1409.3215
Taillé B, Guigue V, Scoutheeten G, Gallinari P (2021) Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction. arXiv. Comment: Accepted at EMNLP 2021. http://arxiv.org/abs/2109.12008
Vignes-Lebbe R, Bouquin S, Kerner A, Bourdon E (2017) Desktop or remote knowledge base management systems for taxonomic data and identification keys: Xper2 and Xper3. Proceedings of TDWG 1. https://hal.archives-ouvertes.fr/hal-02557327 DOI: 10.3897/tdwgproceedings.1.19911
Figure 1. Illustration of the distant annotation technique applied to textual data.