Proceedings of TDWG: Conference Abstract
Corresponding author: Evangelos Pafilis (pafilis@hcmr.gr)
Received: 09 Aug 2017 | Published: 10 Aug 2017
© 2017 Evangelos Pafilis, Rūdolfs Bērzinš, Christos Arvanitidis, Lars Jensen
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Pafilis E, Bērzinš R, Arvanitidis C, Jensen L (2017) EXTRACT 2.0: interactive identification of biological entities mentioned in text to assist database curation and knowledge extraction. Proceedings of TDWG 1: e20152. https://doi.org/10.3897/tdwgproceedings.1.20152
Data curation occurs in many areas of biodiversity research, e.g. in the digitization of specimen collections and the extraction of species occurrences from the legacy literature. It is typically time-consuming and tedious. Gathering information on species and exposing it via search interfaces becomes easier once phrases of interest have been recognized and the mentioned entities have been linked to community resources.
A curator can benefit from interactive systems that highlight biological entities in a document to indicate sections of interest, map those entities to corresponding database records or ontology terms, and offer an easy mechanism for extracting annotations in a structured form.
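As an illustration of the three steps just described (locating entity mentions, mapping them to identifiers, and exporting structured annotations), the following minimal Python sketch uses plain dictionary matching. It is not a description of how any particular tool is implemented: the dictionary contents, the `Annotation` type, and the `tag` and `to_tsv` functions are all hypothetical names introduced here, and the environment identifier is a placeholder rather than a real ontology term.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # the mention as it appears in the document
    entity_type: str  # e.g. "organism" or "environment"
    identifier: str   # database record or ontology term


# Hypothetical toy dictionary: surface form -> (entity type, identifier).
# The environment identifiers are placeholders, not real ontology terms.
TOY_DICTIONARY = {
    "escherichia coli": ("organism", "NCBITaxon:562"),
    "seawater": ("environment", "EXAMPLE:env-1"),
    "hydrothermal vent": ("environment", "EXAMPLE:env-2"),
}


def tag(text, dictionary=TOY_DICTIONARY):
    """Return all case-insensitive, whole-word dictionary matches in text.

    Overlapping matches are not resolved; this is a sketch only.
    """
    annotations = []
    for name, (entity_type, identifier) in dictionary.items():
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(match.start(), match.end(), match.group(0),
                           entity_type, identifier)
            )
    # NamedTuples sort field by field, so this orders by start offset first.
    return sorted(annotations)


def to_tsv(annotations):
    """Serialize annotations as tab-separated text with a header row."""
    header = "start\tend\ttext\ttype\tidentifier"
    rows = ["\t".join(map(str, a)) for a in annotations]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    sentence = ("Escherichia coli was isolated from seawater "
                "near a hydrothermal vent.")
    print(to_tsv(tag(sentence)))
```

The character offsets are what an interactive front end would need in order to highlight mentions in place, while the tab-separated export stands in for the structured form a curator might paste into a record.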
EXTRACT (https://extract.hcmr.gr,
EXTRACT was originally developed specifically to facilitate metagenomic sample record annotation (
The latest version of EXTRACT (2.0,
In addition to curators benefitting from such a tool, knowledge-base developers can easily integrate the EXTRACT functionality into their own systems. To this end, we provide a robust and thoroughly documented Application Programming Interface (https://extract.hcmr.gr, FAQ section). EXTRACT can thus serve as a building block in large knowledge management pipelines, which also perform downstream tasks such as statistical entity association and association extraction, knowledge graph generation presenting the extracted associations, document indexing and information retrieval.
Such tasks lie at the core of the workshop this abstract has been submitted to and are pertinent to the TDWG 2017 theme, which is dedicated to the integration of species occurrence, gene, phenotype, and environment associations.
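One of the downstream tasks mentioned above, statistical entity association, can be sketched with a simple document-level co-occurrence score. The example below assumes that each document has already been reduced to a list of entity identifiers and uses pointwise mutual information (PMI) as the association measure. This is an illustrative choice, not a method stated in this abstract, and the function names and toy identifiers are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, per entity and per entity pair, the documents they occur in.

    docs is an iterable of lists of entity identifiers, one list per document.
    Pairs are stored as alphabetically sorted tuples.
    """
    single = Counter()
    pair = Counter()
    for entities in docs:
        unique = sorted(set(entities))  # count each entity once per document
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair


def pmi(a, b, single, pair, n_docs):
    """Pointwise mutual information (base 2) of entities a and b.

    PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    as document frequencies. Returns -inf if the pair never co-occurs.
    """
    joint = pair[tuple(sorted((a, b)))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n_docs / (single[a] * single[b]))


if __name__ == "__main__":
    # Hypothetical tagged documents; identifiers are placeholders.
    docs = [["org:A", "env:B"], ["org:A", "env:B"], ["org:A", "env:C"], ["org:D"]]
    single, pair = cooccurrence_counts(docs)
    print(pmi("org:A", "env:B", single, pair, len(docs)))
```

A positive score indicates that two entities are mentioned together more often than expected if they were independent. Scores based on very few documents are unreliable, so a real pipeline would likely add smoothing or a minimum-count threshold.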
text mining, named entity recognition, interactive curation, metadata, genes, proteins, organisms, environments
Evangelos Pafilis