Biodiversity Information Science and Standards : Conference Abstract
A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs
Claus Weiland, Maxat Kulmanov§, Marco Schmidt‡,|, Robert Hoehndorf§
‡ Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
§ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| Palmengarten der Stadt Frankfurt am Main, Frankfurt am Main, Germany
Open Access

Abstract

Worldwide digitization efforts such as iDigBio in the U.S. and DiSSCo in Europe will provide mass biodiversity data from scientific collections. This opens up a growing body of data on wild type organisms and enables the building of large biodiversity knowledge graphs comprising, inter alia, sequence, trait and occurrence data. Knowledge graphs model information as entities and their relationships, which, as good practice, are expressed as ontology-based annotations. Based on ontological descriptions, semantic similarity analysis makes it possible to link wild type data to genomic and proteomic data of model organisms and thus supports knowledge discovery on crop wild relatives and underutilized species of interest for medicine, breeding and agriculture. Classical similarity measures focus on recording differences between character states (aiming to describe disease phenotypes) rather than on the character states themselves in the sense of trait variations, so new methods for similarity search are required. Machine learning algorithms operate on feature vectors, which are numeric representations of data (images, class labels, etc.) in n-dimensional vector space. We established a machine learning based workflow for similarity search on biodiversity entities that uses feature learning on ontologies and an associated RDF knowledge graph to project structured trait data into vector space. The resulting vectors are then compared using a similarity function (e.g. cosine similarity) to determine the similarity between taxa based on trait semantics. We will present an application example of machine learning on biodiversity knowledge graphs: a pipeline built upon OPA2Vec, a method to generate feature vectors from the logical content of ontologies (Smaili et al. 2018), which successfully clusters plant species by life form and ecotype (e.g. tree vs. perennial plant) on the basis of their annotations with the Flora Phenotype Ontology (Hoehndorf et al. 2016).
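The comparison step mentioned in the abstract, applying cosine similarity to feature vectors, can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not call OPA2Vec. The taxon names, the three-dimensional vectors and the helper `most_similar` are hypothetical, invented only to show how taxa might be ranked once embeddings exist; real embeddings learned from ontology annotations would have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings):
    """Rank every other taxon by similarity to `query`, most similar first.

    Hypothetical helper for illustration; not part of any library.
    """
    query_vec = embeddings[query]
    scores = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings with invented values (assumption, not real data).
embeddings = {
    "tree_taxon_a": [0.9, 0.1, 0.2],
    "tree_taxon_b": [0.8, 0.2, 0.1],
    "herb_taxon_c": [0.1, 0.9, 0.3],
}

ranking = most_similar("tree_taxon_a", embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these invented vectors the two tree-like taxa rank closest to each other, which mirrors the idea of grouping species by life form. Cosine similarity depends only on the direction of the vectors and not on their length, which is one reason it is a common choice for comparing learned embeddings.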

Keywords

semantic similarity, machine learning, trait semantics, phenotype ontology, knowledge graph

Presenting author

Claus Weiland

Presented at

Biodiversity_Next 2019

References