Biodiversity Information Science and Standards :
Methods
Corresponding author: Jennifer C. Girón Duque (entiminae@gmail.com)
Academic editor: Gail Kampmeier
Received: 04 Nov 2023 | Accepted: 17 Feb 2024 | Published: 06 Mar 2024
© 2024 Jennifer C. Girón Duque, Meghan Balk, Wasila Dahdul, Hilmar Lapp, István Mikó, Elie Alhajjar, Brenen Wynd, Sergei Tarasov, Christopher Lawrence, Basanta Khakurel, Arthur Porto, Lin Yan, Isadora E Fluck, Diego Porto, Joseph Keating, Israel Borokini, Katja Seltmann, Giulio Montanaro, Paula Mabee
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Girón Duque JC, Balk M, Dahdul W, Lapp H, Mikó I, Alhajjar E, Wynd B, Tarasov S, Lawrence C, Khakurel B, Porto A, Yan L, E Fluck I, Porto D, Keating J, Borokini I, Seltmann K, Montanaro G, Mabee P (2024) Meeting Report for the Phenoscape TraitFest 2023 with Comments on Organising Interdisciplinary Meetings. Biodiversity Information Science and Standards 8: e115232. https://doi.org/10.3897/biss.8.115232
The Phenoscape project has developed ontology-based tools and a knowledge base that enables the integration and discovery of phenotypes across species from the scientific literature. The Phenoscape TraitFest 2023 event aimed to promote innovative applications that adopt the capabilities supported by the data in the Phenoscape Knowledgebase and its corresponding semantics-enabled tools, algorithms and infrastructure. The event brought together 26 participants, including domain experts in biodiversity informatics, taxonomy and phylogenetics and software developers from various life-sciences programming toolkits and phylogenetic software projects, for an intense four-day collaborative software coding event. The event was designed as a hands-on workshop, based on the Open Space Technology methodology, in which participants self-organise into subgroups to collaboratively plan and work on their shared research interests. We describe how the workshop was organised, the projects developed and outcomes resulting from the workshop, as well as the challenges in bringing together a diverse group of participants to engage productively in a collaborative environment.
biodiversity, phenotype, biodiversity informatics, knowledge base, ontologies, Phenoscape
Trait data that are amenable to computational data science, including computation-driven discovery, remain relatively new to science. Efficiently repurposing, integrating and mining the vast stores of trait data have long been hampered by the limited amount of data accessible in standard formats and by the challenges involved with enabling machines to compute data that are largely recorded in natural language. A variety of resources have been developed to address these challenges, including powerful knowledge representation technologies (
Since 2007, the NSF-funded Phenoscape project*
The Phenoscape KB is an online resource that contains evolutionarily-relevant phenotypic trait data from over 250 comparative morphology studies to date and with a primary focus on the vertebrate fin-to-limb transition and comparative fish morphology (
The KB also offers an application programming interface (API) that enables exploration of connections amongst traits and between taxonomic groups. These include access to machine-reasoning-based algorithms, such as presence/absence reasoning for characters and states that are implied by, but not necessarily asserted in, original studies (
Phenoscape's current subproject, Semantics for Comparative Analysis of Trait Evolution (SCATE), is developing tools that use the KB's data and computational capabilities to assist in analyses of trait evolution (
At present, the Phenoscape KB primarily focuses on semantically-encoded vertebrate phenotypes, while the SCATE project centres on developing ontology-enhanced tools and techniques for assisting in phylogenetic comparative analyses. The infrastructure and toolset developed by Phenoscape and SCATE hold enormous potential for broader applications beyond vertebrate systems. To fully leverage this potential, broader adoption of Phenoscape/SCATE resources is needed by users and taxonomic experts from diverse communities. However, writing semantic descriptions of traits is a bottleneck for ontology-based knowledge bases (
In response to these challenges, Phenoscape/SCATE hosted TraitFest 2023, a global workshop held at the Renaissance Computing Institute (RENCI) in Chapel Hill, North Carolina, from 23-26 January 2023. The primary objective of the event was to engage potential users and contributors to the data and infrastructure provided by Phenoscape/SCATE, as well as developers of methods, especially in comparative phylogenetics and related fields. We aimed to include users whose research could benefit from computable semantics-based capabilities and whose taxonomic communities have already developed the necessary baseline knowledge representation infrastructure not currently present in the Phenoscape KB, such as for Arthropoda (
Here, we describe our approach to organising the event as a collaborative hands-on unconference-style workshop, based on the Open Space Technology (OST) methodology (
To facilitate inclusive decision-making and task sharing, an organising committee was assembled to include individuals broadly resembling the anticipated audience for the event. The full list of participants, including the organisers, with their fields of expertise and interests, can be found on the workshop wiki*
We invited participants in two ways: through targeted invitations sent by members of the organising committee and through an open call for applications. We posted an open call for participation*
In advance of the workshop, participants were asked to complete several tasks that would help them prepare for the meeting and maximise the time available at the meeting for organising groups and working on group projects. It was critical to familiarise participants with the very wide range of interests and expertise held by the larger group and to provide a platform for participants to propose ideas ahead of the workshop. We set up a GitHub repository,*
We also set up a Slack channel for general information (SCATE TraitFest) and invited all participants and relevant RENCI meeting logistics personnel. Participants were encouraged to use Slack before the meeting, for example, to arrange shared transportation from the airport, during the meeting to share resources and after the meeting to continue collaborations.
The meeting was organised around the Open Space Technology (OST) concept in open science (
The workshop agenda*
The final, self-assembled groups*
Ad hoc bootcamp sessions for training or information sharing during the first couple of days of the workshop were also encouraged. Bootcamps were participant-requested, short, informal sessions led by participants in their area of expertise. Bootcamp topics included: ontologies and Phenoscape KB tools led by J. Balhoff; machine-learning from images led by A. Porto; curation of matrix-based phenotypic descriptions using Phenex (
A few days after the workshop, we shared a survey to learn how participants perceived the event and how successful and productive they found its methodologies.
The eight projects pursued by participants during the workshop are summarised below.
Project I: Images to Traits
Problem to solve: Images are constantly being generated in biodiversity research as an important source of information about characteristics (traits) of organisms and in the ongoing digitisation of biological collections. However, extracting trait information from those images is generally time-consuming, if even possible, such that humans cannot keep up with the volume of images generated daily. Thus, the trait information encoded in those images remains 'dark' (
Approach: The group included a botanist (I. Borokini), an entomologist (J. Girón) and two informaticians (B. Altintas and X. Wang). B. Altintas and X. Wang have generated pipelines for trait annotations in fishes, along with public views of these data. This group determined that the first thing needed was a paper compiling resources used for the different aspects of generating and using images, especially for machine-learning approaches. They spent time learning from each other the "why" and "how" of generating and processing images for their work.
Results of the workshop: Along with the participation of several other workshop participants, they put together a document with a list of resources, tools and considerations when generating and using images for research in biodiversity. This document is available as a Google Doc and linked in the workshop's GitHub repository*
Future directions/plans/recommendations: Their plan is to keep working on the resources paper from two points of view: that of the biologist, who generates and uses images for particular purposes (often illustrative), and that of the informatician, who knows the standards and processes needed to extract data from those images in a format amenable to downstream computation. In addition, a discussion with a broader group of participants around the topic of documentation and metadata standards for image annotations became an after-workshop endeavour (M.A. Balk, W. Dahdul, J. Girón, A. Porto). This sub-group wants to engage the broader biodiversity standards and bio-ontologies communities in this conversation about documentation, metadata and standards (including AudioVisual Core *
Project II: Automated quantification of video and image files
Problem to solve: Research in the biological sciences is driven by a desire to provide important context and mechanisms for phenomena seen across the natural world. A common approach across and between disciplines is to assess the presence/absence of a trait (morphological, behavioural), by measuring and comparing the trait(s) of interest to establish a correlation. Regardless of the organism in question, the ability to generate reliable and readily comparable (homologous) measurements is critical to better understand the processes that influence the evolution of our ecosystems. However, data collection is often the most time and resource intensive period of many biological studies, due to a lack of access to specimens (e.g. lack of travel funds) and/or time required to learn measurement protocols and to determine viable, useful measurements within and across taxa. Data can be obtained from recorded images or videos from the organisms of interest. The goals of this project are to expedite the process of data acquisition and implement machine-learning approaches to automate the collection of measurements from large datasets. With faster, but still reliable data collection, researchers can focus more on processing and modelling.
Approach: This group was split into behavioural (video data) and morphological (image data) subgroups. As the primary focus of this group was acquisition of data, the group included many of the most junior individuals at the workshop, those who are still acquiring data for various research projects, including four graduate students (B. Khakurel, C. Charpentier, C. Lawrence, L. Yan) and one postdoctoral research fellow (B. Wynd). For each subgroup, the primary approach and goal of the workshop was to develop pipelines using existing tools to quantify morphology and behaviour. As these workshops tend to be short and communication can dwindle afterwards, the group prioritised the development of the pipelines to facilitate research and allow for continued development after the conclusion of the workshop. An additional, but secondary goal, for the image-subgroup was to evaluate the utility of point- (landmark) versus line- (vector) based approaches to quantifying measurements (
Results of the workshop: Both subgroups were able to establish working pipelines to assess their training datasets (which they brought with them).
The image subgroup was interested in generating landmarks and extracting linear measurements for teeth (B. Wynd, C. Charpentier, B. Khakurel). They were able to generate measurements on a training dataset of 50 images (both landmark and linear measurements) and then used ML-morph (
Simplified workflow of landmark automation pipeline. Logos are included for Computer Vision Annotation Tool (CVAT) and Make Sense AI, free annotation tools that easily feed directly into the pipeline. This project uses the ML-morph tool (
The video subgroup was able to generate video annotations using DeepLabCut (
Another effort this subgroup undertook was sifting through courtship behaviour data collected on jumping spiders (L. Yan). Key body parts of male spiders were also labelled using DeepLabCut, and fixed body parts were used as landmarks for Procrustes analysis (a shape analysis that superimposes the same object appearing at different positions, angles and sizes in images/videos, removing the effect of size) to account for variations in spider shape and between videos. Having quantified the different types of spider behaviour into a multi-dimensional space, based on posture coordinates and primary motion measurements, this subgroup sought a way to code the space by stereotypical behavioural units. To do this, they reduced the dimensionality of the data to two dimensions for easier visualisation and for performing clustering algorithms. Several clustering algorithms (e.g. k-means (
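The Procrustes step mentioned above can be illustrated with a minimal Python sketch. It is not the subgroup's pipeline: it assumes two-dimensional landmarks, aligns only a pair of configurations (ordinary rather than generalised Procrustes), ignores reflection, and uses invented coordinates. All function names are hypothetical.

```python
import cmath
import math

def normalise(points):
    """Centre 2D landmarks on their centroid and scale them to unit centroid size."""
    z = [complex(x, y) for x, y in points]
    centroid = sum(z) / len(z)
    z = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in z))
    return [p / size for p in z]

def align(reference, target):
    """Rotate the normalised target onto the normalised reference (no reflection)."""
    ref, tgt = normalise(reference), normalise(target)
    # The optimal rotation is the unit complex number with the phase of sum(ref * conj(tgt)).
    s = sum(r * t.conjugate() for r, t in zip(ref, tgt))
    rotation = s / abs(s)
    return ref, [rotation * t for t in tgt]

def procrustes_distance(reference, target):
    """Root summed squared distance between landmarks after superimposition."""
    ref, tgt = align(reference, target)
    return math.sqrt(sum(abs(r - t) ** 2 for r, t in zip(ref, tgt)))

# Invented landmark coordinates (x, y) for one posture.
pose_a = [(0.0, 0.0), (4.0, 0.0), (3.0, 2.0), (1.0, 3.0)]

# The same posture filmed larger (x2.5), rotated by 30 degrees and shifted in the frame.
transform = cmath.rect(2.5, math.radians(30))
pose_b = [
    ((complex(x, y) * transform).real + 10.0, (complex(x, y) * transform).imag - 3.0)
    for x, y in pose_a
]

# A genuinely different posture.
pose_c = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 1.0)]

print(procrustes_distance(pose_a, pose_b))  # close to zero: same shape
print(procrustes_distance(pose_a, pose_c))  # clearly above zero: different shape
```

After superimposition, differences in position, orientation and size no longer contribute to the distance, so the remaining variation reflects posture rather than filming conditions.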
Future directions/plans/recommendations: The data quantification project group plans to continue annotating their biological data (images and videos) for use as training datasets. The image subgroup will be looking to publish a manuscript focusing on the pipeline and requirements for the training dataset, best practices in automating measurements and an assessment of the variance in linear- versus landmark-based measurements. This manuscript will be led by C. Charpentier, a graduate student, and will form part of their dissertation. It will be a small step forward in the application of machine-learning to expedite the data collection process in landmark-based analyses of image data. The video subgroup will look further into the quality control of annotated landmarks, especially inconsistencies between frames, to generate more accurate annotations that capture the differences between types of behaviour rather than between filming methods (e.g. noisy background, individual size and angle differences). Additionally, it is important to develop a measure that keeps ArUco tag tracking consistent when the tag is attenuated in videos filmed at varied angles. The video group worked mostly with videos taken against natural or inconsistent backgrounds, which represent the majority of animal behaviour recordings. Despite their prevalence, such videos are more challenging for automatic quantification than recordings of model species against uniform backgrounds. The pipeline aims to be broadly applicable to the quantification of naturalistic video data and to set the stage for higher-resolution behaviour categorisation.
Project III: A Graph Approach to Understand Complexity in Species
Problem to solve: Complexity in living systems is an elusive concept usually defined in terms of a raw number of constituent elements, how they connect to each other and the number of hierarchical levels in which those elements can be organised. One alternative technique to describe such systems is to use graphs with anatomical entities represented as nodes and the relationships between them as edges. The goal is to use these constructed graphs as an approach to assess complexity across the tree of life. Over the course of the workshop, the group (D. Sasso Porto, E. Alhajjar, H. Lapp and J. N. Keating) developed a pilot pipeline for achieving such a task.
Approach: To develop and test our pipeline, this group retrieved the phylogenetic character matrix from
A) Graph representation of anatomical entities and their dependencies, obtained from the Phenoscape KB, for fish taxa in the family Characidae. B) Ancestral state reconstruction showing the evolution of phenotype integration within Characidae. Integration values for each species (tip) were obtained by calculating the edge density of each species subgraph (i.e. subgraph including only anatomical entities present in a species).
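The integration measure described in this caption, the edge density of a species subgraph, can be sketched in a few lines of Python. This is an illustration only, not the PhenoNet code: it treats dependencies as undirected edges (an assumption), and the entity names and dependencies are invented rather than retrieved from the Phenoscape KB.

```python
def edge_density(nodes, edges):
    """Edges present among `nodes` divided by the number of possible undirected edges."""
    n = len(nodes)
    if n < 2:
        return 0.0
    present = {
        frozenset((a, b))
        for a, b in edges
        if a in nodes and b in nodes and a != b
    }
    return len(present) / (n * (n - 1) / 2)

# Invented dependency graph among anatomical entities (assumption, not KB output).
DEPENDENCIES = [
    ("pectoral fin", "pectoral girdle"),
    ("pectoral fin ray", "pectoral fin"),
    ("pelvic fin", "pelvic girdle"),
    ("pelvic fin ray", "pelvic fin"),
]

def species_integration(present_entities, dependencies=DEPENDENCIES):
    """Edge density of the subgraph induced by the entities present in one species."""
    return edge_density(set(present_entities), dependencies)

species_a = ["pectoral fin", "pectoral girdle", "pectoral fin ray"]
species_b = ["pectoral fin", "pectoral girdle", "pelvic fin", "pelvic girdle"]

print(species_integration(species_a))  # 2 of 3 possible edges
print(species_integration(species_b))  # 2 of 6 possible edges
```

Per-species values of this kind are what would then be mapped onto the tips of a phylogeny for ancestral state reconstruction.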
Results of the workshop: We created a pilot pipeline, PhenoNet, to study the evolution of complexity. The pipeline is available as an R script in the TraitFest2023 repository.*
Future directions/plans/recommendations: The pipeline developed in the workshop used semantic data from the Phenoscape KB as the pilot study; this pipeline can be applied to any dataset for which anatomical entities can be annotated with ontology terms from an anatomy ontology (e.g. Uberon;*
Projects IV, V & VI: Trait Repository: Creating a “GenBank” for Phenotypic Data
Problem to solve: Numerous individuals and projects display a keen interest in the study of phenotypes, encompassing morphological, anatomical, ecological and physiological characteristics. These traits play a crucial role in various fields, such as phylogenetic, evo-devo, population and ecological research. The scientific community has a few, emergent resources dedicated to trait data, for example, the Functional Trait Resource for Environmental Studies (FuTRES) datastore (
The ideal repository would possess the following key features:
Approach: Our group was composed of ontologists (J. Balhoff, M.A. Balk, P. Mabee), ontology-oriented entomologists (J. Girón, I. Mikó, G. Montanaro, M. Rossini, K. C. Seltmann, S. Tarasov) and ecologists (I. Fluck, A. Espindola). To address this issue, we developed a prototype repository named PhenoRepo*
In order to assess PhenoRepo's functionality, we chose to focus on two insect groups: Coleoptera (beetles) and Hymenoptera (bees, ants and wasps). The Coleoptera dataset mainly consisted of species of dung beetles, while the Hymenoptera dataset included various bee species, such as Agapostemon texanus Cresson, 1872 (family Halictidae),*
Bees were chosen as an exemplary group due to their crucial role as pollinators and the concerning global decline they are facing (
PhenoRepo Design. Using a semantic approach to describe phenotypes (employing terms sourced from relevant ontologies) has proven to be a potent tool for rendering phenotypes understandable and accessible to computers (
Submitting Data to PhenoRepo. To contribute trait data to PhenoRepo, users are required to represent their data semantically in the form of knowledge graphs. While these graphs can be constructed using the widely-used Protégé software, it may not be the most straightforward approach. Instead, we recommend using specialised software designed explicitly for this purpose, as follows:
By utilising these specialised tools, users can effectively construct semantic knowledge graphs, making the process of submitting trait data to PhenoRepo more efficient and seamless.
Results of the workshop: We created the instance-based repository PhenoRepo, a repository of semantic traits for any phenotype, including morphological, ecological and environmental data. Users were able to upload OWL files to the PhenoRepo GitHub repository.
PhenoRepo was tested with data from diverse sources, including two Darwin Core-formatted (
Additionally, we built a workflow and supporting infrastructure to apply reasoning to the data. Environmental data and measurement data were also converted into Phenoscript format, which was then converted to OWL and uploaded to PhenoRepo. These Phenoscript files were also converted to human-readable Markdown syntax, which may be included in journal publications. We also managed to import character matrices annotated in Phenex into PhenoRepo using a custom Python script. Example input and output files can be found in PhenoRepo and in our workshop presentation.*
Future directions/plans/recommendations: There is a significant need to expand our ability to include trait and phenotype data within a semantic framework. To do so, ontology resources need to be expanded and additional effort aligning existing ontologies is needed to make inferences across taxa. Already, M.A. Balk is working with the Ontology of Biological Attributes (OBA;
Indeed, PhenoRepo was a successful proof of concept, demonstrating a broader approach to taxon-independent phenotype database and workflow; however, more effort is needed to describe functional traits (
Project VII: Image extraction from literature
Problem to solve: Images are a useful tool in the biological sciences to convey visual information about organisms, including anatomical features and contrasting differences between species. The scientific literature contains many potentially useful images of organisms and, often, an accompanying caption with relevant textual information, such as a scale or trait descriptions. The ability to extract images and their accompanying captions can potentially greatly increase the amount of trait information accessible to researchers through resources, such as the Phenoscape KB. Our goal is to create a workflow to extract images and captions from PDFs into usable formats that can be used downstream to mine captions for anatomical terms and automatically annotate traits from images by using a machine-learning pipeline like ML-morph (
Approach: The group included participants with expertise or interest in ontologies and Phenoscape (M.A. Balk, W. Dahdul, J. Balhoff, D. Sasso Porto), a machine-learning expert (A. Porto) and persons interested in ChatGPT (A. Porto, J. Balhoff).
Both goals of this project required the extraction of images and captions from a collection of PDFs pertaining to bryozoans. A literature search previously compiled by A. Porto was used as the image and text corpus. As a test run, the most recent literature from 2016–2021 (441 PDFs) was used as these papers were more likely to have a standardised format and are computer-readable with Optical Character Recognition text. To extract images and captions from PDFs, they used PDFFigures 2.0, a Scala-based tool developed by AllenAI.*
The first goal was to extract trait terms from the textual descriptions in captions. This goal necessitated the creation of a new ontology for bryozoan anatomy and traits. The terms and species identities will feed into the Phenoscape KB to create character matrices. For the extraction of visual traits from images, we planned to utilise the machine-learning tools developed by A. Porto to automatically segment and landmark images of bryozoans (see
Results of the workshop:
The PDFFigures 2.0 application successfully extracted 2874 images in JPG format from the 441 PDF files. As the output images may be of different types (e.g. Scanning Electron Microscope (SEM) images, tables, maps), they performed a Principal Component Analysis (PCA) to group the images by type. The scripts and a demo are on the group's repository, lit-bryo.*
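How a PCA can separate extracted figures by type is sketched below in plain Python. The source does not state which image features were used, so the three features per image (mean brightness, fraction of near-white pixels, edge density) and all values are invented for illustration. The first principal component is obtained by power iteration on the covariance matrix, and the function name is hypothetical.

```python
def first_principal_component(rows, iterations=200):
    """Return the first principal axis and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred features.
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centred]
    return vec, scores

# Invented feature vectors: (mean brightness, fraction near-white pixels, edge density).
features = [
    [0.30, 0.02, 0.40],  # SEM-like image
    [0.35, 0.03, 0.45],  # SEM-like image
    [0.28, 0.01, 0.38],  # SEM-like image
    [0.92, 0.85, 0.10],  # table-like image
    [0.95, 0.90, 0.08],  # table-like image
    [0.90, 0.80, 0.12],  # table-like image
]

axis, scores = first_principal_component(features)
for label, score in zip(["SEM"] * 3 + ["table"] * 3, scores):
    print(label, round(score, 3))
```

In this toy example the two kinds of image fall on opposite sides of zero along the first component, which is the sort of separation that allows images to be grouped by type before downstream analysis.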
M.A. Balk created the Bryozoan Attribute Ontology (BAO)*
Terms and identifiers for ten morphological structures of bryozoans in the new Bryozoan Attribute Ontology.
Term | Identifier |
'pore chamber' | BRYO:0000001 |
'pore chamber cavity' | BRYO:0000002 |
'pore chamber plate' | BRYO:0000003 |
'primary orifice' | BRYO:0000004 |
'secondary orifice' | BRYO:0000005 |
'lophophore orifice' | BRYO:0000006 |
'lophophore opening' | BRYO:0000007 |
'lophophore feeding organ' | BRYO:0000008 |
'lophophorate feeding system' | BRYO:0000009 |
'ovicell' | BRYO:0000010 |
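The first goal of this project, mining figure captions for anatomical terms, can be illustrated with a simple dictionary-based matcher. The sketch below is not the group's implementation: it uses four of the labels and identifiers from the table above, matches longer labels first so that 'pore chamber plate' is not reported as 'pore chamber', tolerates a plural "s", and runs on an invented caption. The function names are hypothetical.

```python
import re

# Four labels and identifiers copied from the table above.
BAO_TERMS = {
    "pore chamber": "BRYO:0000001",
    "pore chamber plate": "BRYO:0000003",
    "primary orifice": "BRYO:0000004",
    "ovicell": "BRYO:0000010",
}

def build_matcher(terms):
    """Compile one case-insensitive pattern, trying longer labels before shorter ones."""
    labels = sorted(terms, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")s?\b"
    return re.compile(pattern, re.IGNORECASE)

def annotate_caption(caption, terms=BAO_TERMS):
    """Return (label, identifier) pairs in the order they occur in the caption."""
    matcher = build_matcher(terms)
    hits = []
    for match in matcher.finditer(caption):
        label = match.group(1).lower()
        hits.append((label, terms[label]))
    return hits

# Invented caption text for illustration.
caption = "Fig. 3. Pore chamber plate (arrow) next to the primary orifice; ovicells absent."
print(annotate_caption(caption))
```

Exact label matching of this kind misses synonyms and abbreviations, so a real workflow would probably also draw on synonym annotations in the ontology.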
Future directions/plans/recommendations: The group's future plan is to create a complete workflow from image extraction to term generation and morphology analyses (Fig.
Workflow of trait extraction from figures in the literature. Figures are extracted from PDFs using PDFFigures 2.0. This results in images and XML files of their captions. We then extract trait terms and species names for the Bryozoa ontology, which then feeds into Phenoscape to build trait presence-absence matrices. The extracted images are fed into the machine-learning programmes DeepBryo and ML-morph to automatically annotate images while maintaining metadata from the figure caption.
The group is also continuing to develop the BAO, with feedback from the OBO Foundry community (
Project VIII: ubeRsim: an R package to implement semantic similarity methods for pairwise and profile similarity
Problem to solve: RPhenoscape*
Approach: The group (H. Lapp, lead developer of RPhenoscape and J. Balhoff, lead developer of the Ubergraph RDF database) set out to create an R package that would re-implement the semantic similarity methods from the RPhenoscape package for both pairwise and profile similarity, by querying the public Ubergraph RDF database instance through its SPARQL endpoint.*
Results of the workshop: The group created a working proof-of-concept in the form of an R package tentatively called ubeRsim.*
The algorithm implemented in RPhenoscape for calculating semantic similarity scores uses matrix multiplication, which is very efficient in R, but requires a so-called subsumer matrix \((M_{i,j}=\begin{cases}1\;\mathrm{if}\ T_i\sqsupseteq T_j\\ 0\; \mathrm{otherwise}\end{cases})\). An endpoint in the Phenoscape KB API, which RPhenoscape queries, assembles and returns this matrix from a given list of input terms. In contrast, a SPARQL query of an RDF graph (as well as an SQL query of an equivalent table of graph edges) can only return an adjacency list. ubeRsim converts this to a subsumer matrix to enable keeping the same efficient matrix multiplication-based algorithm for computing semantic similarity scores.
They compared Jaccard similarity scores obtained through the ubeRsim implementation (and, thus, from subsumption subgraphs obtained from the public Ubergraph instance) with those returned from RPhenoscape for a list of select Uberon ontology terms (vertebrate limbs and paired and unpaired fins). They found that although the similarity scores were not numerically identical, they were similar and their relative order was mostly the same. These differences can, in part, be traced back to the upper ontologies included in the pre-reasoning, which differs between the two underlying databases and in the difference of anonymous OWL class expressions (such as "part_of some X", where X is a named term in an ontology) that are materialised in the Phenoscape KB, but are not in Ubergraph as a broader-purpose resource.
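The conversion from an adjacency list to a subsumer matrix, and the matrix-product route to Jaccard scores, can be sketched as follows. This is a plain-Python illustration, not the ubeRsim code (which is written in R): the (term, subsumer) pairs are an invented toy hierarchy, assumed to be already transitively closed, and every term is treated as subsuming itself. The function names are hypothetical.

```python
def subsumer_matrix(pairs, terms):
    """Build M with M[i][j] = 1 if subsumers[i] subsumes terms[j], else 0."""
    closed = set(pairs) | {(t, t) for t in terms}  # add reflexive subsumption
    subsumers = sorted({s for t, s in closed if t in terms})
    matrix = [[1 if (t, s) in closed else 0 for t in terms] for s in subsumers]
    return subsumers, matrix

def jaccard_matrix(matrix):
    """Pairwise Jaccard similarity of the terms (columns) from their shared subsumers."""
    n_terms = len(matrix[0])
    # inter[j][k] is entry (j, k) of the product M^T M: subsumers shared by terms j and k.
    inter = [
        [sum(row[j] * row[k] for row in matrix) for k in range(n_terms)]
        for j in range(n_terms)
    ]
    return [
        [inter[j][k] / (inter[j][j] + inter[k][k] - inter[j][k]) for k in range(n_terms)]
        for j in range(n_terms)
    ]

# Invented adjacency list of (term, subsumer) pairs, as a graph query might return it.
pairs = [
    ("pectoral fin", "paired fin"), ("pectoral fin", "appendage"),
    ("pelvic fin", "paired fin"), ("pelvic fin", "appendage"),
    ("forelimb", "limb"), ("forelimb", "appendage"),
]
terms = ["pectoral fin", "pelvic fin", "forelimb"]

subsumers, M = subsumer_matrix(pairs, terms)
scores = jaccard_matrix(M)
print(scores[0][1])  # pectoral fin vs pelvic fin: 2 shared of 4 subsumers
print(scores[0][2])  # pectoral fin vs forelimb: 1 shared of 5 subsumers
```

In R or a numerical library, the nested sums reduce to a single matrix product, which is what makes the matrix form efficient. The example also shows why scores depend on which subsumers are materialised: adding or removing rows of the matrix changes both the intersection and the union counts.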
Future directions/plans/recommendations: The major directions for taking ubeRsim from its current proof-of-concept stage to a more widely-usable package in the comparative trait analysis ecosystem in R include the following: (1) generating "virtual" subsumer terms equivalent to anonymous OWL class expressions that would have been encountered as subsumers had they been materialised, so as to achieve equally discriminatory similarity scores as RPhenoscape and the Phenoscape KB; (2) providing user choice for which properties beyond subclass relationships to use for querying subsumption subgraphs; and (3) finding a mechanism or resource for obtaining broadly applicable term frequencies to enable information content-based similarity metrics.
Post-workshop Survey Responses
A post-workshop survey was sent to participants to understand their experience and evaluate the potential impact of the workshop. We received 10 responses, which were highly positive on whether the workshop was worth their time (4.9 average response on a 1-5 scale), whether they gained new knowledge (4.9) and whether they made connections with other participants that will enable new or better research endeavours (4.8). In free-form responses, participants also responded positively to the OST format and valued the unstructured work time and learning opportunities. On the other hand, participants noted a desire for more time in the initial project exploration phase, for scheduled bootcamps so that everyone has an opportunity to attend and for more social gatherings and shared meals.
Although the Phenoscape KB is broad in scope by including Uberon*
The use of Open Space Technology (OST): the good and the not so great. Giving people the freedom to choose which projects to work on and self-organise significantly distributed the weight of coordinating activities, which streamlined the flow of ideas by allowing groups to focus on particular tasks and people to shift to other projects when it seemed appropriate. By setting up the workshop wiki,*
Another aspect of the freedom offered by the OST format was that, by working together in one place, people felt that there was time and space to get things done, even though the time was limited. For example, members within and between groups interacted with one another frequently, moving projects along more quickly than if we had been working asynchronously.
A diversity of topics for a diversity of users and backgrounds. During the introductions session, it seemed unclear how we would manage to get people with such a broad assortment of backgrounds to work together on innovative uses for the tools provided by the Phenoscape KB. Fortunately, participants came to the workshop willing to learn and share. The OST format also allowed the integration of ideas originating from the different approaches to data that different kinds of expertise can bring, and these ideas became more solid as people with varied perspectives worked on them together.
Bootcamps: necessary distractions? The impromptu bootcamps that developed during the workshop meant that all participants went home with new knowledge in addition to the products of their projects, although, for some groups, the bootcamps distracted members from working towards their project goals. The bootcamps also helped participants from different backgrounds to learn what skills others had and to form a better idea of how to integrate them.
Extending beyond our usual networks. Perhaps one of the greatest achievements of the event was giving researchers who are experts in a particular domain the chance to explore and learn from experts with backgrounds very different from their own. As researchers, we tend to attend the same kinds of events repeatedly, talking primarily with those in our own field. Bringing people from diverse backgrounds together and encouraging them to work in interdisciplinary groups resulted in a unique opportunity for professional development, especially for those in the earlier stages of their careers. The workshop allowed everyone both to be the expert and to be introduced to a completely new subject.
As we believe the results and observations reported here show, interdisciplinary meetings can be extremely productive, with both tangible and intangible outcomes greatly outweighing the organisational costs, even when these costs are substantial, as they are especially for projects that aim to integrate across knowledge domains.
Bringing together people with complementary knowledge, skills and interests, and getting them to talk to and teach each other in pursuit of shared research objectives, is a powerful way to expand perspectives and helps foster a sense of community and belonging.
Collaborative Research: ABI Innovation: Enabling machine-actionable semantics for comparative analyses of trait evolution
Awards 1661456 (Duke University), 1661529 (University of South Dakota), 1661516 (Virginia Tech), and 1661356 (University of North Carolina at Chapel Hill and RENCI). The attendance of S. Tarasov and D. S. Porto was supported by the Research Council of Finland (#339576 and #346294).