Biodiversity Information Science and Standards : Conference Abstract
The Specimen Data Refinery: Using a scientific workflow approach for information extraction
Laurence Livermore‡, Paul Brack§, Ben Scott‡, Stian Soiland-Reyes§, Oliver Woolland§
‡ The Natural History Museum, London, United Kingdom
§ The University of Manchester, Manchester, United Kingdom
Open Access


Over the past three years, we have been developing the Specimen Data Refinery (SDR) to automate the extraction of data from specimen images as part of the SYNTHESYS project (Walton et al. 2020). The SDR provides an easy-to-deploy, open-source, web-based interface to multiple workflows that enable a user to create new natural history specimen records or enhance existing ones. The SDR uses the Galaxy workflow platform as the basis for managing data analysis and, where possible, uses existing Galaxy community tools and approaches (Jalili et al. 2020, Hardisty et al. 2022). We have developed a library of domain-specific tools for semantic segmentation, optical character recognition, handwritten text recognition, barcode reading and natural language processing. These tools are designed to work on standardised images of specimens, specifically herbarium sheets, pinned insects and microscope slides.

In this presentation, we describe our technical approach to developing the SDR, including the Galaxy workflow platform, application deployment, and tool interoperability using FAIR digital objects (e.g., RO-Crates and open Digital Specimen objects; Soiland-Reyes et al. 2022, Addink and Hardisty 2020). We present an evaluation of the tools, including those for segmentation and text recognition, and discuss the new challenges in using the resulting data from both a technical and a social perspective.


Keywords

Galaxy workflow platform, automation, natural history specimens, digitisation

Presenting author

Laurence Livermore

Presented at

TDWG 2022

Funding program

H2020-EU. - Integrating and opening existing national and regional research infrastructures of European interest

Grant title

SYNTHESYS PLUS – "Synthesis of systematic resources", Grant Agreement No. 823827

Author contributions

Author contributions to this article according to the Contributor Roles Taxonomy (CASRAI CRediT):

  • Laurence Livermore: Conceptualization, Data curation, Funding acquisition, Methodology, Project administration, Resources, Writing – review & editing.
  • Paul Brack: Conceptualization, Software.
  • Ben Scott: Data curation, Software, Validation.
  • Stian Soiland-Reyes: Investigation, Methodology, Supervision, Writing – review & editing.
  • Oliver Woolland: Data curation, Resources, Software, Visualization, Writing – review & editing.

