Biodiversity Information Science and Standards : Conference Abstract
Specimen Data Refinery: A landscape analysis on machine learning, computer vision and automated approaches to capture specimen metadata
Laurence Livermore‡, Robert W. N. Cubey§
‡ The Natural History Museum, London, United Kingdom
§ Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
Open Access


Compared with traditional digitisation, capturing data from specimen images is the most viable way to enrich specimen metadata cheaply and quickly. Advances in machine learning and computer vision tools, together with their increasing accessibility and affordability, greatly increase the potential to take automated measurements and capture other data from the specimens themselves, as well as to transcribe label data.

More sophisticated segmentation of images allows us to find parts of interest: particular labels, individual specimens on a slide, or barcodes. Following segmentation, colour analysis of specimens could be used for condition checking, such as looking for severe verdigris in pinned insects or discoloration of gum-chloral mountant. Automated measurements and landmark analysis of specimens can be used to create trait datasets, all of which will enrich our knowledge of specimens. Segmenting labels allows us to cluster similar labels based on their visual properties, including colour, shape and patterns. This in turn can make optical character recognition, handwriting recognition and manual transcription much more efficient. Atomising, validating and resolving label data will create structured label data that can be more easily stored, searched and linked to other datasets.
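The idea of clustering labels by visual properties can be illustrated with a minimal sketch. It assumes each segmented label has already been reduced to a list of RGB pixel values, and it groups labels whose mean colour is similar using a simple greedy rule. The function names, the distance threshold and the sample data are hypothetical and chosen only for illustration; a real pipeline would likely use richer features (shape, texture, patterns) and an established clustering method.

```python
from math import dist
from statistics import mean


def mean_colour(pixels):
    """Average RGB colour of a label crop given as a list of (r, g, b) tuples."""
    return tuple(mean(channel) for channel in zip(*pixels))


def cluster_labels(labels, threshold=40.0):
    """Greedy clustering by colour.

    A label joins the first cluster whose centroid lies within `threshold`
    (Euclidean distance in RGB space); otherwise it starts a new cluster.
    The threshold is an illustrative assumption, not a recommended value.
    """
    clusters = []
    for label_id, pixels in labels.items():
        colour = mean_colour(pixels)
        for cluster in clusters:
            if dist(colour, cluster["centroid"]) <= threshold:
                cluster["members"].append(label_id)
                cluster["colours"].append(colour)
                # Recompute the centroid from all member colours.
                cluster["centroid"] = tuple(
                    mean(channel) for channel in zip(*cluster["colours"])
                )
                break
        else:
            clusters.append(
                {"centroid": colour, "members": [label_id], "colours": [colour]}
            )
    return [cluster["members"] for cluster in clusters]


# Hypothetical label crops: two off-white labels and one red label.
labels = {
    "det_label_1": [(250, 248, 240), (245, 244, 238)],
    "red_type_label": [(200, 30, 40), (210, 35, 45)],
    "det_label_2": [(248, 246, 239), (251, 249, 242)],
}
groups = cluster_labels(labels)
print(groups)
```

Grouping visually similar labels in this way means that a transcriber, or a recognition model tuned to one label style, can work through a batch of near-identical labels together.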

We present a landscape analysis of these approaches, summarising previous work, and outline our plan to build future tools and systems in the SYNTHESYS+ project as part of the Specimen Data Refinery. The plan covers sharing tools, reducing barriers to access, and integrating workflow engines into a software architecture that allows components to be re-used and re-purposed, records provenance data for repeatability, and conforms with the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles (Wilkinson et al. 2016).
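To make the notion of re-usable components with provenance data more concrete, the following sketch chains small processing steps over a specimen record and logs a fingerprint of each step's input and output. It is not the Specimen Data Refinery architecture: the step functions, record fields and the toy "leg." collector rule are all hypothetical assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(value):
    """Stable SHA-256 digest of a JSON-serialisable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_workflow(record, steps):
    """Apply each step in order and log what went in and what came out."""
    provenance = []
    for step in steps:
        result = step(record)
        provenance.append(
            {
                "step": step.__name__,
                "input_sha256": fingerprint(record),
                "output_sha256": fingerprint(result),
            }
        )
        record = result
    return record, provenance


# Hypothetical components; each takes a record and returns a new record.
def normalise_whitespace(record):
    return {**record, "label_text": " ".join(record["label_text"].split())}


def extract_collector(record):
    # Toy atomisation rule (an assumption): text after "leg." is the collector.
    text = record["label_text"]
    collector = text.split("leg.", 1)[1].strip() if "leg." in text else None
    return {**record, "collector": collector}


specimen = {"id": "example-0001", "label_text": "Edinburgh   1902\n leg.  J. Smith"}
result, provenance = run_workflow(specimen, [normalise_whitespace, extract_collector])
print(result["collector"])
print(json.dumps(provenance, indent=2))
```

Because each step is a plain function over a record, steps can be reordered or re-used in other workflows, and re-running the same steps on the same input yields the same provenance log, which is one simple way to check repeatability.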


Keywords

machine learning, digitisation, automation, natural history specimens

Presenting author

Laurence Livermore

Presented at

Biodiversity_Next 2019

Funding program

The Specimen Data Refinery is part of SYNTHESYS+ and is funded through the RIA (Research and Innovation action) in the H2020-EU. Programme - Integrating and opening existing national and regional research infrastructures of European interest.