Biodiversity Information Science and Standards
Conference Abstract
Corresponding author: Ben Scott (b.scott@nhm.ac.uk)
Received: 27 Jul 2022 | Published: 01 Aug 2022
© 2022 Ben Scott
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Scott B (2022) Cloud AI: A comparison of specimen image data extraction processes. Biodiversity Information Science and Standards 6: e90951. https://doi.org/10.3897/biss.6.90951
The Natural History Museum (NHM), London, has embarked on an ambitious programme to digitise the 80 million specimens in its collection, releasing them through the NHM Data Portal to the global biodiversity research community. As part of the digitisation process, data is transcribed from specimen labels to capture the vital taxonomic and collection event data. Accurate human transcription is slow, and the NHM, like many institutions, has been exploring machine learning (ML) for automated specimen analysis and label data capture. This process requires several different models chained in series: semantic segmentation to identify specimen and label regions of interest; optical character recognition to read the text on labels; and natural language processing to extract entities from that text.
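The chaining described above can be illustrated with a minimal sketch. It is not the NHM's implementation: the three stage functions (`segment`, `ocr`, `extract_entities`), the `Region` type, and the regular expressions are hypothetical stand-ins, and an "image" is modelled as a list of pre-labelled regions so that the example runs without any ML models. It only shows how the output of each stage feeds the next.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """Hypothetical region of interest found in a specimen image."""
    kind: str  # "specimen" or "label"
    text: str  # stand-in for pixel data; a real region would hold an image crop


def segment(image: List[Region]) -> List[Region]:
    """Stage 1 stand-in (segmentation): keep only the label regions."""
    return [region for region in image if region.kind == "label"]


def ocr(region: Region) -> str:
    """Stage 2 stand-in (OCR): 'read' the text carried by a label region."""
    return region.text.strip()


def extract_entities(text: str) -> Dict[str, str]:
    """Stage 3 stand-in (NLP): pull a year and a collector name with toy regexes."""
    entities: Dict[str, str] = {}
    year = re.search(r"\b(1[7-9]\d{2}|20\d{2})\b", text)
    if year:
        entities["year"] = year.group(1)
    collector = re.search(r"(?:leg\.|coll\.)\s*([A-Z][\w.\- ]+)", text)
    if collector:
        entities["collector"] = collector.group(1).strip()
    return entities


def run_pipeline(image: List[Region]) -> List[Dict[str, str]]:
    """Chain the stages in series: segmentation -> OCR -> entity extraction."""
    return [extract_entities(ocr(region)) for region in segment(image)]


# Usage: one specimen region and one label region (invented example data).
example_image = [
    Region(kind="specimen", text=""),
    Region(kind="label", text=" Kent, 1897, leg. J. Smith "),
]
records = run_pipeline(example_image)
```

Because each stage consumes only the previous stage's output, any one stand-in could in principle be swapped for a different model or service without touching the others.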
As part of SYNTHESYS+, the NHM has been building the Specimen Data Refinery (SDR) (
Keywords:
machine learning, cloud computing, digitisation, natural history collections, specimen data
Ben Scott
TDWG 2022