Biodiversity Information Science and Standards
Conference Abstract
Corresponding author: Ben Scott (b.scott@nhm.ac.uk)
Received: 06 Sep 2021 | Published: 07 Sep 2021
© 2021 Ben Scott, Laurence Livermore
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Scott B, Livermore L (2021) Extracting Data at Scale: Machine learning at the Natural History Museum. Biodiversity Information Science and Standards 5: e74031. https://doi.org/10.3897/biss.5.74031
The Natural History Museum holds over 80 million specimens and 300 million pages of scientific text. This information is a vital research tool for addressing the most important challenge humans face over the coming years – mapping a sustainable future for ourselves and the ecosystems on which we depend. Digitising these collections and providing the data in a structured, computable form is a mammoth challenge. As of 2020, less than 15% of the specimen information held on specimen labels or in physical registers had been digitised and made publicly available.
As part of SYNTHESYS+, the Natural History Museum is leading the development of a cloud-based workflow platform for natural science specimens, the Specimen Data Refinery (SDR).
Alongside specimens, digitised page images of scientific literature provide another vital source of data. Functional traits mediate the interactions between plant species and their environment and play a role in determining species’ range size and threatened status. Such information is contained within the taxonomic descriptions of species, and a natural language processing library has been developed to locate and extract plant functional traits from these texts.
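The abstract does not describe how the trait-extraction library works internally. Purely as an illustration of what locating and extracting trait values from description text can involve, the following minimal Python sketch uses a hand-written regular expression to pull size ranges out of a description. The `Trait` type, the `extract_traits` function, the organ vocabulary and the pattern itself are all assumptions made for this example; they are not the museum library's API or method.

```python
import re
from typing import List, NamedTuple


class Trait(NamedTuple):
    """One numeric trait found in a description (hypothetical structure)."""
    organ: str
    dimension: str
    low: float
    high: float
    unit: str


# Assumed pattern: "<organ> ... <low>-<high> <unit> <dimension>", all within
# one clause (the lazy gap may not cross a full stop or semicolon).
# The organ and dimension vocabularies are illustrative, not exhaustive.
PATTERN = re.compile(
    r"\b(?P<organ>leaves|petals|sepals|seeds|fruits)\b[^.;]*?"
    r"(?P<low>\d+(?:\.\d+)?)\s*[-–]\s*(?P<high>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mm|cm|m)\s+(?P<dimension>long|wide|tall)",
    re.IGNORECASE,
)


def extract_traits(description: str) -> List[Trait]:
    """Return every organ size range matched by PATTERN, in text order."""
    return [
        Trait(
            organ=m["organ"].lower(),
            dimension=m["dimension"].lower(),
            low=float(m["low"]),
            high=float(m["high"]),
            unit=m["unit"].lower(),
        )
        for m in PATTERN.finditer(description)
    ]


if __name__ == "__main__":
    text = "Leaves ovate, 3-7.5 cm long; petals white, 4–6 mm wide."
    for trait in extract_traits(text):
        print(trait)
```

A rule-based pattern like this only captures phrasings it was written for; a production library would need a much richer vocabulary and handling of units, qualifiers and multi-sentence context.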
These two projects, like many other applications of machine learning (ML) in natural history collections, focus on extracting visible information, for example, a piece of text or a measurable trait. Given the image of the specimen or page, a person would be able to extract the same information. However, ML also excels at pattern matching and at inferring unknown characters from an entire corpus. At the museum, we have started exploring this space with our voyagerAI project, which identifies specimens collected on historical expeditions of scientific discovery (e.g., the voyages of the Beagle and Challenger). This process fills gaps in specimen provenance and identifies 'lost' specimens collected by some of the most famous names in biodiversity history. Developing new applications of ML to uncover scientific meaning and tell the narratives of our collections will be at the forefront of our scientific innovation in the coming years. This presentation will give an overview of these projects and of our future plans for using ML to extract data at scale within the Natural History Museum.
Keywords:
artificial intelligence, museums, informatics
Presenting author:
Ben Scott
Presented at:
TDWG 2021