Biodiversity Information Science and Standards
Conference Abstract
Corresponding author: Atsuko Takano (takano@hitohaku.jp)
Received: 28 Sep 2024 | Published: 30 Sep 2024
© 2024 Atsuko Takano, Yasuhiko Horiuchi, Hajime Konagai, Chung-Kun Lee, Hiromune Mitsuhashi
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Takano A, Horiuchi Y, Konagai H, Lee C-K, Mitsuhashi H (2024) Development of an Automated Label Data Entry System from Herbarium Specimen Images at Hyogo Herbarium (HYO). Biodiversity Information Science and Standards 8: e138060. https://doi.org/10.3897/biss.8.138060
We introduce our recently developed systems for imaging herbarium specimens and for automatically extracting data from specimen labels at the Herbarium of the Museum of Nature and Human Activities, Hyogo, Japan (HYO).
First, we designed a low-cost but high-quality specimen imaging system that lets non-professional photographers obtain images rapidly (
Next, we developed a system to extract label information from specimen images. Each specimen image was uploaded to Google OCR (optical character recognition), which returned the label data as text. Uploading the whole specimen image reduced reading accuracy because the plant itself acted as noise for the OCR. We therefore cropped the label region from the whole specimen image using D-Lib*
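The cropping step can be illustrated with a minimal sketch. It assumes that a label detector (D-Lib in the text) has already returned a bounding box. The function name `crop_label`, the `margin` parameter, and the nested-list image representation are hypothetical choices made for illustration and are not taken from the HYO system.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of grayscale pixel values (illustrative)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop_label(image: Image, box: Box, margin: int = 0) -> Image:
    """Return the sub-image inside `box`, widened by `margin` pixels.

    The box is clamped to the image bounds so that a detection touching
    the edge of the sheet does not raise an error or wrap around.
    """
    if not image or not image[0]:
        raise ValueError("image is empty")
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left = max(0, left - margin)
    top = max(0, top - margin)
    right = min(width, right + margin)
    bottom = min(height, bottom + margin)
    if left >= right or top >= bottom:
        raise ValueError("bounding box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


# A 6x8 toy "specimen sheet": zeros stand for the plant, nines for the label.
sheet = [[0] * 8 for _ in range(6)]
for y in range(3, 6):
    for x in range(5, 8):
        sheet[y][x] = 9

# Hypothetical detector output for a label in the lower-right corner.
label = crop_label(sheet, (5, 3, 8, 6))
```

Only the cropped region would then be sent to the OCR service, so that the plant material no longer contributes spurious characters. A small margin can help when a detected box clips the edge of the label text.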
Finally, we developed a system that automatically labels the text extracted by OCR and enters it into the appropriate cells of the database. Although the text could be extracted from specimen images, a human was still needed to enter it into the database. We therefore adopted Named Entity Recognition (NER), a technique that identifies named entities, such as place names and other proper nouns, in unstructured text. It enables information recorded on herbarium specimen labels to be tagged as named entities. We tried text matching at first, but the results were unsatisfactory, so we turned to machine learning instead. We compared three natural language processing tools for Japanese: BERT (Bidirectional Encoder Representations from Transformers), ALBERT (A Lite BERT), and spaCy. Although BERT and spaCy returned similarly high F-scores (indicating good performance), we chose spaCy because it runs better on ordinary PCs and servers. After creating a text corpus (a specialized dataset) specific to herbarium specimen labels and training the model sufficiently on it, we successfully developed the application. The project files are available on GitHub*
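To illustrate the last step, the sketch below shows how tagged entities might be routed into database columns once an NER model has labelled the OCR text. The entity labels (`TAXON`, `LOCALITY`, `DATE`, `COLLECTOR`), the column names, and the function `entities_to_record` are assumptions made for illustration. The actual label set and database schema of the HYO system are not given in this abstract.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from NER labels to database columns.
LABEL_TO_COLUMN: Dict[str, str] = {
    "TAXON": "scientific_name",
    "LOCALITY": "locality",
    "DATE": "collection_date",
    "COLLECTOR": "collector",
}


def entities_to_record(entities: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build one database row from (text, label) pairs.

    Entities sharing a label are joined with a space in reading order.
    Labels without a mapped column are ignored, and columns that receive
    no entity are left as empty strings for a human to review.
    """
    parts: Dict[str, List[str]] = {col: [] for col in LABEL_TO_COLUMN.values()}
    for text, label in entities:
        column = LABEL_TO_COLUMN.get(label)
        if column is not None and text.strip():
            parts[column].append(text.strip())
    return {column: " ".join(texts) for column, texts in parts.items()}


# Input in the shape an NER step might produce (made-up values).
tagged = [
    ("Acer palmatum", "TAXON"),
    ("Hyogo Pref.", "LOCALITY"),
    ("Sanda City", "LOCALITY"),
    ("12 May 1998", "DATE"),
    ("No. 1234", "NUMBER"),  # unmapped label, skipped
]
record = entities_to_record(tagged)
```

With spaCy, the `(text, label)` pairs could be collected from a processed document as `[(ent.text, ent.label_) for ent in doc.ents]`. Leaving unfilled columns empty, rather than guessing, keeps it visible which fields still need manual entry.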
We then examined whether this system could be applied to images of non-plant specimens, i.e., fishes or birds, and found that it could extract data efficiently. We therefore decided to make the system publicly available on a cloud server and share it with other natural history museums in Japan*
The system mentioned above is specialized for the natural history collections of Japan, but we believe it is possible to build similar programs in other countries, and we hope our experience will contribute to the mobilization of the world’s natural history collections.
Keywords: named entity recognition, optical character recognition, digitization
Presenting author: Atsuko Takano
Presented at: SPNHC-TDWG 2024