Proceedings of TDWG : Conference Abstract
|
Corresponding author: Katie Mika (kmika@fas.harvard.edu)
Received: 20 Jul 2017 | Published: 25 Jul 2017
© 2017 Katie Mika
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Mika K (2017) Crowdsourcing Data Enhancements to Improve Named Entity Recognition in the Biodiversity Heritage Library. Proceedings of TDWG 1: e17354. https://doi.org/10.3897/tdwgproceedings.1.17354
|
The Biodiversity Heritage Library's holdings include dozens of manuscript collections that are largely hidden due to minimal descriptive metadata and the absence of machine readable facsimiles. Transcription projects for collections are time consuming, intellectually intensive, and expensive for an organization to facilitate. Crowdsourcing has been identified as a sustainable model for generating transcriptions for large collections and institutions with diverse holdings and may improve data collection from a diverse range of users to enhance descriptive metadata. Manuscript transcriptions also must fit into the larger BHL objective of producing and making available large scale datasets for users and researchers to study and manipulate. While generating only full-text transcriptions will improve discoverability, items remain isolated from other literature and collections. The GNA machine learning and named entity recognition algorithms allow BHL to extract, index, and attach page records to scientific names in order to create additional access points to biodiversity literature. These tools depend on imperfect optical character recognition (OCR), which contain spelling and layout errors and outdated naming conventions and can be added to BHL’s larger crowdsourcing platform to solicit corrections. Transcriptions, similarly, introduce antiquated taxonomic data and common names and must be optimized for use with GNA’s name finding tools. Improving named entity recognition by correcting OCR output and enhancing transcribed manuscript items will allow BHL to better connect content across collections and ultimately provide a broader and more complete picture of biodiversity.
Crowdsourcing, Named Entity Recognition, Machine Learning, Optical Character Recognition, Manuscript Transcription
Katie Mika