Biodiversity Information Science and Standards : Conference Abstract
Corresponding author: Roderic Page (roderic.page@glasgow.ac.uk)
Received: 30 Mar 2019 | Published: 13 Jun 2019
© 2019 Roderic Page
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Page R (2019) Text-mining BHL: towards new interfaces to the biodiversity literature. Biodiversity Information Science and Standards 3: e35013. https://doi.org/10.3897/biss.3.35013
The taxonomic literature is one of the largest resources of information on biodiversity, both past and present. Unlike the literature of many scientific disciplines, it remains perpetually relevant, because successive taxonomic work builds upon earlier foundations. Projects such as the Biodiversity Heritage Library (BHL) have greatly increased access to that literature, as have numerous independent digitisation efforts by museums, herbaria, and publishers. But this access has been aimed at human readers, with limited use of text mining tools, and those mostly focussed on extracting taxonomic names. This talk explores other kinds of data that can be extracted from text on BHL and elsewhere, focusing on taxonomic names, geographic localities and specimen codes in the context of the BioStor project (https://biostor.org,
The problem of finding taxonomic names in text has been well studied (e.g.,
In addition to taxonomic names, a typical taxonomic paper often contains specimen codes. Extracting these from text and linking them to digital representations, such as occurrence records in GBIF, makes it possible to provide detailed provenance for occurrence data, as well as citation-based metrics for the utility of natural history collections.
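As an illustration of what extracting specimen codes can involve, the following is a minimal sketch in Python. It assumes codes take the form of an upper-case collection acronym followed by a catalogue number; the pattern, the function name `find_specimen_codes`, and the sample sentence are all hypothetical and are not taken from BioStor.

```python
import re

# Hypothetical pattern (an assumption, not BioStor's rule): an upper-case
# collection acronym of 2-6 letters, an optional space or colon, then a
# catalogue number made of digit groups joined by dots, slashes or hyphens.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})[ :]?(\d+(?:[./-]\d+)*)\b")


def find_specimen_codes(text):
    """Return (acronym, catalogue number) pairs for each candidate code."""
    return [(m.group(1), m.group(2)) for m in SPECIMEN_CODE.finditer(text)]


# Invented sample sentence, for illustration only.
sample = "Holotype USNM 123456, paratypes BMNH 1947.2.26.89 and MCZ:4512."
print(find_specimen_codes(sample))
```

A pattern this simple will also match strings that are not specimen codes, so in practice each candidate would need to be checked against a list of known collection acronyms or against the occurrence records themselves.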
Taxonomic papers are also often rich in geographic information. A simple method for extracting locality information from text is to search for latitude and longitude coordinates, and BioStor currently does this. To date some 83,000 individual point localities have been extracted (Fig.
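Searching text for latitude and longitude pairs can be sketched as follows. This is a simplified example that assumes coordinates are written as degrees, with optional minutes and seconds, followed by a hemisphere letter; the names `LAT_LON`, `to_decimal` and `find_points` are illustrative and do not describe BioStor's actual implementation.

```python
import re


def _component(hemispheres):
    """Regex for one coordinate: degrees, optional minutes/seconds, hemisphere."""
    return (r"(\d{1,3})\s*°\s*"                       # degrees
            r"(?:(\d{1,2})\s*['′]\s*)?"               # optional minutes
            r'(?:(\d{1,2}(?:\.\d+)?)\s*["″]\s*)?'     # optional seconds
            rf"([{hemispheres}])")                    # hemisphere letter


# A latitude followed by a longitude, separated by commas, semicolons or spaces.
LAT_LON = re.compile(_component("NS") + r"[,;\s]+" + _component("EW"))


def to_decimal(deg, minutes, seconds, hemi):
    """Convert degrees/minutes/seconds plus hemisphere to signed decimal degrees."""
    value = int(deg) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    return -value if hemi in "SW" else value


def find_points(text):
    """Return (latitude, longitude) tuples for every plausible pair in text."""
    points = []
    for m in LAT_LON.finditer(text):
        g = m.groups()
        lat, lon = to_decimal(*g[0:4]), to_decimal(*g[4:8])
        if abs(lat) <= 90 and abs(lon) <= 180:  # discard impossible values
            points.append((lat, lon))
    return points


# Invented sample sentence, for illustration only.
sample = "Collected at 12°30'S, 45°15'E near the river, and at 3°N 101°30'E."
print(find_points(sample))
```

Scanned literature contains many other coordinate styles (decimal degrees, OCR-damaged degree signs, hemisphere letters placed first), so a production pattern would need to be considerably more permissive than this one.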
A general framework for handling data on taxonomic names, specimens, and geographic localities in text is to treat them as annotations (
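One way to picture entities in text as annotations is a stand-off record that stores the character offsets of a span together with the quoted text and its type. The sketch below is an assumption made for illustration: the `Annotation` class, its field names, and the `annotate` helper are hypothetical and do not reproduce any particular annotation standard or the model used in BioStor.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text without altering it."""
    kind: str   # e.g. "name", "specimen" or "locality" (illustrative labels)
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character
    exact: str  # the quoted text, useful for re-anchoring if offsets shift


def annotate(text, kind, pattern):
    """Create one annotation of the given kind per regex match in text."""
    return [Annotation(kind, m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text)]


# Invented sample text and deliberately simple patterns, for illustration only.
text = "Holotype USNM 123456 from 12°30'S, 45°15'E."
annotations = (
    annotate(text, "specimen", r"[A-Z]{2,6} \d+")
    + annotate(text, "locality", r"\d+°\d+'[NS], \d+°\d+'[EW]")
)
for a in annotations:
    print(a.kind, a.start, a.end, a.exact)
```

Keeping the annotations separate from the text means that names, specimens and localities can share one structure, and each record can later be extended with a link to an external identifier.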
Keywords: text mining, BHL, BioStor, taxonomic names, specimens, geocoding
Roderic Page