Biodiversity Information Science and Standards :
Methods
Corresponding author: Anne E Thessen (annethessen@gmail.com)
Academic editor: Elycia Wallis
Received: 06 Jul 2022 | Accepted: 31 Aug 2022 | Published: 12 Oct 2022
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Thessen AE, Mozzherin D, Shorthouse DP, Patterson DJ (2022) Improving the discoverability of biodiversity data using the Global Names Finder. Biodiversity Information Science and Standards 6: e90026. https://doi.org/10.3897/biss.6.90026
The majority of biodiversity data is not findable, accessible, integratable, or reusable, partially because of a lack of metadata. Taxonomic names as metadata are useful, but not sufficient because these names may be updated as knowledge progresses. There is a great need for tools and services that can scale up to create and maintain metadata for the vast and varied long tail of dark data. Here we examine the use of GNFinder as a tool for creating and maintaining metadata using mentions of taxa in text from publications corresponding to data sets deposited in Dryad. Most studied taxa were mentioned in the publication using a properly formed scientific name, with a few exceptions for studies that only used vernacular names or only mentioned taxa in the corresponding data files. GNFinder had a high F1 Score (0.86), representing a balance between precision (0.91) and recall (0.82). GNFinder had lower performance when a name string was an irregular abbreviation, had unexpected capitalization or punctuation, or contained a qualifier (like aff. or cf.). Approximately 14% of the name strings identified in text published from 1996 to 2012 were outdated and were updated to a current, valid name. Automated metadata creation and maintenance at scale using GNFinder can make it easier to find biodiversity publications, as demonstrated by the Biodiversity Heritage Library and HathiTrust.
taxonomic names, indexing, metadata, named entity recognition
Much attention has been given to the “data deluge” (
One unique aspect of biodiversity data is that scientific names can be used as near universal metadata (
Here we examine the feasibility of living metadata in biodiversity using GNFinder, a tool that can find scientific names in text with a high degree of precision and recall and return the corresponding current, valid name in JSON or CSV format. GNFinder was developed with the goal of processing everything ever published and is currently being used by the Biodiversity Heritage Library (BHL) (
This paper seeks to determine the efficacy of GNFinder for adding taxonomic metadata to the published literature. As a result, annotations are made and results reported at the document level. Multiple instances of the same name string in a document were only counted once.
Dryad is a repository for ecology and evolution data files that correspond to publications (
GNFinder is a web service that uses a combination of naive Bayes, rules, and lists to find scientific names in text (
Example output from GNFinder showing the results of the heuristic and statistical rules used by the naive Bayes algorithm to calculate the final score. In this example GNFinder identified the name Canis familiaris with high odds of being a taxonomic name based on the following criteria. This name is in two separate “go” lists (A and B). Both the genus and the specific epithet have endings that are common in Latin (C and D). The length of the specific epithet and the genus are within expected values (E and G). The name is not an abbreviation (F). All of these features were used by a naive Bayes algorithm to calculate the final “odds” score, in this case, 11.56. The Bayesian prior was set at 0.1 (H).
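The way a naive Bayes classifier turns a prior and a set of feature observations into an odds score can be illustrated with a minimal sketch. It assumes each feature contributes an independent likelihood ratio; the ratios and function names below are hypothetical and are not GNFinder's trained values or its actual code. Only the prior of 0.1 is taken from the caption.

```python
from math import prod


def posterior_odds(prior_probability: float, likelihood_ratios: list[float]) -> float:
    """Multiply the prior odds by one likelihood ratio per observed feature."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability between 0 and 1."""
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios, one per feature (e.g. list membership,
# Latin-like endings, word lengths, abbreviation status). Illustrative only.
hypothetical_ratios = [8.0, 4.0, 1.5, 1.2, 1.1, 0.9, 1.6]

odds = posterior_odds(0.1, hypothetical_ratios)
print(f"odds = {odds:.2f}, probability = {odds_to_probability(odds):.3f}")
```

Ratios above 1 push the odds toward "this is a scientific name" and ratios below 1 push them away, which is why a name that passes most checks ends up with high odds even from a low prior.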
The score (result of naive Bayes) is represented as “odds” instead of a probability. GNFinder output can be configured to show the results from each of the rules and the final Bayesian score (Fig.
Once GNFinder has recognized a name in the text, the name string is parsed into its semantic elements such as genus name, specific epithet, year of publication, authorship, etc. using GNParser (
GNVerifier compares the name string found by GNFinder to names in a list of over 200 reference taxonomies (
Example GNVerifier score matching “Canis familiaris” found name string to Canis lupus familiaris Linnaeus, 1758 in the Catalogue of Life. The final score (G) is calculated based on the following seven attributes and used to sort results: A) Are the names uninomials, binomials, or trinomials? B) Do the names share an infraspecific rank, such as variety or form? C) Do the names match exactly? D) How carefully curated is the source of the matched name? E) Does the author and year information match? F) Is the found name a synonym of the matched name?
GNFinder can be accessed directly through the webpage (
Two human annotators found every unique name string used to refer to a taxon in every manuscript pdf for the 215 publications (
GNFinder returned all of the found name strings and their associated taxon concepts in a CSV file and in a JSON file (
To describe the advances made by GNFinder, we took a subset (17 randomly selected) of the 215 publications and calculated performance metrics using several other published name-finding tools: TaxonFinder (
The subset of publications used to compare GNFinder performance to other, similar tools was also used to explore the utility of GNFinder for creating metadata. To test this, we created a list of taxa represented by all of the name strings recorded by the annotators from the publication and the corresponding data files in Dryad. For each data package (publication and data files) we calculated the total number of taxa present, the taxa only represented in the data files, the taxa only represented by a vernacular name, and the taxa only represented as an improperly formed scientific name. These lists included higher level taxa that appeared in the text or data, or as a vernacular or a scientific name, even when a child taxon was present. Paraphyletic taxa referred to by a vernacular name were counted as being represented by a vernacular name only, unless all of the scientific names implied by that vernacular name were also present (e.g., barrel cactus is a paraphyletic group including Echinocactus and Ferocactus).
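The per-package tallies described above can be expressed with set operations. This is a minimal sketch, assuming each taxon has already been reduced to a single identifier; the function name, the dictionary keys, and the example taxa are hypothetical and only illustrate the bookkeeping.

```python
def summarize_package(manuscript_taxa: set, data_file_taxa: set,
                      vernacular_only: set, irregular_only: set) -> dict:
    """Tally taxa for one data package (publication plus data files)."""
    all_taxa = manuscript_taxa | data_file_taxa
    total = len(all_taxa)

    def pct(count: int) -> float:
        # Percentage of all taxa in the package, rounded to one decimal place.
        return round(100.0 * count / total, 1) if total else 0.0

    return {
        "total_taxa": total,
        "in_manuscript_pct": pct(len(manuscript_taxa)),
        "data_files_only_pct": pct(len(data_file_taxa - manuscript_taxa)),
        "vernacular_only_pct": pct(len(vernacular_only & all_taxa)),
        "irregular_only_pct": pct(len(irregular_only & all_taxa)),
    }


# Hypothetical package: three taxa in the manuscript, one found only in the data.
summary = summarize_package(
    manuscript_taxa={"Echinocactus", "Ferocactus", "Panthera leo"},
    data_file_taxa={"Panthera leo", "Bos taurus"},
    vernacular_only={"Bos taurus"},
    irregular_only=set(),
)
print(summary)
```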
To explore the prevalence of outdated names in the literature, we examined the results from GNVerifier. Any names that were found to be exact matches for synonyms (i.e., matchType = Exact and isSynonym = True) were considered, for the purpose of this exercise, as outdated names even though they may reflect different taxonomic preferences of the sources.
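The selection rule above amounts to a simple filter over verification results. The sketch below assumes each result has been loaded as a dictionary; only the field names matchType and isSynonym come from the text, while the record layout, the other keys, and the example rows are hypothetical.

```python
def is_outdated(record: dict) -> bool:
    """Treat a name as outdated when it exactly matches a synonym."""
    return record.get("matchType") == "Exact" and record.get("isSynonym") is True


# Hypothetical verification records for illustration only.
records = [
    {"name": "Canis familiaris", "matchType": "Exact", "isSynonym": True},
    {"name": "Panthera leo", "matchType": "Exact", "isSynonym": False},
    {"name": "Felis silvestris lybica", "matchType": "Fuzzy", "isSynonym": True},
]

outdated = [r for r in records if is_outdated(r)]
share = 100.0 * len(outdated) / len(records)
print([r["name"] for r in outdated], f"{share:.1f}% outdated")
```

If the results are read from CSV rather than JSON, the isSynonym values would arrive as strings and would need converting to booleans before this comparison.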
To test annotator agreement in recognizing name strings, 27 manuscript pdf files were processed by both annotators. Vernacular names were not included, but abbreviations of scientific names were included. A Cohen’s kappa coefficient (
GNFinder performance was calculated for 215 manuscripts (Table
GNFinder made 2,559 unique errors, most of which (70%) were false negatives. These arose because GNFinder cannot read names in figures and had difficulty with trinomial abbreviations (such as L. g. confertiflora), unusual formatting and punctuation used to save room in tables, and parentheses in names (such as Nanorana (Paa) bourreti). GNFinder is not designed to perform well on virus names. Properly formed abbreviations, such as C. familiaris, were returned and parsed by GNFinder, but were not verified.
GNFinder had the highest F1 Score (Table
For the majority of this subset of the 215 publications, all of the taxa were referenced in the publication, but one data package had 78% of taxa appearing in the data file only (Table
Total number of taxa | Taxa in manuscript (%) | Taxa in data files only (%) | Taxa as vernaculars only (%) | Taxa as irregular names only (%)
116 | 98.3 | 1.72 | 6.9 | 0.0
27 | 100.0 | 0.0 | 18.5 | 0.0
137 | 99.3 | 0.7 | 0.7 | 0.0
10 | 100.0 | 0.0 | 50.0 | 0.0
37 | 100.0 | 0.0 | 0.0 | 0.0
49 | 100.0 | 0.0 | 18.4 | 2.0
26 | 100.0 | 0.0 | 3.8 | 3.6
18 | 100.0 | 0.0 | 55.6 | 0.0
36 | 100.0 | 0.0 | 5.0 | 0.0
19 | 100.0 | 0.0 | 68.4 | 0.0
127 | 21.3 | 78.7 | 5.5 | 0.0
18 | 100.0 | 0.0 | 22.2 | 0.0
37 | 100.0 | 0.0 | 18.9 | 0.0
56 | 100.0 | 0.0 | 8.9 | 8.9
43 | 100.0 | 0.0 | 14.0 | 2.3
12 | 100.0 | 0.0 | 91.7 | 0.0
5 | 100.0 | 0.0 | 100.0 | 0.0
Of the 8,710 names returned by GNFinder from 215 publications, 1,258 (14.4%) were updated by GNVerifier to a current name according to the Catalogue of Life (the default setting). The manuscripts containing these names were published from 1996 to 2012, with most published in 2012 (41%) and 2011 (32%).
Data are rendered non-discoverable because of the ways taxonomic names change over time and because of the idiosyncratic ways in which names are expressed. The Global Names project recognizes that names may be expressed in various forms, and the infrastructure has been designed so that we can extend GNFinder to parse additional variant forms (
GNFinder can find scientific names in text and resolve name strings to a current name in a user-chosen list. This is also known as Named Entity Recognition (NER) and is a very active area of research in the Natural Language Processing and Machine Learning fields (
The name-resolution function performed by GNFinder also serves as quality control for resources like BHL, which have used Optical Character Recognition (OCR) as part of the digitization process. OCR can introduce errors in names at rates that depend heavily on the language and typography used (historical texts are particularly vulnerable) (
Not all of the 11,692 unique name strings identified by human annotators were properly formed scientific names and their regular abbreviated forms. A properly formed scientific name, for the purposes of this paper, includes a binomial (Panthera leo), trinomial (Felis silvestris lybica), or higher level taxon name with or without the authority and the regular abbreviation (P. leo). This is important because the semi-supervised portion of GNFinder relies on the rules of nomenclature to identify scientific names in text. Out of all of the documented ways a taxonomic name can be represented (
Irregular abbreviation. Irregular abbreviations were scientific names shortened by any means except: a) the first one or two letters of the generic name with the first capitalized, b) followed by a full stop and a space, c) followed by the specific epithet. Often these included names with strain designations or location information. While regular abbreviations were identified by GNFinder, they were not resolved by GNVerifier.
Unusual punctuation or spacing. Unconventional spacing and punctuation can be used to represent hybrids, species complexes, or unofficial specific epithets such as Aus bus × cus, Aus bus/cus, and Aus “bus”.
Improper capitalization. Publications will sometimes contain a genus name that is not capitalized or a specific epithet that is capitalized, such as E. Caballo. GNFinder needs the capitalization to recognize the genus name and specific epithet.
Adding two-letter qualifier abbreviations. Manuscripts often have qualifiers added to the names, such as cf., aff., or sp. When these abbreviations occur within the name string, GNFinder will not recognize the binomial. When they occur after the name string, such as Bos sp., GNFinder will include the sp. in the returned name string.
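The four categories above can be approximated with a few pattern checks, which may help when screening name strings before or after name finding. This is a rough heuristic sketch, not GNFinder's logic: the regular expressions, the function name, and the category labels are assumptions built only from the descriptions and examples in the list.

```python
import re

# Regular abbreviation as described above: one or two letters of the genus
# (first capitalized), a full stop, a space, then the specific epithet.
REGULAR_ABBREVIATION = re.compile(r"^[A-Z][a-z]?\. [a-z]+$")
# Qualifiers named in the text: cf., aff., sp.
QUALIFIER = re.compile(r"\b(?:cf|aff|sp)\.")
# Hybrid signs, slashes, and quotation marks seen in the examples.
UNUSUAL_PUNCTUATION = re.compile(r'[×/"“”]')


def classify(name_string: str) -> str:
    """Assign a name string to one of the categories discussed above."""
    tokens = name_string.split()
    if REGULAR_ABBREVIATION.match(name_string):
        return "regular abbreviation"
    if QUALIFIER.search(name_string):
        return "qualifier"
    if UNUSUAL_PUNCTUATION.search(name_string):
        return "unusual punctuation or spacing"
    # Lowercase first word, or a capitalized second word.
    if tokens and (tokens[0][0].islower()
                   or (len(tokens) > 1 and tokens[1][0].isupper())):
        return "improper capitalization"
    # Any remaining abbreviated token that did not fit the regular pattern.
    if any(token.endswith(".") for token in tokens):
        return "irregular abbreviation"
    return "no irregularity detected"


for example in ["P. leo", "L. g. confertiflora", "Aus bus × cus",
                "E. Caballo", "Bos sp.", "Panthera leo"]:
    print(f"{example!r}: {classify(example)}")
```

The checks run in a fixed order, so a string with several problems is reported under the first category it matches. Names followed by an abbreviated authority would also be flagged as irregular abbreviations, so the output is best treated as a screening aid.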
The benefits of including all mentioned taxa as metadata are unclear because a paper may be about one specific taxon, but mention several; so, including all mentioned taxa could lead to less precise document retrieval. Author-supplied keywords and algorithms that can detect keywords from text (
The utility of including mentions of taxa above the rank of genus is also unclear. The argument against this is that parent taxa can be automatically added from an authoritative hierarchy when a taxon is detected; thus, keeping the search criteria broad enough to include them decreases precision of the algorithm and the document search unnecessarily. The arguments for this are the cases where only higher level taxa are mentioned and the cases where more than one genus has the same name. A path forward is to add both types of higher level taxa (i.e., found and inferred) to the metadata file and label them appropriately.
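One way to picture the proposed labeling is a small routine that marks each taxon as found or inferred. This is only a sketch under stated assumptions: the lookup table stands in for an authoritative hierarchy, and the function name, labels, and data structure are hypothetical rather than part of any Global Names tool.

```python
# Hypothetical stand-in for an authoritative hierarchy: each detected name
# maps to its parent taxa.
HYPOTHETICAL_PARENTS = {
    "Panthera leo": ["Panthera", "Felidae", "Carnivora"],
    "Felis silvestris lybica": ["Felis", "Felidae", "Carnivora"],
}


def build_taxon_metadata(found_names: list, parents: dict) -> dict:
    """Label names detected in the text as 'found' and added ancestors as 'inferred'."""
    metadata = {name: "found" for name in found_names}
    for name in found_names:
        for ancestor in parents.get(name, []):
            # Keep the 'found' label when an ancestor was also mentioned directly.
            metadata.setdefault(ancestor, "inferred")
    return metadata


metadata = build_taxon_metadata(["Panthera leo", "Felidae"], HYPOTHETICAL_PARENTS)
print(metadata)
```

Keeping the two labels separate lets a search interface decide whether inferred parents should count toward a match, without losing the record of what the text actually mentioned.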
These results suggest that the majority of relevant taxa are mentioned in the publication and thus searching the publication file will generate most of the needed taxonomic metadata for the accompanying data. This argues that the first priority for future GNFinder development should be improving the extraction of names from published manuscripts, especially proper handling of names in figures. It is known that not all data are published (
Taxonomic names are useful metadata for finding, accessing, integrating, and reusing data, but only if they can be effectively resolved when there are changes in taxonomy, or when a name represents more than one species concept. GNFinder demonstrates good overall performance on finding name strings that occur in text representing taxa across the tree of life. In this study, approximately 14% of names used in publications 10–20 years old were out-of-date and were mapped to a current, valid name by GNVerifier. Furthermore, the speed of GNFinder makes it possible to apply names as living metadata to the entire body of published literature. Without name-finding algorithms, much biological content cannot be accessed by searches based on the taxon name. The use of GNFinder to tag files with appropriate taxonomic metadata improves discovery on an unprecedented scale. The major advance of GNFinder is its almost unlimited scalability and reliability, achieved while preserving reasonably high quality of name detection.
This work was supported by NSF grants ABI 1062387 Collaborative Research: ABI: Innovation: The Global Names Architecture, an infrastructure for unifying taxonomic databases and services for managers of biological information and ABI 1356347 ABI Development: Global Names Discovery, Indexing and Reconciliation Services. We thank Lakshmi Akella and Patrick Leary for access to software.
NSF ABI 1062387 and 1356347
Collaborative Research: ABI: Innovation: The Global Names Architecture, an infrastructure for unifying taxonomic databases and services for managers of biological information and ABI Development: Global Names Discovery, Indexing and Reconciliation Services.
Contributions made by David Shorthouse represent work initiated prior to his employment with Agriculture and Agri-Food Canada.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
F1 Score = 2 * (precision * recall) / (precision + recall)
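As a worked check of these definitions, the following minimal sketch computes the three metrics and confirms that the precision (0.91) and recall (0.82) reported for GNFinder give the reported F1 Score. The function names are illustrative, and returning 0.0 for an empty denominator is an assumption made here for convenience.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of returned names that were correct."""
    returned = true_positives + false_positives
    return true_positives / returned if returned else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of annotated names that were returned."""
    annotated = true_positives + false_negatives
    return true_positives / annotated if annotated else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


# Values reported for GNFinder across the 215 manuscripts.
print(round(f1_score(0.91, 0.82), 2))  # 0.86
```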