Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Emilio Berti (emilio.berti@idiv.de)
Received: 16 Sep 2021 | Published: 17 Sep 2021
© 2021 Matthias Grenié, Emilio Berti, Juan Carvajal-Quintero, Marten Winter, Alban Sagouis
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Grenié M, Berti E, Carvajal-Quintero JD, Winter M, Sagouis A (2021) Matching Species Names Across Biodiversity Databases: Sources, tools, pitfalls and best practices for taxonomic harmonization. Biodiversity Information Science and Standards 5: e75359. https://doi.org/10.3897/biss.5.75359
The quantity and quality of ecological data have increased rapidly in recent decades, bringing ecology into the realm of big data. Frequently, multiple databases with different origins and data characteristics are combined to address new research questions. Taxonomic name harmonization, i.e., the process of standardizing taxon names according to common sources such as taxonomic databases (TDs), is necessary to properly combine multiple databases using species names. To develop proper data matching workflows, TDs and the tools using them need to be clearly and comprehensively described, but this is rarely the case. Common problems users have to deal with are: poorly described taxonomic concepts behind biological databases, a lack of information on whether and when TDs are actively updated, and missing details on where the primary source of taxonomic information comes from (e.g., secondary TDs taking information from primary TDs). In addition, software to access these TDs is not always well advertised, is partly redundant, or is developed with incompatible standards, creating additional challenges for users. As a result, taxonomic name harmonization has become a major difficulty in ecological studies. Researchers face a jungle of primary and secondary TDs, with a diversity of tools to access them and no clear workflow on how to proceed in practice. Consequently, it is hard for users to know which TD, tool, and workflow will fit the task at hand and lead to the most robust results when combining different biological datasets.
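To make the idea of name harmonization concrete, the following minimal sketch combines two small datasets by first mapping every name to an accepted name from a shared reference. It is written in Python purely for illustration and does not use any of the R packages reviewed here. The reference table, the dataset values, and the function names (`normalize`, `harmonize`, `merge_by_accepted_name`) are all hypothetical; a real workflow would query a taxonomic database instead of a hard-coded dictionary.

```python
# Toy reference: maps accepted names and synonyms to one accepted name.
# Illustrative only; not taken from any specific taxonomic database.
REFERENCE = {
    "Puma concolor": "Puma concolor",
    "Felis concolor": "Puma concolor",
    "Picea abies": "Picea abies",
    "Pinus abies": "Picea abies",
}


def normalize(name):
    """Collapse whitespace and standardize case to 'Genus epithet'."""
    parts = name.strip().split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def harmonize(name, reference=REFERENCE):
    """Return the accepted name, or None if the name is not in the reference."""
    return reference.get(normalize(name))


def merge_by_accepted_name(traits, occurrences, reference=REFERENCE):
    """Join two name-keyed datasets on their harmonized (accepted) names."""
    merged = {}
    unmatched = []
    for source, dataset in (("traits", traits), ("occurrences", occurrences)):
        for raw_name, value in dataset.items():
            accepted = harmonize(raw_name, reference)
            if accepted is None:
                # Keep unmatched names visible instead of silently dropping them.
                unmatched.append(raw_name)
                continue
            merged.setdefault(accepted, {})[source] = value
    return merged, unmatched


# Made-up example values.
traits = {"Felis concolor": "trait record A", "picea  abies": "trait record B"}
occurrences = {"Puma concolor": 12, "Pinus abies": 7, "Quercus sp.": 3}

merged, unmatched = merge_by_accepted_name(traits, occurrences)
print(merged)
print(unmatched)
```

Without the harmonization step, a direct join on the raw names would find no species in common between the two toy datasets, even though both describe the same two taxa.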
Here, we present an overview of major TDs as well as an extensive review of R packages to access TDs and to harmonize taxon names. We developed an R Shiny web application summarizing metadata and linkages among TDs and R packages (Figs 1, 2).
Figure 1. First screenshot of the interactive Shiny application to explore taxonomic databases and R packages to access them. On the bottom, a table of the available databases and packages is displayed with information about their taxonomic coverage. The search bar can be used to subset the table to the taxonomic group of interest (plants in this case). On the top, information about the chosen database or package is displayed.
Figure 2. Second screenshot of the interactive Shiny application to explore taxonomic databases and R packages to access them, showing the network of connections among them. Packages accessing a taxonomic database (Tropicos, in this case) are displayed in blue; arrows from packages to other databases indicate that these packages can access other taxonomic databases. Databases are displayed in yellow, with arrows indicating whether information from one database is used to populate another.
To our knowledge, this study represents the most exhaustive review of TDs and R tools for taxonomic name harmonization. Our intuitive Shiny app can help users make practical decisions when harmonizing taxonomic names across multiple datasets. Finally, our proposed workflows, based on conservative guiding principles (e.g., making sure incompatible taxonomic hypotheses are not combined), provide a hands-on approach to taxonomic harmonization that focuses on the quality of the end results while maximizing the number of species correctly matched.
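One possible reading of a conservative matching step is sketched below in Python: accept exact matches first, accept a fuzzy match only when exactly one candidate is sufficiently similar, and flag everything else for manual review. This is an assumption made for illustration, not the workflow proposed by the authors. The function name `match_conservatively`, the status labels, the example name list, and the similarity cutoff of 0.85 are all hypothetical choices; fuzzy matching uses `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches

# Hypothetical list of accepted names from a single reference source.
ACCEPTED = ["Picea abies", "Puma concolor", "Quercus robur"]


def match_conservatively(name, accepted_names, cutoff=0.85):
    """Return (matched_name, status) for one input name.

    status is one of: 'exact', 'fuzzy', 'ambiguous', 'unmatched'.
    A name is only resolved automatically when the match is unambiguous.
    """
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in accepted_names:
        return cleaned, "exact"
    # Ask for up to two candidates so that ambiguity can be detected.
    candidates = get_close_matches(cleaned, accepted_names, n=2, cutoff=cutoff)
    if len(candidates) == 1:
        return candidates[0], "fuzzy"
    if len(candidates) > 1:
        # Several similar names: leave the decision to a human.
        return None, "ambiguous"
    return None, "unmatched"


for raw in ["Picea abies", "Picea abeis", "Homo sapiens"]:
    print(raw, "->", match_conservatively(raw, ACCEPTED))
```

Raising the cutoff resolves fewer names automatically but lowers the risk of false matches; names returned as ambiguous or unmatched can then be checked by hand rather than forced onto a possibly wrong taxon.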
Keywords: taxonomy, standardization, backbone, taxonomic reference, R packages, workflow, guidelines
Presenting author: Emilio Berti
Presented at: TDWG 2021
German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Puschstraße 4, 04103 Leipzig, Germany