Biodiversity Information Science and Standards :
Conference Abstract
Corresponding author: Takeru Nakazato (nakazato@dbcls.rois.ac.jp)
Received: 28 Sep 2020 | Published: 29 Sep 2020
© 2020 Takeru Nakazato
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nakazato T (2020) Survey of Species Covered by DNA Barcoding Data in BOLD and GenBank for Integration of Data for Museomics. Biodiversity Information Science and Standards 4: e59065. https://doi.org/10.3897/biss.4.59065
DNA barcoding has become widely used by biodiversity and molecular biology researchers to identify species and analyze their phylogeny. More recently, DNA metabarcoding and environmental DNA (eDNA) techniques have been developed by extending the concept of DNA barcoding. These techniques analyze the diversity and quantity of organisms within an environment by detecting biogenic DNA in water and soil. They are particularly popular for monitoring fish species living in rivers and lakes (
I downloaded all GenBank files for vertebrates from the NCBI FTP (File Transfer Protocol) site (as of November 2019). Of the GenBank fish entries, 86,958 (7.3%) were assigned BOLD identifiers (IDs). The NCBI Taxonomy database registers 39,127 fish species and 20,987 scientific names at the species level (i.e., excluding names that include sp., cf., or aff.). GenBank entries with BOLD IDs covered 11,784 species (30.1%) and 8,665 species-level names (41.3%).
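The species-level filter and the coverage figures above can be sketched as follows. This is a minimal illustration and not the author's actual script: the function names and the regular expression are assumptions, and the sample names are chosen only for demonstration.

```python
import re

# Open-nomenclature qualifiers named in the text: "sp.", "cf." and "aff.".
# The lookbehind keeps the pattern from matching inside longer tokens
# such as "subsp." (how such names were treated is not stated in the text).
_QUALIFIER = re.compile(r"(?<![A-Za-z])(?:sp|cf|aff)\.")


def is_species_level(name: str) -> bool:
    """Return True if the name carries no 'sp.', 'cf.' or 'aff.' qualifier."""
    return _QUALIFIER.search(name) is None


def coverage_percent(covered: int, total: int) -> float:
    """Share of `total` that is `covered`, as a percentage with one decimal."""
    return round(100 * covered / total, 1)


# Illustrative names only; they are not taken from the survey data.
names = [
    "Oryzias latipes",
    "Oryzias sp. 1",
    "Danio cf. rerio",
    "Danio aff. kyathit",
]
species_level_names = [n for n in names if is_species_level(n)]

# Coverage of GenBank entries with BOLD IDs, using the counts from the text.
species_coverage = coverage_percent(11_784, 39_127)
name_coverage = coverage_percent(8_665, 20_987)
```

With the counts reported above, `coverage_percent` reproduces the stated 30.1% and 41.3%.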
I also obtained the complete "specimens and sequences combined data" for fish from BOLD Systems (as of November 2019). BOLD contains 273,426 entries registered as fish. Of these, 211,589 were assigned GenBank IDs (i.e., they have a value in the "genbank_accession" column), and 121,748 were imported from GenBank (i.e., they carry the description "Mined from GenBank, NCBI" in the "institution_storing" column). The BOLD data covered 18,952 fish species and 15,063 species-level names, but 35,500 entries had no species-level name and 22,123 entries lacked even a family-level name. At the species level, 8,067 names occurred in both GenBank and BOLD, 6,997 names were specific to BOLD, and 599 were specific to GenBank.
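The counting described above could be done along the following lines. This is a hedged sketch, not the author's pipeline: the "genbank_accession" and "institution_storing" columns and the "Mined from GenBank, NCBI" marker come from the text, whereas the "processid" and "species_name" column names, the helper functions, and all sample rows (including the placeholder accessions) are assumptions made for illustration.

```python
import csv
import io

MINED_MARKER = "Mined from GenBank, NCBI"  # description quoted in the text


def summarize_bold(tsv_text: str) -> dict:
    """Count BOLD entries that are linked to, or mined from, GenBank."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    total = with_genbank_id = mined = 0
    species = set()
    for row in reader:
        total += 1
        if (row.get("genbank_accession") or "").strip():
            with_genbank_id += 1
        if MINED_MARKER in (row.get("institution_storing") or ""):
            mined += 1
        name = (row.get("species_name") or "").strip()  # assumed column name
        if name:
            species.add(name)
    return {
        "total": total,
        "with_genbank_id": with_genbank_id,
        "mined_from_genbank": mined,
        "species": species,
    }


def compare_names(bold_names: set, genbank_names: set) -> dict:
    """Split names into shared, BOLD-specific and GenBank-specific counts."""
    return {
        "shared": len(bold_names & genbank_names),
        "bold_only": len(bold_names - genbank_names),
        "genbank_only": len(genbank_names - bold_names),
    }


# Tiny made-up table; identifiers and accessions are placeholders.
rows = [
    ["processid", "genbank_accession", "institution_storing", "species_name"],
    ["EX001-19", "XX000001", "Example Museum", "Oryzias latipes"],
    ["EX002-19", "XX000002", "Mined from GenBank, NCBI", "Danio rerio"],
    ["EX003-19", "", "Example Museum", ""],
]
sample_tsv = "\n".join("\t".join(r) for r in rows) + "\n"

summary = summarize_bold(sample_tsv)
overlap = compare_names(summary["species"], {"Danio rerio", "Gadus morhua"})
```

Set operations keep the three-way split (shared, BOLD-specific, GenBank-specific) explicit, which mirrors how the species-level name counts are reported above.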
GenBank has 425,732 fish entries with voucher IDs, of which 340,386 were not assigned a BOLD ID. Of these 340,386 entries, 43,872 are registrations for COI genes and could therefore be candidates for DNA barcodes. These candidates include 4,201 species that are not included in BOLD; adding these data would thus enable us to identify 19,863 fish to the species level.
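The candidate selection described above amounts to three filters (has a voucher, has no BOLD ID, is a COI record) followed by a set difference against the species already in BOLD. The sketch below illustrates that logic under stated assumptions: the `GenBankRecord` structure, the set of COI gene-name synonyms, and every sample record are hypothetical and do not reflect how the author parsed the GenBank files.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Assumed spellings under which a COI gene may be annotated.
COI_GENE_NAMES = {"COI", "CO1", "COX1"}


@dataclass(frozen=True)
class GenBankRecord:
    """Hypothetical, simplified view of one GenBank entry."""
    accession: str
    species: str
    gene: str
    voucher: Optional[str] = None
    bold_id: Optional[str] = None


def barcode_candidates(records: Iterable[GenBankRecord]) -> List[GenBankRecord]:
    """Keep vouchered COI records that have no BOLD ID."""
    return [
        r for r in records
        if r.voucher and not r.bold_id and r.gene.upper() in COI_GENE_NAMES
    ]


def species_new_to_bold(
    candidates: Iterable[GenBankRecord], bold_species: Set[str]
) -> Set[str]:
    """Species found among the candidates but absent from BOLD."""
    return {r.species for r in candidates} - bold_species


# Made-up records; accessions, vouchers and BOLD IDs are placeholders.
records = [
    GenBankRecord("XX000001", "Gadus morhua", "COI", voucher="EX:1"),
    GenBankRecord("XX000002", "Danio rerio", "cox1", voucher="EX:2"),
    GenBankRecord("XX000003", "Danio rerio", "COI", voucher="EX:3",
                  bold_id="EX003-19"),
    GenBankRecord("XX000004", "Oryzias latipes", "cytb", voucher="EX:4"),
    GenBankRecord("XX000005", "Oryzias latipes", "COI"),
]

candidates = barcode_candidates(records)
new_species = species_new_to_bold(candidates, {"Danio rerio"})
```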
For researchers, it would be very useful if DNA barcoding data from both BOLD and GenBank could be searched in one place. This requires integrating data from the two databases. Much biodiversity data is recorded according to the Darwin Core standard, while DNA sequencing data are sometimes integrated or cross-linked using RDF (Resource Description Framework). Integrating these data may not be technically difficult, but the two databases reference different species data: the Encyclopedia of Life (EoL) for BOLD and the NCBI Taxonomy for GenBank. These differences in taxonomic systems make it difficult to match records by scientific name. GenBank has fields for the latitude and longitude of the specimens sampled, and
metabarcoding, environmental DNA, data integration, semantic web
Takeru Nakazato
TDWG 2020
ROIS-DS-JOINT (004RM2017 and 009RM2018), and The Life Science Database Integration Project
The author has declared that no competing interest exists.