Biodiversity Information Science and Standards: Conference Abstract
Corresponding author: Daniel G. Mulcahy (mulcahyd@si.edu)
Received: 08 Sep 2022 | Published: 09 Sep 2022
© 2022 Daniel Mulcahy
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Mulcahy DG (2022) Specimen Identifiers: Linking tissues, DNA samples, and sequence data to voucher specimens in publicly accessible databases. Biodiversity Information Science and Standards 6: e94625. https://doi.org/10.3897/biss.6.94625
Nearly all disciplines of biology now incorporate some form of molecular genetic analysis into their research, from systematics, ecology, and behavior to physiology and conservation. For science to be transparent, the source and provenance of the genetic material used must be easily identifiable and traceable, following the FAIR principles of being Findable, Accessible, Interoperable, and Reusable (
Many natural history collections are also now using digital management systems, where digital identifiers such as Digital Object Identifiers (DOIs) and Uniform Resource Identifiers (URIs) are assigned to objects in collections (
The National Center for Biotechnology Information (NCBI), which hosts GenBank, has created a BioCollections Database to curate metadata for natural history collections and to link sequence data to voucher specimens (
The NCBI BioCollections Database curators have resolved the problem of duplicate institution codes by adding the three-letter country code (or a state code, for duplicates within the same country). However, this database is used only for sequence data in GenBank and related databases (e.g., ENA, DDBJ), which raises the question: is there a need for a more universal biocollection codes database? Additionally, as museums move towards using digital identifiers in place of catalog numbers, confusion can arise when multiple digital identifiers are assigned to parts of the same “specimen” (e.g., specimen voucher, tissue, DNA, images). For instance, if a given specimen has unique URIs for the voucher specimen, the DNA, and an image, a researcher borrowing the DNA might use the DNA URI as the identifier in the genetic database. A different researcher, at a later date, might see that specimen (or its image) in the museum’s collection and think it is a different specimen of that species, when in fact it is the same one. This could result in the second researcher borrowing the sample and publishing it as a “new” sequence. Researchers already have difficulties in submitting sequences to GenBank, as several have confused field numbers for catalog numbers from the National Museum of Natural History, Smithsonian Institution (
Some museums, using modifiable codes, can extend the primary code (from the specimen voucher) to identify additional “parts” of that specimen. For example, if the primary code for an insect specimen ends in “…6d15ce”, a leg taken for DNA extraction could be coded “…6d15ce_leg” and the resulting extract “…6d15ce_dna”. This minimizes the chance of mistaking these objects as coming from different specimens. However, if completely different codes are assigned to different parts of the same specimen, the chance of mistaking two objects from the same specimen as being from different specimens increases.
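The suffix convention described above can be illustrated with a minimal sketch. It assumes an underscore separator and a small set of part labels ("leg", "dna"), both taken from the example in the text. The function names and the identifier "voucher-0001" are hypothetical and do not describe any particular museum's system.

```python
from collections import defaultdict

# Assumed convention, taken from the "_leg" / "_dna" example in the text.
PART_SEPARATOR = "_"
KNOWN_PARTS = ("leg", "dna")  # hypothetical controlled list of part labels


def derive_part_id(primary_id: str, part: str) -> str:
    """Build a part identifier by appending a suffix to the voucher's primary code."""
    if part not in KNOWN_PARTS:
        raise ValueError(f"unknown part label: {part!r}")
    return f"{primary_id}{PART_SEPARATOR}{part}"


def primary_of(identifier: str) -> str:
    """Return the voucher's primary code, stripping a known part suffix if present."""
    head, sep, tail = identifier.rpartition(PART_SEPARATOR)
    if sep and tail in KNOWN_PARTS:
        return head
    return identifier


def group_by_specimen(identifiers):
    """Group identifiers so that all parts of one specimen fall under its primary code."""
    groups = defaultdict(list)
    for ident in identifiers:
        groups[primary_of(ident)].append(ident)
    return dict(groups)


# "voucher-0001" is a made-up primary code used only for illustration.
leg_id = derive_part_id("voucher-0001", "leg")
dna_id = derive_part_id("voucher-0001", "dna")
groups = group_by_specimen(["voucher-0001", leg_id, dna_id, "voucher-0002"])
```

Because the primary code is recoverable from each part identifier, a downstream database can recognize the leg and the DNA extract as belonging to one voucher. With completely unrelated codes, the same grouping would require an explicit lookup table maintained by the collection.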
Collections staff must carefully consider the hierarchical relationships of objects in their collections, and how they are assigned URIs, especially when considering long-term operability in current and future aggregate database structures (e.g., GBIF, GGBN, NCBI, and the DES).
In this presentation, these issues are raised and the difficulties in linking specimens, genomic resources, and associated data in aggregate databases and data repositories are discussed.
Keywords: institution code, collection code, catalog number, digital identifiers
Presenting author: Daniel G. Mulcahy
Presented at: TDWG 2022