Proceedings of TDWG: Conference Abstract
Corresponding author: Gaurav Yeole (gauravyeole@ufl.edu)
Received: 15 Aug 2017 | Published: 15 Aug 2017
© 2017 Gaurav Yeole, Saniya Sahdev, Matthew Collins, Alex Thompson, Rebecca Dikow, Paul Frandsen, Sylvia Orli, Renato Figueiredo
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Yeole G, Sahdev S, Collins M, Thompson A, Dikow R, Frandsen P, Orli S, Figueiredo R (2017) A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens. Proceedings of TDWG 1: e20326. https://doi.org/10.3897/tdwgproceedings.1.20326
iDigBio currently references over 22 million media files and stores approximately 120 terabytes of those files co-located with our computing infrastructure (
Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all, or any subset, of the images stored in iDigBio. The pipeline runs on servers located in the Advanced Computing and Information Systems (ACIS) lab, alongside the iDigBio storage system. It uses Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos (
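The idea of applying user-defined processing to a selected subset of images can be sketched in a few lines. The following is a minimal illustration only, not the GUODA implementation: it uses a standard-library thread pool as a stand-in for the distribution that Spark would provide, and the record fields (`id`, `kind`, `size_bytes`), the function names, and the sample values are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, size):
    """Yield successive batches of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_partition(batch, user_fn):
    """Apply the user-defined function to every record in one batch."""
    return [(record["id"], user_fn(record)) for record in batch]


def run_pipeline(records, select, user_fn, partition_size=2, workers=4):
    """Select a subset of records, then process it batch by batch in parallel.

    A thread pool stands in here for a cluster scheduler; this is an
    assumption made to keep the sketch self-contained.
    """
    subset = (record for record in records if select(record))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_partition, batch, user_fn)
            for batch in partition(subset, partition_size)
        ]
        results = []
        # Collect in submission order so the output order is deterministic.
        for future in futures:
            results.extend(future.result())
    return results


# Placeholder records; the field names are illustrative, not iDigBio's schema.
records = [
    {"id": "a", "kind": "herbarium", "size_bytes": 2048},
    {"id": "b", "kind": "insect", "size_bytes": 4096},
    {"id": "c", "kind": "herbarium", "size_bytes": 1024},
    {"id": "d", "kind": "herbarium", "size_bytes": 3072},
]

results = run_pipeline(
    records,
    select=lambda r: r["kind"] == "herbarium",   # the "any subset" part
    user_fn=lambda r: r["size_bytes"] // 1024,   # the user-defined processing
)
print(results)
```

The selection predicate and the processing function are passed in as arguments, so the same driver can serve different research questions without modification.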
As an example of how this pipeline can be used in research, we are applying a deep convolutional neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation). The network detects visible mercury crystallization on digitized herbarium sheets. It was trained on Smithsonian servers using the Smithsonian's herbarium images and is being transferred to the GUODA infrastructure hosted at the ACIS lab. We are classifying all herbarium images in iDigBio with this model to illustrate how such techniques apply to larger sets of images. An automated detection process of this kind could be used, for instance, to notify other data publishers of possible contamination. We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows that this pipeline and infrastructure enable.
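The notification use case can be pictured as a simple post-processing step over classifier output. The sketch below is hypothetical: it assumes each classified image carries a score between 0 and 1 and a publisher identifier, and it groups images at or above a chosen threshold by publisher. The field names, the threshold of 0.5, and the scores are placeholders invented for illustration and are not results from the model.

```python
from collections import defaultdict


def flag_by_publisher(scored, threshold=0.5):
    """Group image ids whose score meets the threshold by their publisher.

    `threshold` is an assumed cut-off; a real workflow would need to
    choose and validate it.
    """
    flagged = defaultdict(list)
    for record in scored:
        if record["score"] >= threshold:
            flagged[record["publisher"]].append(record["id"])
    return dict(flagged)


# Made-up placeholder scores, not output of the Smithsonian model.
scored = [
    {"id": "img-1", "publisher": "pub-1", "score": 0.91},
    {"id": "img-2", "publisher": "pub-1", "score": 0.12},
    {"id": "img-3", "publisher": "pub-1", "score": 0.77},
    {"id": "img-4", "publisher": "pub-2", "score": 0.66},
    {"id": "img-5", "publisher": "pub-3", "score": 0.05},
]

flagged = flag_by_publisher(scored)
for publisher, image_ids in flagged.items():
    print(publisher, image_ids)
```

Publishers with no image above the threshold do not appear in the result, so each entry corresponds to one potential notification.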
Keywords: Biocollection Infrastructure, Cloud Computing, Neural Networks, Deep Learning
Matthew Collins