Proceedings of TDWG: Conference Abstract
Corresponding authors: Qian Zhang (zhangqian06@gmail.com), Paul J. Morris (mole@morris.net)
Received: 16 Aug 2017 | Published: 16 Aug 2017
© 2017 Qian Zhang, Paul J. Morris, Timothy McPhillips, James Hanken, David Lowery, Bertram Ludäscher, James Macklin, Robert Morris, John Wieczorek
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Zhang Q, Morris P, McPhillips T, Hanken J, Lowery D, Ludäscher B, Macklin J, Morris R, Wieczorek J (2017) Using YesWorkflow hybrid queries to reveal data lineage from data curation activities. Proceedings of TDWG 1: e20380. https://doi.org/10.3897/tdwgproceedings.1.20380
The YesWorkflow
YesWorkflow also supports dynamic analysis and reporting on the results of the workflow (retrospective provenance) at various levels of granularity (e.g., at the actor level, script level, data level, record level, file level, function level), provided that it has been configured at each of those levels. YesWorkflow includes an @Log annotation, which describes the semantic structure of a log message emitted within some actor in the workflow. The annotation allows the log message to be linked to the actor within which it was created, and allows parts of that log message to be linked to the data passed between actors. After a run of the workflow, YesWorkflow can analyze the log messages and construct a store of facts, which can be queried and reasoned over to make statements about the paths taken by particular data elements through the workflow and about the assertions made on those data elements within the workflow.
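The idea of turning annotated log messages into a queryable store of facts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not YesWorkflow's implementation: the brace-delimited template syntax, the actor names, the field names, and the sample log lines are all hypothetical and chosen for the example.

```python
import re

def template_to_regex(template):
    """Compile a template such as '{record_id} eventDate {status}' into a
    regex with one named group per {variable}. (Hypothetical syntax.)"""
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal text between variables
        else:
            pattern += f"(?P<{part}>\\S+)"    # one whitespace-free value
    return re.compile("^" + pattern + "$")

# Hypothetical log templates, keyed by the actor that emits the message.
LOG_TEMPLATES = {
    "validate_scientific_name": "{record_id} scientificName {status}",
    "validate_event_date": "{record_id} eventDate {status}",
}

def extract_facts(log_lines, templates=LOG_TEMPLATES):
    """Match each log line against the templates and return one fact
    (a dict naming the actor plus the captured fields) per matched line."""
    compiled = {actor: template_to_regex(t) for actor, t in templates.items()}
    facts = []
    for line in log_lines:
        for actor, regex in compiled.items():
            match = regex.match(line.strip())
            if match:
                facts.append({"actor": actor, **match.groupdict()})
                break  # a line is attributed to the first matching actor
    return facts

# Made-up log output from one workflow run.
log = [
    "rec-001 scientificName accepted",
    "rec-001 eventDate corrected",
    "rec-002 scientificName rejected",
]
facts = extract_facts(log)

# Which actors made assertions about record rec-001, in log order?
path = [fact["actor"] for fact in facts if fact["record_id"] == "rec-001"]
print(path)
```

Because each fact records both the emitting actor and the fields captured from the message, the same list can be filtered by actor, by record, or by status, which is the kind of question a record-level lineage report needs to answer.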
Provenance, like other metadata, rarely appears actionable or immediately useful to those who are expected to provide it. However, by refactoring and integrating runtime observables generated from retrospective provenance with context information from prospective provenance analysis into hybrid queries, we show how the two can together yield hybrid visualizations that reveal “the plot” of the whole execution. In this way, a comprehensive workflow graph and a customizable data lineage report, populated with meaningful provenance artifacts, become actionable for a workflow run. Queries run on the facts that YesWorkflow extracts from log messages after a workflow run, combined with the facts extracted from the annotated workflow itself, allow for powerful visualizations of the retrospective provenance of a workflow run and of particular data records within a branching workflow.
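A hybrid query of this kind can be approximated as a join between two sets of facts. The sketch below uses an in-memory SQLite database purely as a stand-in for a fact store; it is not how YesWorkflow stores or queries its facts, and the table names, column names, actor names, and rows are all hypothetical.

```python
import sqlite3

def build_fact_store():
    """Create a small fact store holding prospective facts (what each actor
    is declared to output) and retrospective facts (what was logged)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE actor_output (actor TEXT, data_name TEXT);
        CREATE TABLE log_event (seq INTEGER, actor TEXT,
                                record_id TEXT, status TEXT);
    """)
    # Prospective provenance: facts taken from the annotated workflow itself.
    db.executemany("INSERT INTO actor_output VALUES (?, ?)", [
        ("validate_scientific_name", "name_checked_records"),
        ("validate_event_date", "date_checked_records"),
    ])
    # Retrospective provenance: facts extracted from the run's log messages.
    db.executemany("INSERT INTO log_event VALUES (?, ?, ?, ?)", [
        (1, "validate_scientific_name", "rec-001", "accepted"),
        (2, "validate_scientific_name", "rec-002", "rejected"),
        (3, "validate_event_date", "rec-001", "corrected"),
    ])
    return db

def record_lineage(db, record_id):
    """Hybrid query: join logged events for one record with the declared
    output of the actor that logged them, in the order they occurred."""
    rows = db.execute("""
        SELECT e.actor, e.status, o.data_name
        FROM log_event AS e
        JOIN actor_output AS o ON o.actor = e.actor
        WHERE e.record_id = ?
        ORDER BY e.seq
    """, (record_id,))
    return rows.fetchall()

db = build_fact_store()
for actor, status, data_name in record_lineage(db, "rec-001"):
    print(f"{actor}: {status} -> {data_name}")
```

In this made-up run, the lineage of rec-002 stops after the first actor, while rec-001 passes through both, which is how a record-level query can expose the branch a particular record took.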
Keywords: Biodiversity Informatics, Data Quality, Workflows, Provenance
Qian Zhang