Data plays a crucial role in society today. As the cost of collecting, storing, and processing data decreases, more and more of it is being collected and fed into complex analysis tools to obtain actionable results and insights. These are in turn being used to drive decisions that affect the lives of countless people, in good ways and bad. It is imperative that data scientists properly record the provenance of the results they publish, i.e., that they record the original sources of data and the exact sequence of operations performed on those sources to arrive at the published result. Doing so ensures that results are properly contextualized and, more importantly, that they can be verified by other scientists. It also fosters collaboration and leads to the standardization of common data operations and data transfer formats.
Unfortunately, this practice is not the norm in many scientific fields. We contend that this is because the tools available today for recording provenance information are inadequate. To fill this void, we present a set of tools and systems for recording and publishing data provenance information. These are built on top of the Git version control system and are geared towards data scientists in all research fields who publish the results of their research. We call this the Mezuri Data Provenance Management Platform, or Mezuri Provenance for short. Researchers can use these tools to annotate their existing data processing tools and workflows with provenance information. They can then publish this information, potentially along with the actual implementation, on our public registry.
Title
The Mezuri Data Provenance Management Platform
Published
2017-07-24
Full Collection Name
Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers
EECS-2017-131
Type
Text
Extent
59 p
Archive
The Engineering Library
Usage Statement
Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).