Recent trends in science and technology augur a rapid increase in the number of computations that scientists run. As volumes grow, so do expectations for the tools that scientists use to manage those computations. Together, these increased volumes and expectations present a new set of problems and opportunities in computation management. In this thesis, I propose Data Centric Scientific Workflow Management Systems (DSWMSs) to address these issues. DSWMSs supersede current approaches by leveraging a deeper understanding of the data manipulated by computations to provide new features and to improve usability and performance. Examples of such features include data provenance, work sharing, and interactive computational steering. I make several contributions towards realizing the concept of a DSWMS. First, in conjunction with scientists from several domains, I propose a set of services that current paradigms do not provide but that DSWMSs make possible. Second, I define an abstract model, the Functional Data Model with Relational Covers (FDM/RC), for representing scientific workloads, along with a language for defining and manipulating instances (schemas) of the model. Third, I design and implement GridDB, a prototype DSWMS. GridDB is deployed on a large cluster at Lawrence Livermore National Laboratory, where it runs science applications at real-world scales. The deployment uncovers a pair of technical problems involving the provision of data provenance and memoization (computational caching), so I also contribute solutions to these problems.



