Pay-as-you-go Data Cleaning and Integration

Jeffery, Shawn; EECS Department, University of California


Description

Many emerging applications such as Web mash-ups and large-scale sensor deployments seek to make use of large collections of heterogeneous data sources to enable powerful new services. These sources range from traditional sources such as relational databases to emerging sources such as structured data on the Web and streaming sensor data.

To realize the potential of these applications, however, the data from these disparate sources must be cleaned and integrated. For emerging data sources such as the Web and sensors, traditional cleaning and integration techniques are necessary but not sufficient to deal with the unique challenges this data presents. I argue that new techniques, based on the concept of pay-as-you-go, are crucial for incorporating such data sources into applications. This concept provides a framework for building cleaning and integration solutions that are easy to deploy and maintain, efficiently leverage human feedback where possible, and automatically adapt their processing to the underlying data.

In this thesis, I contribute key building blocks designed to provide pay-as-you-go data cleaning and integration. Specifically, I develop the following techniques: Roomba, a technique for effectively involving user feedback to augment data cleaning mechanisms; Metaphysical Data Independence (MDI), a means of hiding all details of sensor data cleaning and integration under a single interface; SMURF, an adaptive cleaning tool for providing MDI for RFID data; and ESP, a declarative query-based cleaning framework for sensor data streams. These techniques all embody key principles that underlie the pay-as-you-go philosophy: ease of setup and deployment, adaptability, and incremental integration.

Additionally, I show that a focus on the pay-as-you-go philosophy does not preclude effective data cleaning and integration mechanisms. Indeed, in many cases the techniques developed in this thesis are capable of producing higher-quality data than current cleaning and integration techniques. For instance, effective use of human feedback is able to integrate data in a large-scale data integration scenario with half the human cost of current approaches. Similarly, an adaptive approach to cleaning RFID data is able to produce a three-fold reduction in data error rate in certain scenarios compared to the state-of-the-art RFID middleware solutions.

In summary, this thesis makes two broad contributions. First, it demonstrates that a pay-as-you-go approach to data cleaning and integration enables an emerging class of applications dependent on data derived from many heterogeneous data sources. Second, it proposes a suite of pay-as-you-go data cleaning and integration techniques that provide a solid foundation on which to build the systems to support these applications.

Details

Title

Pay-as-you-go Data Cleaning and Integration

Creator

Jeffery, Shawn, Author
EECS Department, University of California, Publisher

Published

2008-08-20

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2008-102

Type

Text

Format

technical reports

Extent

191 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
