Functional dependency is one of the most extensively researched subjects in database theory, originally for improving the quality of schemas and, more recently, for improving the quality of data. In a pay-as-you-go data integration system, where the goal is to provide best-effort service even without a thorough understanding of the underlying domain and the various data sources, functional dependencies can play an even more important role: they can be applied in normalizing an automatically generated mediated schema, pinpointing sources of low quality, resolving conflicts in data from different sources, improving the efficiency of query answering, and so on. Despite their importance, discovering functional dependencies in such a context is challenging: we cannot assume upfront domain knowledge for specifying dependencies, and the data can be dirty, incomplete, or even misinterpreted, all of which makes automatic discovery of dependencies hard.
This paper studies how one can automatically discover functional dependencies in a pay-as-you-go data integration system. We introduce the notion of probabilistic functional dependencies (pFDs) and design Bayes models that compute probabilities of dependencies according to data from various sources. As an application, we study how to normalize a mediated schema based on the pFDs we generate. Experiments on real-world data sets with tens or hundreds of data sources show that our techniques obtain high precision and recall in dependency discovery and generate high-quality results in mediated-schema normalization.
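To make the idea of a dependency that holds only with some probability more concrete, the following is a minimal sketch. It is not the Bayes models developed in the paper; it assumes a much simpler estimate, in which each source reports the fraction of its rows that agree with the most common right-hand-side value for their left-hand-side value, and sources are merged by a row-count-weighted average. All names (`fd_support`, `merged_support`, the `zip` and `city` attributes, and the sample rows) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fd_support(rows, lhs, rhs):
    """Return (support, usable_row_count) for the candidate dependency lhs -> rhs.

    Support is the fraction of usable rows that agree with the most common
    rhs value observed for their lhs value. Rows missing either attribute
    are skipped, since the data may be incomplete.
    """
    groups = defaultdict(Counter)
    for row in rows:
        x, y = row.get(lhs), row.get(rhs)
        if x is None or y is None:
            continue  # incomplete row: ignore rather than count as a violation
        groups[x][y] += 1

    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return None, 0
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / total, total


def merged_support(sources, lhs, rhs):
    """Combine per-source support with a row-count-weighted average.

    This weighting is an assumption of the sketch, not the paper's model.
    """
    weighted_sum, rows_seen = 0.0, 0
    for rows in sources:
        support, count = fd_support(rows, lhs, rhs)
        if support is not None:
            weighted_sum += support * count
            rows_seen += count
    return weighted_sum / rows_seen if rows_seen else None


# Hypothetical sample data from two sources.
source_a = [
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "10001", "city": "New York"},
]
source_b = [
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Oakland"},   # conflicts with the row above
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": None},        # incomplete row, skipped
]

estimate = merged_support([source_a, source_b], "zip", "city")
```

In this toy data, `zip -> city` holds for all three usable rows of the first source and for two of the three usable rows of the second, so the merged estimate is 5/6. A dirty source lowers the estimate without ruling the dependency out, which is the intuition behind attaching a probability to a dependency rather than treating it as strictly true or false.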
Discovering Functional Dependencies in Pay-As-You-Go Data Integration Systems