Learning from Subsampled Data: Active and Randomized Strategies

Wauthier, Fabian; EECS Department, University of California

Description

In modern statistical applications, we are often faced with situations where there is either too little or too much data. Both extremes can be troublesome: interesting models can only be learnt when sufficient amounts of data are available, yet these models tend to become intractable when data is abundant. An important thread of research addresses these difficulties by subsampling the data prior to learning a model. Subsampling can be active (i.e., active learning) or randomized. While both of these techniques have a long history, a direct application to novel situations is in many cases problematic. This dissertation addresses some of these issues. We begin with an active learning strategy for spectral clustering when the cost of assessing individual similarities is substantial or prohibitive. We give an active spectral clustering algorithm which iteratively adds similarities based on information gleaned from a partial clustering and which improves over common alternatives. Next, we consider active learning in Bayesian models. Complex Bayesian models often require an MCMC-based method for inference, which makes a naive application of common active learning strategies intractable. We propose an approximate active learning method which reuses samples from an existing MCMC chain in order to speed up the computations. Our third contribution looks at the effects of randomized subsampling on Gaussian process models that make predictions about outliers and rare events. Randomized subsampling risks making outliers even rarer, which, in the context of Gaussian process models, can lead to overfitting. We show that heavy-tailed stochastic processes can be used to improve the robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. Finally, we turn to a theoretical evaluation of randomized subsampling for the purpose of inferring rankings of objects.
We present two simple algorithms that predict a total order over n objects from a randomized subsample of binary comparisons. In expectation, the algorithms match an Omega(n) lower bound on the sample complexity for predicting a permutation with fixed expected Kendall tau distance. Furthermore, we show that given O(n log n) samples, one algorithm recovers the true ranking with uniform quality, while the other predicts the ranking more accurately near the top than the bottom. Due to their simple form, the algorithms can be easily extended to online and distributed settings.
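
To make the ranking setting above concrete, here is a minimal Python sketch of predicting a total order from a random subsample of pairwise comparisons and scoring the result by Kendall tau distance. It is an illustration only and is not taken from the dissertation: the wins-minus-losses scoring rule, the noiseless comparison outcomes, and the names `sample_comparisons`, `rank_by_score`, and `kendall_tau_distance` are all assumptions made for this example.

```python
import random
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Count the object pairs that the two orders rank in opposite directions."""
    pos_a = {obj: i for i, obj in enumerate(order_a)}
    pos_b = {obj: i for i, obj in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def sample_comparisons(true_order, p, rng):
    """Observe each pair independently with probability p (assumed noiseless)."""
    observed = []
    for i, j in combinations(range(len(true_order)), 2):
        if rng.random() < p:
            # true_order[i] precedes true_order[j], so it wins the comparison.
            observed.append((true_order[i], true_order[j]))
    return observed


def rank_by_score(objects, comparisons):
    """Order objects by (wins - losses) over the observed comparisons.

    Hypothetical scoring rule chosen for illustration; ties keep input order.
    """
    score = {obj: 0 for obj in objects}
    for winner, loser in comparisons:
        score[winner] += 1
        score[loser] -= 1
    return sorted(objects, key=lambda obj: -score[obj])


if __name__ == "__main__":
    rng = random.Random(0)
    true_order = list(range(50))
    rng.shuffle(true_order)
    comparisons = sample_comparisons(true_order, p=0.3, rng=rng)
    predicted = rank_by_score(sorted(true_order), comparisons)
    print("Kendall tau distance:", kendall_tau_distance(true_order, predicted))
```

When every pair is observed (p = 1.0) and outcomes are noiseless, the scores are strictly decreasing along the true order, so this sketch recovers it exactly. Lowering p shows how the Kendall tau distance grows as fewer comparisons are sampled.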

Details

Title

Learning from Subsampled Data: Active and Randomized Strategies

Creator

Wauthier, Fabian, Author
EECS Department, University of California, Publisher

Published

2013-05-17

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2013-94

Type

Text

Format

technical reports

Extent

99 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
