This thesis presents a novel dataset of 12,000 segmentations of 1,000 natural images by 30 human subjects. The subjects marked the locations of objects in the images, providing ground truth data for learning grouping cues and benchmarking grouping algorithms. We feel that the data-driven approach is critical for two reasons: (1) the data reflects "ecological statistics" that the human visual system has evolved to exploit, and (2) innovations in computational vision should be evaluated quantitatively.

We develop a battery of segmentation comparison measures that we use both to validate the consistency of the human data and to provide approaches for evaluating grouping algorithms. In conjunction with the segmentation dataset, the various measures provide "micro-benchmarks" for boundary detection algorithms and pixel affinity functions, as well as a benchmark for complete segmentation algorithms. Using these performance measures, we can systematically improve grouping algorithms with the human ground truth as our goal.
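As an illustration of this style of comparison measure, the sketch below computes a tolerant, refinement-aware error between two label maps: for each pixel, it measures how far the segment containing that pixel in one segmentation is from being a subset of the corresponding segment in the other, and takes the smaller of the two directions. This is a minimal numpy sketch of one such consistency measure, not the thesis's exact formulation or code.

```python
import numpy as np

def local_consistency_error(seg1, seg2):
    """Consistency error between two label maps of equal shape.

    For each pixel p, the directional refinement error
        E(S1, S2, p) = |R(S1, p) \\ R(S2, p)| / |R(S1, p)|
    is zero when the segment containing p in S1 is contained in the
    segment containing p in S2. The score averages
    min(E(S1, S2, p), E(S2, S1, p)) over all pixels, so it is zero
    whenever either segmentation locally refines the other.
    """
    s1 = np.asarray(seg1).ravel()
    s2 = np.asarray(seg2).ravel()
    # joint histogram of label co-occurrences between the two maps
    _, inv1 = np.unique(s1, return_inverse=True)
    _, inv2 = np.unique(s2, return_inverse=True)
    joint = np.zeros((inv1.max() + 1, inv2.max() + 1))
    np.add.at(joint, (inv1, inv2), 1)
    size1 = joint.sum(axis=1)  # segment sizes in seg1
    size2 = joint.sum(axis=0)  # segment sizes in seg2
    # per-pixel directional errors, looked up per label pair
    e12 = (size1[inv1] - joint[inv1, inv2]) / size1[inv1]
    e21 = (size2[inv2] - joint[inv1, inv2]) / size2[inv2]
    return float(np.minimum(e12, e21).mean())
```

Because the minimum is taken per pixel, a fine segmentation compared against a coarse one scores zero; such tolerance to refinement is what makes measures of this kind usable for validating human segmentations that differ in granularity.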

Starting at the lowest level, we present local boundary models based on brightness, color, and texture cues, where the cues are individually optimized with respect to the dataset and then combined in a statistically optimal manner with classifiers. The resulting detector is shown to significantly outperform prior state-of-the-art algorithms. Next, we learn from data how to combine the boundary model with patch-based features in a pixel affinity model, settling long-standing debates in computer vision with empirical results: (1) brightness boundaries are more informative than patches, and vice versa for color; (2) texture boundaries and patches are the two most powerful cues; (3) proximity is not a useful cue for grouping but rather a byproduct of the grouping process; and (4) boundary-based and region-based approaches both provide significant independent information for grouping.
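The cue-combination step described above can be sketched as a logistic model fit against human-labeled boundaries: per-pixel cue responses (e.g. brightness, color, and texture gradients) are the inputs, and the output is a posterior probability of boundary. The cue names, feature scaling, and training loop below are illustrative assumptions, not the thesis's exact features or optimizer.

```python
import numpy as np

def train_cue_combiner(features, labels, lr=0.1, epochs=200):
    """Fit logistic-regression weights that combine cue responses
    into a boundary probability, trained on 0/1 ground-truth labels.

    features: (n_pixels, n_cues) array of cue responses in [0, 1]
              (e.g. brightness, color, texture gradients -- illustrative).
    labels:   (n_pixels,) array of human-marked boundary labels.
    Returns (weights, bias).
    """
    x = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # sigmoid of cue sum
        grad_w = x.T @ (p - y) / y.size          # log-loss gradient
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def boundary_probability(features, w, b):
    """Posterior probability of boundary given the combined cues."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(features) @ w + b)))
```

A linear-in-the-cues model of this form has the practical advantage that the learned weights directly expose the relative informativeness of each cue, which is how empirical comparisons like those in points (1) and (2) can be read off the fitted model.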
