Summarizing large-scale, multiple-document news data: sparse methods & human validation

Miratrix, Luke; Jia, Jinzhu; Gawalt, Brian; Yu, Bin; El Ghaoui, Laurent

Description

News media significantly drives the course of events, and understanding how has long been an active and important area of research. As the amount of online news media grows, there is ever more information calling for analysis and an ever-increasing range of inquiries that one might conduct. We believe that subject-specific summarization of multiple news documents at once can help. In this paper we adapt scalable statistical techniques to perform this summarization under a predictive framework using a vector space model of documents. We reduce corpora of many millions of words to a few representative key-phrases that describe a specified subject of interest. We propose this as a tool for news media study.

We consider the efficacy of four different feature selection approaches (phrase co-occurrence, phrase correlation, $L^1$ regularized logistic regression (L1LR), and $L^1$ regularized linear regression (Lasso)) under many different pre-processing choices. To evaluate these summarizers we establish a survey in which non-expert human readers rate the generated summaries. Data pre-processing decisions are important, so we also study the impact of several different techniques for vectorizing the documents and for identifying which documents concern a subject.

We find that the Lasso, which consistently produces high-quality summaries across the many pre-processing schemes and subjects, is the best choice of feature selection engine. Our findings also reinforce the many years of work suggesting that the tf-idf representation is a strong choice of vector space, but only for longer units of text.

Though we focus here on print media (newspapers), our methods are general and could be applied to any corpus, even one of considerable size.
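The predictive framing in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single words stand in for phrases, that a document "concerns" the subject when the subject word appears in it, a tf-idf weighting of the form tf × log(n/df), and a plain coordinate-descent Lasso solver. The corpus `TOY_DOCS`, the function names, and the penalty value `lam=0.05` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the report works with far larger newspaper corpora.
TOY_DOCS = [
    "china trade talks resume in beijing",
    "beijing announces new china export rules",
    "china and beijing officials discuss trade",
    "local election results announced today",
    "new stadium opens downtown today",
    "city council discusses local budget",
]

def tfidf_matrix(docs, exclude):
    """Return (vocab, rows); rows[i][j] is tf * log(n / df) for vocab[j] in docs[i]."""
    tokenized = [[w for w in d.lower().split() if w not in exclude] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, rows

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1.

    Columns and labels are centered so the intercept b0 drops out.
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_mean = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    Xc = [[X[i][j] - col_mean[j] for j in range(p)] for i in range(n)]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that ignores feature j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    return beta

def summarize(docs, subject, lam=0.05, k=3):
    """Return up to k words whose Lasso coefficients most favor subject documents."""
    # Label +1 if the subject word appears in the document, else -1 (an assumption).
    y = [1.0 if subject in d.lower().split() else -1.0 for d in docs]
    # The subject word is excluded so it cannot trivially predict its own label.
    vocab, X = tfidf_matrix(docs, exclude={subject})
    beta = lasso(X, y, lam)
    ranked = sorted(((b, w) for b, w in zip(beta, vocab) if b > 0), reverse=True)
    return [w for _, w in ranked[:k]]

print(summarize(TOY_DOCS, "china"))
```

The $L^1$ penalty drives most coefficients to exactly zero, so the few words that survive with positive weight serve as the summary. Raising `lam` shortens the list, and a sufficiently large value returns no words at all.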

Details

Title

Summarizing large-scale, multiple-document news data: sparse methods & human validation

Creator

Miratrix, Luke, Author
Jia, Jinzhu, Author
Gawalt, Brian, Author
Yu, Bin, Author
El Ghaoui, Laurent, Author

Published

Statistics Department, University of California, Berkeley, Berkeley, California, May 2011

Full Collection Name

Statistics Technical Reports

Other Identifiers

801

Type

Text

Format

technical reports

Extent

30 pages

Archive

Mathematics Statistics Library

Standard Rights Statement

Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user.

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).
