On the Clustering of Web Content for Efficient Replication

Katz, Randy H.; Computer Science Division; Chen, Yan; Qiu, Lili; Chen, Weiyu; Nguyen, Luan

Description

Content distribution networks (CDNs) that offer hosting services to Web content providers are increasingly widely deployed. In this paper, we first compare the uncooperative pull-based replication of Web content used by commercial CDNs with the cooperative push-based approach. Our results show that the latter achieves comparable user-perceived performance with much less replication and update traffic (4-5% of that in the former scheme). Motivated by this observation, we explore how to push content to CDN nodes efficiently. Using trace-driven simulation, we show that replicating content in units of URLs can yield a 60-70% reduction in clients' latency compared to replicating in units of Web sites. However, such fine-grained replication is very expensive to perform.

To address this issue, we propose to replicate content in units of clusters, each containing objects that are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and evaluate their performance using various topologies and several real traces from large Web servers. Our results show that cluster-based replication achieves a 40-60% improvement over full Web site replication. In addition, by adjusting the number of clusters, we can smoothly trade management and computation cost against client performance.
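As an illustration only, the idea of grouping objects requested by nearby clients can be sketched as follows. This is not one of the report's three clustering techniques, which the abstract does not specify. The sketch assumes each URL is described by a hypothetical vector of request counts per client group, and it uses a simple greedy threshold rule; the names `cluster_urls`, `max_dist`, and the sample data are all invented for the example.

```python
from math import sqrt


def normalize(vec):
    """Scale request counts so they sum to 1 (assumed access-share vector)."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


def distance(a, b):
    """Euclidean distance between two access-share vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_urls(access, max_dist):
    """Greedy sketch: a URL joins the first cluster whose seed vector lies
    within max_dist; otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_urls)
    for url in sorted(access):
        vec = normalize(access[url])
        for seed, members in clusters:
            if distance(seed, vec) <= max_dist:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]


# Hypothetical request counts from three client groups for four URLs.
access = {
    "/index.html": [90, 10, 0],
    "/news.html": [80, 20, 0],
    "/video.mpg": [0, 5, 95],
    "/audio.mp3": [5, 0, 95],
}
clusters = cluster_urls(access, max_dist=0.3)
print(clusters)
```

In this sketch, lowering `max_dist` produces more, smaller clusters and raising it produces fewer, larger ones, which mirrors the trade-off between management cost and client performance described above.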

To account for changes in users' access patterns, we also explore incremental clustering, which adaptively adds new documents to the content clusters. We examine both offline and online incremental clustering: the former assumes that access history is available, while the latter predicts access patterns from the hyperlink structure. Our results show that offline clustering performs close to complete re-clustering at much lower overhead. Online incremental clustering and replication reduce the retrieval cost by a factor of 4.6 to 8 compared to no replication and random replication, so they are especially useful for improving document availability during flash crowds.

Details

Title

On the Clustering of Web Content for Efficient Replication

Creator

Katz, Randy H., Author
Computer Science Division, Publisher
Chen, Yan, Author
Qiu, Lili, Author
Chen, Weiyu, Author
Nguyen, Luan, Author

Published

2002

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

CSD-02-1193

Type

Text

Format

technical reports

Extent

21 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
