Overlay network monitoring enables distributed Internet applications to detect and recover from path outages and periods of degraded performance within seconds. For an overlay network with n end hosts, existing systems either require O(n^2) measurements, and thus lack scalability, or can only estimate the latency but not congestion/failures. Our earlier extended abstract briefly proposes an algebraic approach that selects and monitors k linearly independent paths that can fully describe all the O(n^2) paths. The loss rates (this can also be extended to latency) of these paths can be used to estimate the loss rates of all other paths. Our scheme only assumes knowledge of the underlying IP topology, with links dynamically varying between lossy and normal.

In this paper, we improve, implement and extensively evaluate such a monitoring system. We further make the following contributions: i) scalability analysis which shows that for reasonably large n (e.g., 100), k is only in the range of O(n log n), ii) efficient adaptation algorithms for topology changes, such as the addition/removal of end hosts and interconnections, iii) measurement load balancing scheme for better scalability, iv) topology measurement error handling for better accuracy. Both simulation and Internet experiments demonstrate that we achieve high path loss rate estimation accuracy while adapting to topology changes within seconds and handling topology errors. We can also continuously update the loss rate estimations online. For example, in the Internet experiments, the average update time is 0.16 second for all 2550 paths, the absolute error of loss rate estimation is 0.0027 and the average error factor is 1.1.




Download Full History