In real-world data science and machine learning, data are inevitably imperfect. Data contamination arises from many sources. It may come from human error, which can be avoided with more care. However, it may also come from sources such as systematic measurement error and adversarial data poisoning, which are hard to avoid and even to detect. Consequently, there is a need for statistical methods that remain reliable despite such contamination. Formally speaking, we want to design efficient algorithms that provide provable guarantees for learning problems under certain models of contamination.

In this article, we examine some important techniques in the recent development of efficient algorithms for robust statistics, namely filtering-based methods and sum-of-squares techniques. Specifically, we focus on the problem of learning linear models (including linear regression, generalized linear models, etc.) under the strong contamination model. We fully present and analyze the assumptions and guarantees of SEVER [DKK+19] and of the sum-of-squares-based algorithm for robust linear regression in [KKM20]. SEVER is a meta-algorithm that takes a well-conditioned base learner and outputs an outlier-robust version of it. The [KKM20] robust linear regression algorithm is an elegant and simple application of sum-of-squares techniques to robust regression in general, including l1, l2, and polynomial regression. Both algorithms have an O(√ε) error dependence on the outlier fraction ε. We present and prove the theoretical guarantees of these algorithms, which shed light on future directions in which the error dependence and the required assumptions can be improved.
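To make the filtering idea concrete, the following is a minimal sketch of a SEVER-style filtering loop for least-squares regression: fit the base learner, score each point by its projection onto the top singular direction of the centered per-point gradients, and discard the highest-scoring points. The function name, the fixed number of rounds, and the hard quantile-based removal rule are illustrative simplifications for this sketch, not the exact (randomized) filter analyzed in [DKK+19].

```python
import numpy as np

def sever_filter_sketch(X, y, eps=0.1, rounds=4):
    """Simplified SEVER-style filtering for least squares (illustrative only)."""
    idx = np.arange(len(y))
    w = None
    for _ in range(rounds):
        Xs, ys = X[idx], y[idx]
        # Base learner: ordinary least squares on the currently kept points.
        w, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        # Per-point gradients of the squared loss at w.
        grads = 2 * (Xs @ w - ys)[:, None] * Xs
        centered = grads - grads.mean(axis=0)
        # Top right-singular vector of the centered gradient matrix.
        v = np.linalg.svd(centered, full_matrices=False)[2][0]
        # Outlier scores: squared projections onto that direction.
        scores = (centered @ v) ** 2
        # Crude filter: drop the eps-fraction of points with the largest scores.
        keep = scores <= np.quantile(scores, 1 - eps)
        if keep.all():
            break
        idx = idx[keep]
    return w

# Usage: a clean linear model plus a small fraction of gross outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
y[:10] += 50.0  # contaminate 5% of the responses
w_hat = sever_filter_sketch(X, y, eps=0.1)
```

Because the contaminated points have anomalously large gradients, they dominate the top singular direction and are removed in the first round, after which the base learner recovers the clean fit.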



