In this dissertation, we present techniques for generating adversarial examples in order to evaluate defenses under a threat model with an adaptive adversary, with a focus on the task of image classification. We demonstrate our techniques on four proposed defenses and identify new limitations in them.
Next, in order to assess the generality of a promising class of defenses based on adversarial training, we exercise defenses on a diverse set of points near benign examples, other than adversarial examples generated by well known attack methods. First, we analyze a neighborhood of examples in a large sample of directions. Second, we experiment with three new attack methods that differ from previous additive gradient based methods in important ways. We find that these defenses are less robust to these new attacks.
Overall, our results show that current defenses perform better on existing well known attacks, which suggests that we have yet to see a defense that can stand up to a general adversary. We hope that this work sheds light for future work on more general defenses.