Deep neural networks (DNNs) have rapidly advanced the state of the art on many important, difficult problems. However, recent research has shown that they are vulnerable to adversarial examples: small worst-case perturbations to a model's input can cause the model to produce incorrect outputs. Subsequent work has proposed a variety of ways to defend DNN models from adversarial examples, but many defenses are not adequately evaluated against general adversaries.
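To make the notion of a worst-case perturbation concrete, the following is a minimal sketch of the fast gradient sign method (FGSM), a standard attack in this literature, applied to a toy linear classifier. The model, weights, and constants are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

# Hypothetical toy linear classifier: predict sign(w @ x).
# All weights and inputs here are illustrative.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.4])   # benign input, classified +1

def predict(p):
    return 1 if w @ p > 0 else -1

# For the logistic loss with true label y = +1, the gradient of the
# loss w.r.t. x is proportional to -w, so the FGSM step
# x_adv = x + eps * sign(grad) reduces to:
eps = 0.25
grad_sign = -np.sign(w)          # sign of d(loss)/dx for y = +1
x_adv = x + eps * grad_sign      # L-infinity perturbation of size eps

print(predict(x), predict(x_adv))  # prediction flips: 1 -> -1
```

Even though each coordinate moves by at most 0.25, the perturbation is aligned with the loss gradient, so the classifier's decision flips; this worst-case alignment is what distinguishes adversarial perturbations from random noise.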

In this dissertation, we present techniques for generating adversarial examples in order to evaluate defenses under a threat model with an adaptive adversary, with a focus on the task of image classification. We demonstrate our techniques on four proposed defenses and identify new limitations in them.

Next, in order to assess the generality of a promising class of defenses based on adversarial training, we exercise defenses on a diverse set of points near benign examples, beyond the adversarial examples generated by well-known attack methods. First, we analyze the neighborhood of benign examples along a large sample of directions. Second, we experiment with three new attack methods that differ in important ways from previous additive, gradient-based methods. We find that these defenses are less robust to these new attacks.
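The neighborhood analysis above can be sketched as follows: sample many random unit directions around a benign input and measure how often the model's label survives a perturbation of a fixed radius. The toy linear classifier and all constants are illustrative assumptions.

```python
import numpy as np

# Sketch of probing a model's neighborhood in many random directions.
# The toy linear classifier and constants are illustrative assumptions.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.4])   # benign input, classified +1

def predict(p):
    return 1 if w @ p > 0 else -1

radius = 0.5
n_dirs = 1000
dirs = rng.normal(size=(n_dirs, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions

# Fraction of sampled neighbors at this radius that keep the label.
stable = sum(predict(x + radius * d) == predict(x) for d in dirs) / n_dirs
print(f"label preserved in {stable:.0%} of sampled directions")
```

A defense that is robust only along the directions found by known attacks may still misclassify many of these randomly sampled neighbors, which is the gap this kind of analysis is meant to expose.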

Overall, our results show that current defenses perform better against existing, well-known attacks than against new ones, which suggests that we have yet to see a defense that can stand up to a general adversary. We hope that this work informs future work on more general defenses.



