Backdoor attacks recently brought a new class of deep neural network vulnerabilities to light. In a backdoor attack, an adversary poisons a fraction of the model's training data with a backdoor trigger, flips those samples' labels to some target class, and trains the model on this poisoned dataset. By presenting the same backdoor trigger after an unsuspecting user deploys the model, the adversary gains control over the deep neural network's behavior. As both theory and practice increasingly turn to transfer learning, where users download and integrate massive pre-trained models into their setups, backdoor attacks present a serious security threat. Recently published attacks can survive downstream fine-tuning and even generate context-aware trigger patterns that evade outlier-detection defenses.
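The poisoning step described above can be sketched in a few lines. This is a minimal illustration, not any specific published attack: the white corner patch used as the trigger, the patch size, and the poisoning fraction are all illustrative choices.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_frac=0.1, rng=None):
    """Stamp a trigger patch onto a random fraction of the images and
    flip their labels to the attacker's target class.

    `images` is an (N, H, W) float array in [0, 1]. The 3x3 white patch
    in the corner is an illustrative trigger, not a fixed detail of any
    particular attack.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_frac * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = 1.0   # backdoor trigger: white patch in the corner
    labels[idx] = target_class    # flip poisoned samples to the target class
    return images, labels, idx
```

A model trained on the returned dataset behaves normally on clean inputs, but any test image stamped with the same patch is steered toward `target_class`.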

Inspired by the observation that a backdoor trigger acts as a shortcut that samples can take to cross a deep neural network's decision boundary, we build on the rich literature connecting a model's adversarial robustness to its internal structure and show that the same properties can be used to identify whether a model contains a backdoor. Specifically, we demonstrate that backdooring a deep neural network thins and tilts its decision boundary, resulting in a more sensitive and less robust classifier.
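Boundary thickness, one of the two signals above, can be estimated by walking the segment between a clean input and an adversarial one and measuring how much of it lies in the "uncertain" band of the classifier's prediction margin. The sketch below follows the general definition from the boundary-thickness literature (Yang et al., 2020); the threshold values, the `predict_probs` interface, and vector-valued inputs are assumptions made for illustration.

```python
import numpy as np

def boundary_thickness(predict_probs, x_clean, x_adv, i, j,
                       alpha=0.0, beta=0.75, num_points=128):
    """Estimate boundary thickness along the segment from x_clean to x_adv.

    We measure the fraction of the segment where the prediction margin
    g(x) = p_i(x) - p_j(x) falls inside (alpha, beta), scaled by the
    segment's length. `predict_probs` maps a batch of 1-D inputs to class
    probabilities; (alpha, beta) = (0, 0.75) are illustrative defaults.
    """
    ts = np.linspace(0.0, 1.0, num_points)[:, None]
    points = (1 - ts) * x_clean[None] + ts * x_adv[None]  # interpolate
    probs = predict_probs(points)
    margin = probs[:, i] - probs[:, j]
    inside = (alpha < margin) & (margin < beta)
    seg_len = np.linalg.norm(x_adv - x_clean)
    return seg_len * inside.mean()
```

A backdoored model, being more sensitive near its decision boundary, tends to produce smaller thickness values than a clean counterpart, which is the gap the detection signal exploits.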

In addition to a simpler proof-of-concept demonstration for computer vision models on the MNIST dataset, we build an end-to-end pipeline for distinguishing between clean and backdoored models based on their boundary thickness and boundary tilting, and evaluate it on the TrojAI competition benchmark for NLP models. We hope that this thesis will advance our understanding of the links between adversarial robustness and defending against backdoor attacks, and also inspire future research exploring the relationship between adversarial perturbations and backdoor triggers.
