Description
Inspired by the observation that a backdoor trigger acts as a shortcut that samples can take across a deep neural network's decision boundary, we build on the rich literature connecting a model's adversarial robustness to its internal structure and show that the same robustness properties can be used to identify whether or not a model contains a backdoor. Specifically, we demonstrate that backdooring a deep neural network thins and tilts its decision boundary, yielding a more sensitive and less robust classifier.
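To make the central quantity concrete: boundary thickness is standardly measured (in the sense of Yang et al., 2020) as the length of the portion of the segment between a clean input and an adversarial counterpart on which the prediction margin stays within a fixed band. The sketch below is a minimal PyTorch illustration of that definition, not the thesis's actual implementation; `model` (a classifier returning logits), `x_clean`, `x_adv`, and the class indices are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_thickness(model, x_clean, x_adv, cls_i, cls_j,
                       alpha=0.0, beta=0.75, num_points=128):
    """Estimate boundary thickness along the segment from x_clean to x_adv.

    Thickness is the length of the part of the segment on which the
    prediction margin g_ij = p_i - p_j lies inside (alpha, beta).
    """
    # Interpolation coefficients broadcast over the input dimensions.
    ts = torch.linspace(0.0, 1.0, num_points).view(-1, *([1] * x_clean.dim()))
    segment = (1.0 - ts) * x_clean + ts * x_adv        # interpolated inputs
    with torch.no_grad():
        probs = F.softmax(model(segment), dim=-1)      # (num_points, num_classes)
    margin = probs[:, cls_i] - probs[:, cls_j]
    # Fraction of the segment inside the margin band, scaled by its length.
    inside = ((margin > alpha) & (margin < beta)).float().mean()
    return (x_clean - x_adv).norm().item() * inside.item()
```

A thinner boundary (smaller value of this measure) indicates that small perturbations more easily flip the prediction, which is the sensitivity signal the thesis associates with backdoored models.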
In addition to a simpler proof-of-concept demonstration for computer vision models on the MNIST dataset, we build an end-to-end pipeline that distinguishes clean from backdoored models based on their boundary thickness and boundary tilting, and we evaluate it on the TrojAI competition benchmark for NLP models. We hope that this thesis will advance our understanding of the links between adversarial robustness and defenses against backdoor attacks, and also inspire future research exploring the relationship between adversarial perturbations and backdoor triggers.
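As a rough illustration of how such a detection pipeline could be wired together downstream of feature extraction, per-model boundary statistics can be fed to an off-the-shelf binary classifier that labels each model as clean or backdoored. The feature values below are toy numbers for illustration only, and the use of logistic regression is an assumption, not necessarily the detector used in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds per-model robustness features, e.g.
# [mean boundary thickness, boundary tilting statistic]; the numbers are
# toy values for illustration, not measurements from the thesis.
features = np.array([[0.42, 0.10], [0.38, 0.12], [0.15, 0.55], [0.12, 0.60]])
labels = np.array([0, 0, 1, 1])   # 0 = clean model, 1 = backdoored model

detector = LogisticRegression().fit(features, labels)

# A new model whose boundary is thin and strongly tilted gets flagged.
print(detector.predict([[0.14, 0.58]]))   # -> [1]
```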