Description
Distillation is a common tool for compressing models, accelerating training, and improving model performance. A model trained via distillation can often exceed the accuracy of a model with the same architecture trained from scratch. Surprisingly, however, we find that distillation incurs significant accuracy penalties for EfficientNet and MobileNet. We offer a hypothesis for why this happens and propose Masked Layer Distillation, a new training algorithm that recovers a significant portion of the lost accuracy and also translates well to other models such as ResNets and VGGs. As an additional benefit, we find that our method accelerates training by 2x to 5x and is robust to adverse initialization schemes.
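For context, the sketch below shows standard knowledge distillation (a temperature-scaled soft-target loss combined with a hard-label loss), which is the baseline technique this work builds on; it is not the Masked Layer Distillation algorithm itself. The temperature `T` and weight `alpha` are illustrative hyperparameters, not values taken from this work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard knowledge-distillation loss: soft teacher-matching term + hard label term."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```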