Description
Recent trends in self-supervised representation learning have focused on removing inductive biases from the training process. However, inductive biases can be useful in certain settings, such as medical imaging, where domain expertise can help define a prior over semantic structure. We present Medical DINO (MeDINO), a method that takes advantage of the consistent spatial and semantic structure of unlabeled medical imaging datasets to guide vision transformer attention. MeDINO operates by regularizing the attention masks of separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or provided by a domain expert via a single labeled sample. Using chest X-ray radiographs as a primary case study, we show that the resulting attention masks are more interpretable than those produced by domain-agnostic pretraining, yielding a 58.7 mAP improvement on lung and heart segmentation after self-supervised pretraining. Additionally, our method yields a 2.2 mAUC improvement over domain-agnostic pretraining when the pretrained model is transferred to a downstream chest disease classification task.
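The description does not specify the exact form of the regularizer; as a rough illustration of the idea, a head-wise attention-to-prior matching term consistent with the text might look like the sketch below. The function name `attention_prior_loss`, the tensor shapes, and the cross-entropy-style formulation are assumptions for illustration, not the authors' implementation.

```python
import torch

def attention_prior_loss(attn: torch.Tensor, priors: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a per-head attention regularizer.

    attn:   (batch, heads, patches) -- [CLS]-to-patch attention per head from
            a ViT block, assumed softmax-normalized over patches.
    priors: (heads, patches) -- one spatial prior per head (e.g. a lung or
            heart mask downsampled to the patch grid), normalized to sum to 1.
    """
    # Compare each head's attention distribution to its assigned prior with a
    # cross-entropy term, broadcasting the priors over the batch dimension.
    log_attn = torch.log(attn.clamp_min(1e-8))
    per_head = -(priors.unsqueeze(0) * log_attn).sum(dim=-1)  # (batch, heads)
    return per_head.mean()
```

In this reading, the term would simply be added to the standard self-supervised objective so that each head is softly steered toward its semantic region while the backbone is pretrained.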