Description
The increasing prevalence of a unified architecture for machine learning, i.e., the transformer, raises an important question: can a single architecture really do it all? Simultaneously, the growing size of datasets and deep learning models has made faster, more memory-efficient training crucial. One recently proposed line of work is reversible networks, which leverage reversible transformations to reconstruct inputs exactly from outputs while requiring minimal changes to existing architectures. In this work, we present an in-depth analysis of reversible transformers and demonstrate that they can be more accurate, more memory-efficient, and faster than their vanilla counterparts. We introduce a new method of reversible backpropagation that is faster and scales better in memory than previous techniques, and we also present new results showing that reversible transformers transfer better to downstream visual tasks.
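To make the core idea concrete, below is a minimal PyTorch sketch of a two-stream reversible block in the additive-coupling style popularized by RevNets. The `ReversibleBlock` class and the `f`/`g` sub-module names are illustrative assumptions rather than this work's actual implementation; the point is only that, because the inverse is exact, intermediate activations can be recomputed from outputs during the backward pass instead of being stored.

```python
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Illustrative two-stream reversible block (RevNet-style additive coupling).

    Forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
    Inverse:  x2 = y2 - G(y1),  x1 = y1 - F(x2)
    """

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g., an attention sub-block in a transformer
        self.g = g  # e.g., an MLP sub-block

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Exact reconstruction of the inputs from the outputs, so the
        # forward activations never need to be cached for backprop.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


if __name__ == "__main__":
    d = 64
    block = ReversibleBlock(nn.Linear(d, d), nn.Linear(d, d))
    x1, x2 = torch.randn(8, d), torch.randn(8, d)
    with torch.no_grad():
        y1, y2 = block(x1, x2)
        r1, r2 = block.inverse(y1, y2)
    # Inputs are recovered up to floating-point error.
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```

Trading this recomputation for storage is what lets reversible backpropagation keep activation memory roughly constant in network depth, at the cost of extra forward computation during the backward pass.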