Many questions in artificial intelligence have been answered with data-dependent solutions, with deep neural networks (DNNs) serving as the data processors. Given the right combination of hyperparameters, dataset, and architecture, deep networks loosely mimic the human brain through artificial neurons and perform a wide variety of tasks in computer vision (CV) and natural language processing (NLP). They serve as tools in data analytics applications including self-driving, language translation, medical diagnosis, and stock market trading signals. It is natural to assume that a network’s representational power must scale in complexity with the tasks or dataset it processes. In practice, however, increasing the amount of data or the number of layers and parameters is not always the answer. In resource-constrained settings, training deep networks for an extended period of time is not only intractable but also unfavorable, and redundancies in the network architecture can degrade test-time performance.

This motivates a more comprehensive view of the inner workings of a deep neural network, taking a deep dive into each of its components. A common approach is to examine the weights directly, but this risks missing information carried by the network structure. Taking the middle ground, we analyze structural characteristics arising from layerwise spectral distributions in order to explain network performance and inform training procedures. We find that (1) allocating learning rate across layers based on measurements of their spectral distributions yields larger improvements on “vanilla” architectures such as VGG19, i.e. networks without built-in interactions among layers; and (2) using the same measurements to inform channel pruning on DenseNet40 lets the model implicitly identify its “bottleneck” layers and maintain higher accuracy.
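The layerwise idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the thesis’s actual method: it uses *stable rank* (a simple summary of a weight matrix’s singular-value spectrum) as a stand-in spectral measurement, and the allocation rule, function names, and the inverse-proportional weighting are all hypothetical choices made for the example.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 -- one scalar summary of a
    layer's singular-value spectrum (an illustrative stand-in for the
    spectral measurements described in the text)."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** 2).sum() / (s[0] ** 2)

def layerwise_lrs(weights, base_lr=0.1):
    """Hypothetical allocation rule: layers whose spectra suggest less
    'capacity' (lower stable rank) receive a proportionally larger share
    of the learning rate; the mean per-layer rate stays at base_lr."""
    ranks = np.array([stable_rank(W) for W in weights])
    inv = 1.0 / ranks
    return base_lr * inv / inv.mean()

# Example: three random layer matrices stand in for a trained network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(3)]
lrs = layerwise_lrs(weights)
```

The same per-layer scores could rank channels or layers for pruning, as in finding (2); in a real implementation the spectral metric and allocation rule would be those developed in the thesis.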
