The growth of machine learning workloads, specifically deep neural networks (DNNs), in both warehouse-scale computing (WSC) and edge mobile computing has driven strong demand for many types of accelerators. This project explores the levels of parallelism available when running deep learning inference on heterogeneous architectures, and characterizes how distinct accelerators can be coordinated across varying workloads. We implement an accelerated depthwise convolution kernel on a vector accelerator and explore the design space of executing MobileNetv2 in different configurations on an architecture containing both a systolic-array accelerator and a vector accelerator. This work examines shared-resource contention at the memory level on this architecture and analyzes the effects of model pipelining and batch parallelism. Through layer-by-layer performance and cache analysis, we identify the best parameters and configurations for MobileNetv2 inference, observing 1.4x and 3.5x speedups over a naively accelerated baseline on single-core and multi-core SoCs, respectively.
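For context on the kernel discussed above, the following is a minimal reference sketch of depthwise convolution in NumPy (valid padding, configurable stride). It illustrates the per-channel structure that distinguishes depthwise from standard convolution; the function name and shapes are illustrative assumptions, not the accelerated implementation from this work.

```python
import numpy as np

def depthwise_conv2d(x, w, stride=1):
    """Depthwise 2D convolution with valid padding.

    x: input of shape (H, W, C)
    w: per-channel filters of shape (K, K, C) -- one KxK filter per channel,
       applied independently (no cross-channel accumulation).
    """
    H, W, C = x.shape
    K = w.shape[0]
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    out = np.zeros((Ho, Wo, C), dtype=x.dtype)
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i * stride:i * stride + K, j * stride:j * stride + K, :]
            # Elementwise multiply and reduce over the spatial window only;
            # each output channel depends solely on its input channel.
            out[i, j, :] = np.sum(patch * w, axis=(0, 1))
    return out
```

Because each channel is convolved independently, the multiply-accumulate count is a factor of C lower than a standard convolution of the same shape, which is why depthwise layers map naturally onto a vector accelerator rather than a systolic array tuned for dense matrix multiplication.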



