Tensor computations are becoming increasingly important with the emergence of fields such as AI, data analytics, and robotics. Memory access cost is the performance bottleneck for these workloads. New architectures with specialized memory layouts and parallelizable elements are being designed for faster computation. To fully exploit such an architecture’s capabilities and achieve the maximum improvement in performance, an optimal communication-avoiding mapping from algorithm to hardware is needed. Manually finding this hardware-specific, energy-efficient mapping is time-consuming and requires expertise in multiple domains. Traditional optimization methods like gradient descent are unsuccessful in finding an optimal mapping because the mapping space is non-smooth and non-convex. Other ML-based, feedback-driven approaches find good solutions, but do not generalise well to new architectures.

In this paper, we propose using GPTune, an autotuning framework based on Bayesian optimization, to navigate this search space. Our experiments show that GPTune finds efficient mappings in far fewer iterations than Timeloop-mapper’s random search. GPTune also builds surrogate models that can be used for transfer learning and, potentially, to reduce the dimensionality of the mapspace. Furthermore, this paper analyses which mapspace encodings work best for tuning.



