Mixed-precision computation presents opportunities for programmable accelerators to improve performance and energy efficiency while retaining application flexibility. Building on the Hwacha decoupled vector-fetch accelerator, we introduce high-occupancy vector lanes (HOV), a set of mixed-precision hardware optimizations which support dynamic configuration of multiple architectural register widths and high-throughput operations on packed data. We discuss the implications of HOV for the programming model and describe our microarchitectural approach to maximizing register file utilization and datapath parallelism. Using complete VLSI implementations of HOV in a commercial 28nm process technology, featuring a cache-coherent memory hierarchy with L2 caches and simulated LPDDR3 DRAM modules, we quantify the impact of our HOV enhancements on area, performance, and energy consumption compared to the baseline design, a decoupled vector architecture without mixed-precision support. We observe as much as a 64.3% performance gain and a 61.6% energy reduction over the baseline vector machine on half-precision dense matrix multiplication. We then validate the HOV design against the ARM Mali-T628 MP6 GPU by running a suite of microbenchmarks compiled from the same OpenCL source code using our custom HOV-enabled compiler and the ARM stock compiler.





Download Full History