As performance improvements from transistor process scaling have slowed, microprocessor designers have increasingly turned to special-purpose accelerators to continue improving the performance of their chips. Most of these accelerators target compute-heavy tasks such as graphics, audio/video decoding, or cryptography. We instead focus on a memory-bound task: memory-to-memory copies. These copies make up a significant portion of data center workloads, so improving their performance could lead to large savings in operational cost. To that end, we designed a memory copy accelerator that can move data at high bandwidth within the L2 cache and main memory. Unlike traditional DMA engines, this copy accelerator is virtual-memory-aware and can perform data transfers without any need for page pinning or ahead-of-time page translation. This relieves the operating system developer and the application programmer of much of the programming burden. We compared the performance of this accelerator to memcpy() functions implemented with scalar RISC instructions and with vector instructions. Our evaluation showed that the copy accelerator was significantly faster than the scalar implementation for larger transfers, even when accounting for the overhead of page faults. In addition, the copy accelerator matched the performance of the vector implementation while taking up an order of magnitude less area than the vector coprocessor.



