This paper presents BOOM version 2, an updated version of the Berkeley Out-of-Order Machine. The design exploration was performed through synthesis, place and route using the foundry-provided standard-cell library and the memory compiler in the TSMC 28 nm HPM process (high performance mobile).

BOOM is an open-source processor that implements the RV64G RISC-V Instruction Set Architecture (ISA). Like most contemporary high-performance cores, BOOM is superscalar (able to execute multiple instructions per cycle) and out-of-order (able to execute instructions as their dependencies are resolved and not restricted to their program order). BOOM is implemented as a parameterizable generator written using the Chisel hardware construction language that can used to generate synthesizable implementations targeting both FPGAs and ASICs.

BOOMv2 is an update in which the design effort has been informed by analysis of synthesized, placed and routed data provided by a contemporary industrial tool flow. We also had access to standard single- and dual-ported memory compilers provided by the foundry, allowing us to explore design trade-offs using different SRAM memories and comparing against synthesized flip-flop arrays. The main distinguishing features of BOOMv2 include an updated 3-stage front-end design with a bigger set-associative Branch Target Buffer (BTB); a pipelined register rename stage; split floating point and integer register files; a dedicated floating point pipeline; separate issue windows for floating point, integer, and memory micro-operations; and separate stages for issue-select and register read.

Managing the complexity of the register file was the largest obstacle to improving BOOM's clock frequency. We spent considerable effort on placing-and-routing a semi-custom 9-port register file to explore the potential improvements over a fully synthesized design, in conjunction with microarchitectural techniques to reduce the size and port count of the register file. BOOMv2 has a 37 fanout-of-four (FO4) inverter delay after synthesis and 50 FO4 after place-and-route, a 24% reduction from BOOMv1's 65 FO4 after place-and-route. Unfortunately, instruction per cycle (IPC) performance drops up to 20%, mostly due to the extra latency between load instructions and dependent instructions. However, the new BOOMv2 physical design paves the way for IPC recovery later.




Download Full History