Abstract
Recent GPUs provisioned with large register files (RFs) cannot fully utilize the bandwidth between the RFs and the execution pipelines, because the current policy for allocating operand (OP) collectors defers RF accesses until all source OPs are ready. To tackle this issue, this letter introduces a new OP collector allocation mechanism called Triple-A. Triple-A comprises four key operations. First, Triple-A proactively allocates an OP collector (OC) to a warp instruction even if one of its source OPs is not yet ready, taking advantage of GPUs' in-order execution. Second, a computation result can be forwarded directly along a data dependence to an early-allocated OC, reducing the time spent loading OPs from the RFs. Third, Triple-A bypasses the RF write operation if the forwarded data is not consumed by any other instruction. Finally, early allocation is further enhanced with a latency-aware optimization that alleviates the potential performance degradation caused by allocating OCs aggressively. Together, these techniques synergistically improve register bank utilization, yielding a 14.1% improvement in performance and an 11.8% reduction in RF energy consumption compared to state-of-the-art GPUs.
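The first three operations in the abstract can be illustrated with a minimal sketch. This is a toy model written for this record, not the letter's implementation: `WarpInstr`, `baseline_can_allocate`, `early_can_allocate`, and `can_bypass_rf_write` are hypothetical names, registers are plain strings, and the latency-aware optimization is not modeled. The bypass check is deliberately conservative and assumes a value that reaches the end of the toy program may still be live.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class WarpInstr:
    """Hypothetical warp instruction: one destination and its source registers."""
    dst: Optional[str]
    srcs: Tuple[str, ...]


def pending_sources(instr: WarpInstr, ready: Set[str]) -> list:
    """Source registers whose values are not yet available in the RF."""
    return [r for r in instr.srcs if r not in ready]


def baseline_can_allocate(instr: WarpInstr, ready: Set[str]) -> bool:
    """Baseline policy from the abstract: wait until all source OPs are ready."""
    return not pending_sources(instr, ready)


def early_can_allocate(instr: WarpInstr, ready: Set[str]) -> bool:
    """Early allocation: tolerate one pending source OP, which is assumed
    to arrive later by forwarding from its producer."""
    return len(pending_sources(instr, ready)) <= 1


def can_bypass_rf_write(program: Sequence[WarpInstr],
                        producer_idx: int, consumer_idx: int) -> bool:
    """True if the producer's result is read only by the forwarded consumer
    before the register is overwritten (assumption: straight-line code)."""
    reg = program[producer_idx].dst
    if reg is None:
        return False
    for i in range(producer_idx + 1, len(program)):
        instr = program[i]
        if i != consumer_idx and reg in instr.srcs:
            return False          # another instruction still needs the value
        if instr.dst == reg:
            return True           # overwritten before any other read
    return False                  # conservative: value may be live at the end


# Toy program: r2 is produced by instruction 0 and is still in flight.
program = [
    WarpInstr("r2", ("r0", "r1")),   # 0: r2 = r0 + r1
    WarpInstr("r4", ("r2", "r3")),   # 1: r4 = r2 * r3  (consumes forwarded r2)
    WarpInstr("r2", ("r0", "r0")),   # 2: r2 = r0 * r0  (overwrites r2)
    WarpInstr("r5", ("r4", "r2")),   # 3: r5 = r4 + r2
]
ready = {"r0", "r1", "r3"}

print(baseline_can_allocate(program[1], ready))   # False
print(early_can_allocate(program[1], ready))      # True
print(can_bypass_rf_write(program, 0, 1))         # True
```

In this toy program, instruction 1 has one pending source (`r2`), so the baseline check refuses an OC while the early check grants one. The first `r2` value is read only by instruction 1 before instruction 2 overwrites it, so its RF write could be skipped under the stated assumptions.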
Original language | English |
---|---|
Pages (from-to) | 206-209 |
Number of pages | 4 |
Journal | IEEE Embedded Systems Letters |
Volume | 16 |
Issue number | 2 |
DOIs | |
State | Published - 1 Jun 2024 |
Bibliographical note
Publisher Copyright: © 2009-2012 IEEE.
Keywords
- Data forwarding
- graphics processing units (GPUs)
- operand collector (OC)
- register files (RFs)