TY - GEN
T1 - FineReg
T2 - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
AU - Oh, Yunho
AU - Yoon, Myung Kuk
AU - Song, William J.
AU - Ro, Won Woo
N1 - Funding Information:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2018R1A2A2A05018941), and by the Technology Innovation Program (No. 10080674, Development of Reconfigurable Artificial Neural Network Accelerator and Instruction Set Architecture) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and Korea Semiconductor Research Consortium (KSRC) support program for the development of the future semiconductor device. Won Woo Ro and William J. Song are the co-corresponding authors.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/12/12
Y1 - 2018/12/12
N2 - Graphics processing units (GPUs) include a large amount of hardware resources for parallel thread execution. However, these resources are not fully utilized at runtime, and observed throughput often falls far below peak performance. A major cause is that GPUs cannot deploy enough warps at runtime. The limited size of the register file constrains the number of cooperative thread arrays (CTAs), as one CTA takes up a few tens of kilobytes of registers. We observe that the actual working set size of a CTA is generally much smaller, and therefore there is room for additional CTAs to run. In this paper, we propose a novel GPU architecture called FineReg that improves overall throughput by increasing the number of concurrent CTAs. In particular, FineReg splits the monolithic register file into two regions, one for active CTAs and another for pending CTAs. Using FineReg, the GPU begins normal execution by allocating all registers required by active CTAs. If all warps of a CTA become stalled, FineReg moves the live registers (i.e., working set) of the CTA to the pending-CTA region and launches an additional CTA by assigning registers to the newly activated CTA. If the registers of either the active-CTA or pending-CTA region are used up, FineReg stops introducing additional CTAs and simply performs context switching between active and pending CTAs. Thus, FineReg increases the number of concurrent CTAs by reducing the effective size of per-CTA registers. Experimental results show that FineReg achieves a 32.8% performance improvement over a conventional GPU architecture.
AB - Graphics processing units (GPUs) include a large amount of hardware resources for parallel thread execution. However, these resources are not fully utilized at runtime, and observed throughput often falls far below peak performance. A major cause is that GPUs cannot deploy enough warps at runtime. The limited size of the register file constrains the number of cooperative thread arrays (CTAs), as one CTA takes up a few tens of kilobytes of registers. We observe that the actual working set size of a CTA is generally much smaller, and therefore there is room for additional CTAs to run. In this paper, we propose a novel GPU architecture called FineReg that improves overall throughput by increasing the number of concurrent CTAs. In particular, FineReg splits the monolithic register file into two regions, one for active CTAs and another for pending CTAs. Using FineReg, the GPU begins normal execution by allocating all registers required by active CTAs. If all warps of a CTA become stalled, FineReg moves the live registers (i.e., working set) of the CTA to the pending-CTA region and launches an additional CTA by assigning registers to the newly activated CTA. If the registers of either the active-CTA or pending-CTA region are used up, FineReg stops introducing additional CTAs and simply performs context switching between active and pending CTAs. Thus, FineReg increases the number of concurrent CTAs by reducing the effective size of per-CTA registers. Experimental results show that FineReg achieves a 32.8% performance improvement over a conventional GPU architecture.
KW - GPU
KW - Performance
KW - Register File
KW - Thread-Level Parallelism
UR - http://www.scopus.com/inward/record.url?scp=85060022541&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2018.00037
DO - 10.1109/MICRO.2018.00037
M3 - Conference contribution
AN - SCOPUS:85060022541
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 364
EP - 376
BT - Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
PB - IEEE Computer Society
Y2 - 20 October 2018 through 24 October 2018
ER -