Modern GPGPUs (General-Purpose Graphics Processing Units) can execute thousands of threads simultaneously. In real systems, however, GPGPU resource utilization is limited because it is difficult to balance the load across SMs (Streaming Multiprocessors) when scheduling thread blocks, the basic unit of resource allocation in a GPGPU. The current hardware scheduler assigns thread blocks to SMs in Round-Robin order. Although this policy is simple and easy to implement, we show that Round-Robin is inefficient when thread blocks from heterogeneous workloads are mixed. In such environments, efficient resource sharing is challenging because workloads have different resource usage patterns, yet scheduling decisions must be made instantly. In this paper, we present a new thread block scheduling algorithm that analyzes the load of SMs and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem and aim to minimize the internal fragmentation of SMs by planning, in advance, a size-aware placement of thread blocks across all SMs. To this end, we maintain multiple queues for incoming thread blocks according to their sizes and schedule them while considering the load balance among SMs. Experimental results under a wide range of workload conditions show that the proposed algorithm improves GPGPU performance by 24.8% on average compared with the Round-Robin scheduler.
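The fragmentation argument can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: it collapses each SM's resources into one scalar capacity (`SM_CAPACITY`), places a single batch of blocks with no completions over time, and the size-class bounds, the largest-class-first draining order and the "most free SM" placement rule are hypothetical choices made only to show how size-aware queues can reduce internal fragmentation compared with Round-Robin placement.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumption: each SM exposes a single scalar resource budget in abstract
# "units" (real SMs track registers, shared memory, threads and block slots).
SM_CAPACITY = 8


@dataclass
class SM:
    free: int = SM_CAPACITY
    blocks: list = field(default_factory=list)

    def fits(self, size: int) -> bool:
        return size <= self.free

    def place(self, size: int) -> None:
        self.free -= size
        self.blocks.append(size)


def round_robin(sizes, num_sms):
    """Visit SMs cyclically and give each block, in arrival order, to the
    next SM that can hold it; blocks that fit nowhere stay pending."""
    sms, pending, nxt = [SM() for _ in range(num_sms)], [], 0
    for size in sizes:
        for step in range(num_sms):
            idx = (nxt + step) % num_sms
            if sms[idx].fits(size):
                sms[idx].place(size)
                nxt = (idx + 1) % num_sms
                break
        else:
            pending.append(size)
    return sms, pending


def size_aware(sizes, num_sms, class_bounds=(2, 4)):
    """Hypothetical size-aware packer: bucket blocks into per-size FIFO
    queues, drain the largest size class first, and place each block on the
    SM with the most free capacity that can still hold it."""
    queues = [deque() for _ in range(len(class_bounds) + 1)]
    for size in sizes:
        # Class index = number of bounds the size exceeds (0 = smallest).
        queues[sum(size > b for b in class_bounds)].append(size)
    sms, pending = [SM() for _ in range(num_sms)], []
    for queue in reversed(queues):  # largest size class first
        while queue:
            size = queue.popleft()
            candidates = [sm for sm in sms if sm.fits(size)]
            if candidates:
                # Load balancing: prefer the least-loaded SM that fits.
                max(candidates, key=lambda sm: sm.free).place(size)
            else:
                pending.append(size)
    return sms, pending


def utilization(sms):
    used = sum(SM_CAPACITY - sm.free for sm in sms)
    return used / (SM_CAPACITY * len(sms))


if __name__ == "__main__":
    arrivals = [3, 3, 3, 3, 5, 5]  # made-up mix of thread-block sizes
    for name, policy in (("round-robin", round_robin), ("size-aware", size_aware)):
        sms, pending = policy(arrivals, num_sms=2)
        print(f"{name}: utilization={utilization(sms):.2f}, pending={pending}")
```

With two SMs of 8 units and the arrival sequence `[3, 3, 3, 3, 5, 5]`, Round-Robin spreads the four 3-unit blocks over both SMs and leaves 2 free units on each that no 5-unit block can use (utilization 0.75, both 5-unit blocks pending). Draining the large queue first pairs each 5-unit block with a 3-unit block and fills both SMs (utilization 1.00). A real scheduler would additionally have to handle blocks finishing over time and several resource dimensions at once.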
| Field | Value |
|---|---|
| Title of host publication | 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2021 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| State | Published - 2021 |
| Event | 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2021, Brisbane, Australia |
| Duration | 8 Dec 2021 → 10 Dec 2021 |
Bibliographical note: funding information
Acknowledgment: This work was supported by the ICT R&D program of MSIP/IITP under projects 2018-0-00549 (Extremely scalable order preserving OS for manycore and non-volatile memory) and 2019-0-00074 (Developing system software technologies for emerging new memory that adaptively learn workload characteristics). Hyokyung Bahn is the corresponding author of this paper.
© IEEE 2022.
Keywords:
- load balancing
- resource utilization
- thread block scheduler