TLP Balancer: Predictive Thread Allocation for Multi-Tenant Inference in Embedded GPUs

Minseong Gil, Jaebeom Jeon, Junsu Kim, Sangun Choi, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

Research output: Contribution to journal › Article › peer-review

Abstract

This paper introduces a novel software technique to optimize thread allocation for merged and fused kernels in multi-tenant inference systems on embedded Graphics Processing Units (GPUs). Embedded systems equipped with GPUs face challenges in managing diverse deep learning workloads while adhering to Quality-of-Service (QoS) standards, primarily due to limited hardware resources and the varied nature of deep learning models. Prior work has relied on static thread allocation strategies, often leading to suboptimal hardware utilization. To address these challenges, we propose a new software technique called TLP Balancer. TLP Balancer automatically identifies the best-performing number of threads based on performance modeling. This approach significantly enhances hardware utilization and ensures QoS compliance, outperforming traditional fixed-thread allocation methods. Our evaluation shows that TLP Balancer improves throughput by 40% compared to the state-of-the-art automated kernel merge and fusion techniques.
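The core idea described above, predicting which thread count performs best rather than fixing it statically, can be illustrated with a minimal sketch. This is not the authors' implementation: the performance model, the candidate thread counts, and the QoS-deadline check below are all illustrative assumptions standing in for the paper's actual modeling.

```python
# Illustrative sketch (not TLP Balancer itself): pick the thread count
# whose modeled throughput is highest among allocations that still meet
# a tenant's QoS latency deadline. The cost model is a toy assumption.

def predict_latency_ms(threads, work_items,
                       per_thread_cost_ms=0.05, overhead_ms=0.2):
    # Toy model: latency shrinks as work spreads over more threads,
    # but each extra thread adds a fixed scheduling overhead.
    return work_items * per_thread_cost_ms / threads + overhead_ms * threads

def best_thread_count(work_items, qos_deadline_ms,
                      candidates=range(32, 1025, 32)):
    # Keep only thread counts whose predicted latency meets the deadline.
    feasible = [t for t in candidates
                if predict_latency_ms(t, work_items) <= qos_deadline_ms]
    if not feasible:
        return None  # no allocation satisfies the QoS target
    # Throughput = work per unit latency; maximize it over feasible counts.
    return max(feasible,
               key=lambda t: work_items / predict_latency_ms(t, work_items))
```

Under this toy model, a workload of 10,000 items with a 25 ms deadline selects 64 threads: fewer threads leave work serialized, while more threads pay overhead that outweighs the added parallelism.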

Original language: English
Journal: IEEE Embedded Systems Letters
State: Accepted/In press - 2024

Bibliographical note

Publisher Copyright:
© 2009-2012 IEEE.

Keywords

  • Embedded GPU
  • inference
  • multi-tenancy
