Abstract
This paper introduces a novel software technique to optimize thread allocation for merged and fused kernels in multi-tenant inference systems on embedded graphics processing units (GPUs). Embedded systems equipped with GPUs face challenges in managing diverse deep learning workloads while meeting Quality-of-Service (QoS) requirements, primarily due to limited hardware resources and the varied nature of deep learning models. Prior work has relied on static thread allocation strategies, often leading to suboptimal hardware utilization. To address these challenges, we propose a new software technique called TLP Balancer. TLP Balancer automatically identifies the best-performing number of threads based on performance modeling. This approach significantly improves hardware utilization and ensures QoS compliance, outperforming traditional fixed-thread allocation methods. Our evaluation shows that TLP Balancer improves throughput by 40% compared to state-of-the-art automated kernel merge and fusion techniques.
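The abstract does not spell out the performance model, so the following is only a minimal sketch of the general idea: search candidate thread counts, predict latency for each with a model, and keep the count that maximizes throughput while staying inside a QoS latency budget. The `predict_latency_ms` stub, the QoS budget, the candidate range, and the workload names are all illustrative assumptions, not the paper's actual model or interface.

```python
# Hypothetical sketch of performance-model-driven thread allocation.
# predict_latency_ms() stands in for the paper's (unspecified)
# performance model; the budget and candidates are assumptions.

QOS_BUDGET_MS = 10.0  # assumed per-request latency budget


def predict_latency_ms(threads: int, workload: str) -> float:
    """Stub model: latency falls with parallelism until contention
    dominates. A real system would use a measured/learned model."""
    base = {"resnet": 18.0, "bert": 30.0}[workload]
    contention = 0.002 * threads  # oversubscription penalty
    return base / max(threads, 1) ** 0.5 + contention


def best_thread_count(workload: str, candidates=range(32, 1025, 32)) -> int:
    """Pick the thread count with the lowest predicted latency
    (i.e., highest throughput) that still meets the QoS budget."""
    feasible = [
        (t, predict_latency_ms(t, workload))
        for t in candidates
        if predict_latency_ms(t, workload) <= QOS_BUDGET_MS
    ]
    if not feasible:
        raise RuntimeError("no candidate thread count meets the QoS budget")
    return min(feasible, key=lambda tl: tl[1])[0]


if __name__ == "__main__":
    for model in ("resnet", "bert"):
        t = best_thread_count(model)
        print(f"{model}: {t} threads, "
              f"predicted {predict_latency_ms(t, model):.2f} ms")
```

An exhaustive search is used here only because the candidate set is small; the paper's automated approach may prune or model this space differently.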
| Original language | English |
| --- | --- |
| Journal | IEEE Embedded Systems Letters |
| DOIs | |
| State | Accepted/In press - 2024 |
Keywords
- Embedded GPU
- inference
- multi-tenancy