Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs

Jaebeom Jeon, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

This letter proposes a new scheme that improves throughput and reduces queuing delay when running multiple inferences on embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups those running the same model. It then creates kernels that iteratively process the grouped requests; we call this technique kernel merging. After that, the proposed technique performs kernel fusion on the merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merging and fusion achieve 2.7× higher throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.
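As a concrete illustration of the merging step, the sketch below groups a fixed window of queued requests by model and launches one kernel per group that loops over the group's buffers, so the GPU scheduler sees a single launch instead of one concurrent kernel per request. This is a minimal sketch under assumptions of ours: the Request struct, the ReLU stand-in layer, and the buffer handling are all hypothetical, and the letter's actual implementation and its fusion step are not shown.

```cuda
// merged_launch.cu -- hypothetical sketch of kernel merging for queued
// inference requests. All names here are illustrative, not the authors' API.
#include <cuda_runtime.h>
#include <map>
#include <vector>

// A queued inference request: target model and its device buffers (assumed).
struct Request {
    int model_id;
    const float* d_in;
    float* d_out;
};

// One layer's work for a single request; a ReLU stands in for a real layer.
__device__ void relu_layer(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

// Merged kernel: one launch iterates over every grouped request, so the GPU
// runs a single kernel instead of one concurrent kernel per request.
__global__ void merged_relu_layer(const float* const* ins, float* const* outs,
                                  int n, int num_requests) {
    for (int r = 0; r < num_requests; ++r)
        relu_layer(ins[r], outs[r], n);
}

// Monitor a fixed window of requests, group them by model, and launch one
// merged kernel per group (the subsequent fusion step is not shown).
void dispatch_window(const std::vector<Request>& window, int n) {
    std::map<int, std::vector<Request>> groups;
    for (const Request& r : window) groups[r.model_id].push_back(r);

    for (const auto& [model_id, reqs] : groups) {
        // Gather per-request buffer pointers and copy them to the device.
        std::vector<const float*> h_ins;
        std::vector<float*> h_outs;
        for (const Request& r : reqs) {
            h_ins.push_back(r.d_in);
            h_outs.push_back(r.d_out);
        }
        const float** d_ins = nullptr;
        float** d_outs = nullptr;
        cudaMalloc(&d_ins, h_ins.size() * sizeof(const float*));
        cudaMalloc(&d_outs, h_outs.size() * sizeof(float*));
        cudaMemcpy(d_ins, h_ins.data(), h_ins.size() * sizeof(const float*),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(d_outs, h_outs.data(), h_outs.size() * sizeof(float*),
                   cudaMemcpyHostToDevice);

        int threads = 256, blocks = (n + threads - 1) / threads;
        merged_relu_layer<<<blocks, threads>>>(d_ins, d_outs, n,
                                               (int)reqs.size());
        cudaDeviceSynchronize();  // wait before freeing the pointer arrays
        cudaFree(d_ins);
        cudaFree(d_outs);
    }
}
```

A real implementation would presumably go further than this sketch: fusing adjacent merged kernels (e.g., consecutive layers) into one, as the letter describes, and overlapping independent model groups with CUDA streams rather than synchronizing per group.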

Original language: English
Pages (from-to): 421-424
Number of pages: 4
Journal: IEEE Embedded Systems Letters
Volume: 16
Issue number: 4
DOIs
State: Published - 2024

Bibliographical note

Publisher Copyright:
© 2009-2012 IEEE.

Keywords

  • Embedded graphics processing unit (GPU)
  • inference
  • multitenancy
