Evolutionary Computation-Based Scheduling of Machine Learning Workloads for GPU Clusters

Seokmin Kwon, Hyokyung Bahn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Machine learning (ML) workloads from diverse industries such as smart logistics, finance, and entertainment are increasingly executed on cloud platforms. Scheduling these workloads efficiently is challenging because various types of workloads coexist and the clusters feature heterogeneous GPU resources. Although task scheduling has been studied extensively, traditional scheduling policies perform poorly in such environments because they cause resource fragmentation, which significantly lowers GPU utilization. To address this issue, this paper proposes a new scheduling approach based on evolutionary computation techniques and implements it within a process-based event simulation framework. Experimental results, obtained by replaying extensive ML task traces collected from Alibaba's MLaaS cluster, demonstrate that the proposed approach significantly improves GPU utilization compared to conventional scheduling policies. The authors anticipate that the proposed scheduling policies will be used effectively for resource allocation of ML workloads in future GPU cluster systems.
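The abstract does not specify the encoding, genetic operators, or fitness function used in the paper. As a rough illustration of how evolutionary computation can be applied to fragmentation-aware GPU placement, the following is a minimal sketch of a generic genetic algorithm, not the authors' method. It assumes each task requests a fraction of one GPU and that a good placement uses as few GPUs as possible without overloading any of them. All names (`fitness`, `evolve`, `demands`, `capacities`) and parameter values are hypothetical.

```python
import random


def fitness(assignment, demands, capacities):
    """Score a placement; higher is better (assumed objective, not the paper's).

    assignment[i] is the GPU index chosen for task i.
    Overloaded GPUs are penalized heavily; each GPU in use costs 1,
    so tightly packed placements (less fragmentation) score higher.
    """
    load = [0.0] * len(capacities)
    for task, gpu in enumerate(assignment):
        load[gpu] += demands[task]
    overload = sum(max(0.0, l - c) for l, c in zip(load, capacities))
    used = sum(1 for l in load if l > 0)
    return -(100.0 * overload + used)


def evolve(demands, capacities, pop_size=40, generations=200,
           mutation_rate=0.1, seed=0):
    """Generic genetic algorithm: elitism, one-point crossover, random mutation.

    Requires at least two tasks and pop_size >= 8.
    """
    rng = random.Random(seed)
    n_tasks, n_gpus = len(demands), len(capacities)

    def score(individual):
        return fitness(individual, demands, capacities)

    # Start from random task-to-GPU assignments.
    population = [[rng.randrange(n_gpus) for _ in range(n_tasks)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        # Keep the best quarter unchanged (elitism).
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 4]

        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_tasks)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: occasionally move a task to a random GPU.
            child = [rng.randrange(n_gpus) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children

    return max(population, key=score)
```

For example, `evolve([0.5, 0.5, 0.25, 0.75, 1.0], [1.0] * 4)` returns a list of five GPU indices, one per task. A real scheduler would also need to handle task arrivals over time, heterogeneous GPU types, and CPU and memory constraints, none of which this sketch models.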

Original language: English
Title of host publication: Proceedings - 2024 International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 697-701
Number of pages: 5
ISBN (Electronic): 9798350355253
State: Published - 2024
Event: 5th International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024 - Dalian, China
Duration: 16 Aug 2024 → 18 Aug 2024

Publication series

Name: Proceedings - 2024 International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024

Conference

Conference: 5th International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024
Country/Territory: China
City: Dalian
Period: 16/08/24 → 18/08/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • cloud
  • evolutionary computation
  • GPU
  • machine learning
  • task scheduling
