Abstract
Machine learning (ML) workloads from diverse industries such as smart logistics, finance, and entertainment are increasingly executed on cloud platforms. Scheduling these workloads efficiently is challenging because various types of workloads coexist and the cluster systems feature heterogeneous GPU resources. Although task scheduling has been studied extensively, traditional scheduling policies perform poorly in such environments because they cause resource fragmentation, which significantly lowers GPU utilization. To address this issue, this paper proposes a new scheduling approach based on evolutionary computation techniques and implements it within a process-based event simulation framework. Experimental results, obtained by replaying extensive ML task traces collected from Alibaba's MLaaS cluster, demonstrate that the proposed scheduling approach significantly improves GPU utilization compared to conventional scheduling policies. The scheduling policies proposed in this paper are expected to be effective for allocating resources to ML workloads in future GPU cluster systems.
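The abstract does not include implementation details, so the following is only a minimal sketch of the general idea it describes: an evolutionary search over task-to-node placements on a heterogeneous GPU cluster, with a fitness function that rewards utilization and penalizes fragmentation. All node capacities, task demands, operator names, and GA parameters below are assumed for illustration and do not come from the paper; the paper's actual evaluation replays Alibaba MLaaS traces in a process-based event simulator rather than scoring a static placement as done here.

```python
# Hypothetical sketch (not the paper's implementation): a genetic algorithm that
# assigns ML tasks to heterogeneous GPU nodes, favoring high utilization and
# penalizing fragmentation (partially occupied nodes) and over-subscription.
import random

random.seed(0)

NODE_GPUS = [8, 8, 4, 4, 2]            # assumed GPUs per node
TASK_GPUS = [1, 2, 4, 1, 8, 2, 2, 1, 4, 1]  # assumed per-task GPU demand
POP_SIZE, GENERATIONS, MUTATION_RATE = 40, 200, 0.1


def fitness(assignment):
    """Score a task->node assignment: utilization minus fragmentation/overflow penalties."""
    used = [0] * len(NODE_GPUS)
    for task, node in enumerate(assignment):
        used[node] += TASK_GPUS[task]
    overflow = sum(max(0, u - cap) for u, cap in zip(used, NODE_GPUS))
    utilization = sum(min(u, cap) for u, cap in zip(used, NODE_GPUS)) / sum(NODE_GPUS)
    fragmented = sum(1 for u, cap in zip(used, NODE_GPUS) if 0 < u < cap)
    return utilization - 0.05 * fragmented - 1.0 * overflow


def crossover(a, b):
    """One-point crossover of two parent assignments."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(assignment):
    """Reassign each task to a random node with small probability."""
    return [random.randrange(len(NODE_GPUS)) if random.random() < MUTATION_RATE else node
            for node in assignment]


def evolve():
    """Evolve a population of placements and return the best one found."""
    population = [[random.randrange(len(NODE_GPUS)) for _ in TASK_GPUS]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        elite = population[: POP_SIZE // 4]   # keep the best quarter as parents
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(POP_SIZE - len(elite))]
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("best assignment:", best, "fitness:", round(fitness(best), 3))
```

In the setting described by the abstract, the fitness of a candidate schedule would presumably be evaluated by replaying trace events in the process-based event simulator, rather than by the static packing score used in this sketch.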
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2024 International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 697-701 |
| Number of pages | 5 |
| ISBN (Electronic) | 9798350355253 |
| DOIs | |
| State | Published - 2024 |
| Event | 5th International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024 - Dalian, China. Duration: 16 Aug 2024 → 18 Aug 2024 |
Publication series
| Name | Proceedings - 2024 International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024 |
|---|---|
Conference
| Conference | 5th International Conference on Advances in Electrical Engineering and Computer Applications, AEECA 2024 |
|---|---|
| Country/Territory | China |
| City | Dalian |
| Period | 16/08/24 → 18/08/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- cloud
- evolutionary computation
- GPU
- machine learning
- task scheduling