TY - JOUR
T1 - Beyond VABlock
T2 - Improving Transformer workloads through aggressive prefetching
AU - Rhee, Jane
AU - Choi, Ikyoung
AU - Koo, Gunjae
AU - Oh, Yunho
AU - Yoon, Myung Kuk
N1 - Publisher Copyright:
© 2025
PY - 2025/5
Y1 - 2025/5
N2 - The memory capacity constraint of GPUs is a major challenge in running large deep learning workloads with their ever-increasing memory requirements. To run a large Transformer model with limited GPU memory, programmers need to manually allocate and copy data between the CPU and GPUs. This programming burden is eased by Unified Virtual Memory (UVM), which automatically manages data transfer through its demand paging scheme. However, using UVM can cause performance degradation, especially under memory oversubscription. In this paper, we analyze the memory behavior of inference in large Transformer models using real hardware and the open-source NVIDIA UVM driver. The default Tree-Based Neighborhood (TBN) prefetcher in the UVM driver supports page prefetching within a 2MB virtual address block (VABlock), but it detects locality only within a single VABlock, limiting its effectiveness for large models. Our analysis reveals that this locality extends beyond the VABlock, which the default prefetcher cannot exploit. To address this, we propose a block-aware prefetcher that aggressively prefetches multiple contiguous VABlocks. Our evaluation shows that this approach delivers an average 2.7x performance improvement over the default TBN prefetcher when GPU memory is oversubscribed.
AB - The memory capacity constraint of GPUs is a major challenge in running large deep learning workloads with their ever-increasing memory requirements. To run a large Transformer model with limited GPU memory, programmers need to manually allocate and copy data between the CPU and GPUs. This programming burden is eased by Unified Virtual Memory (UVM), which automatically manages data transfer through its demand paging scheme. However, using UVM can cause performance degradation, especially under memory oversubscription. In this paper, we analyze the memory behavior of inference in large Transformer models using real hardware and the open-source NVIDIA UVM driver. The default Tree-Based Neighborhood (TBN) prefetcher in the UVM driver supports page prefetching within a 2MB virtual address block (VABlock), but it detects locality only within a single VABlock, limiting its effectiveness for large models. Our analysis reveals that this locality extends beyond the VABlock, which the default prefetcher cannot exploit. To address this, we propose a block-aware prefetcher that aggressively prefetches multiple contiguous VABlocks. Our evaluation shows that this approach delivers an average 2.7x performance improvement over the default TBN prefetcher when GPU memory is oversubscribed.
KW - Demand paging
KW - Graphics processing units
KW - Large language models
KW - Memory oversubscription
KW - Prefetching
KW - Real-time analysis
KW - Unified virtual memory
UR - https://www.scopus.com/pages/publications/86000770101
U2 - 10.1016/j.sysarc.2025.103389
DO - 10.1016/j.sysarc.2025.103389
M3 - Article
AN - SCOPUS:86000770101
SN - 1383-7621
VL - 162
JO - Journal of Systems Architecture
JF - Journal of Systems Architecture
M1 - 103389
ER -