TY - GEN
T1 - Compiler-assisted GPU thread throttling for reduced cache contention
AU - Kim, Hyunjun
AU - Hong, Sungin
AU - Lee, Hyeonsu
AU - Seo, Euiseong
AU - Han, Hwansoo
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/8/5
Y1 - 2019/8/5
N2 - Modern GPUs concurrently deploy thousands of threads to maximize thread-level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited data cache capacity. In this paper, we propose a compiler-assisted thread throttling scheme, which limits the number of active thread groups to reduce cache contention and consequently improve performance. A few dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring the cache behavior, but they often fail to provide timely responses to dynamic changes in the cache behavior, as they adjust the parallelism only after observing the monitored behavior. Our thread throttling scheme relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of the original programs by 42.96% on average, an 8.97% performance boost over static thread throttling schemes.
AB - Modern GPUs concurrently deploy thousands of threads to maximize thread-level parallelism (TLP) for performance. For some applications, however, maximized TLP leads to significant performance degradation, as many concurrent threads compete for the limited data cache capacity. In this paper, we propose a compiler-assisted thread throttling scheme, which limits the number of active thread groups to reduce cache contention and consequently improve performance. A few dynamic thread throttling schemes have been proposed to alleviate cache contention by monitoring the cache behavior, but they often fail to provide timely responses to dynamic changes in the cache behavior, as they adjust the parallelism only after observing the monitored behavior. Our thread throttling scheme relies on compile-time adjustment of active thread groups to fit their memory footprints to the L1D capacity. We evaluated the proposed scheme with GPU programs that suffer from cache contention. Our approach improved the performance of the original programs by 42.96% on average, an 8.97% performance boost over static thread throttling schemes.
KW - Cache Contention
KW - GPGPU
KW - Static Analysis
KW - Thread Throttling
UR - https://www.scopus.com/pages/publications/85071094541
U2 - 10.1145/3337821.3337886
DO - 10.1145/3337821.3337886
M3 - Conference contribution
AN - SCOPUS:85071094541
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
PB - Association for Computing Machinery
T2 - 48th International Conference on Parallel Processing, ICPP 2019
Y2 - 5 August 2019 through 8 August 2019
ER -