TY - GEN
T1 - A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters
AU - Oh, Seungmin
AU - Kim, Kyeonglok
AU - Seo, Euiseong
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - The amount of available resources of a cloud is constantly changing. However, the current distributed DNN framework does not allow dynamic scaling of a training cluster. Therefore, a cloud-based training cluster cannot flexibly scale in response to the dynamically changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors through eavesdropping gradient exchanges before it actually participates in the training operation. Our evaluation showed that the proposed approach reduced the scaling overhead by 13% in comparison to the conventional checkpoint-restore approach, and revealed the possibilities of further improvement.
AB - The amount of available resources of a cloud is constantly changing. However, the current distributed DNN framework does not allow dynamic scaling of a training cluster. Therefore, a cloud-based training cluster cannot flexibly scale in response to the dynamically changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors through eavesdropping gradient exchanges before it actually participates in the training operation. Our evaluation showed that the proposed approach reduced the scaling overhead by 13% in comparison to the conventional checkpoint-restore approach, and revealed the possibilities of further improvement.
KW - cloud computing
KW - deep neural network
KW - distributed training
KW - GPU computing
KW - training clusters
UR - https://www.scopus.com/pages/publications/85098527303
U2 - 10.1109/SmartCloud49737.2020.00039
DO - 10.1109/SmartCloud49737.2020.00039
M3 - Conference contribution
AN - SCOPUS:85098527303
T3 - Proceedings - 2020 IEEE International Conference on Smart Cloud, SmartCloud 2020
SP - 165
EP - 168
BT - Proceedings - 2020 IEEE International Conference on Smart Cloud, SmartCloud 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Smart Cloud, SmartCloud 2020
Y2 - 6 November 2020 through 8 November 2020
ER -