A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters

Seungmin Oh, Kyeonglok Kim, Euiseong Seo

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

The amount of available resources in a cloud is constantly changing. However, current distributed DNN frameworks do not allow dynamic scaling of a training cluster. Therefore, a cloud-based training cluster cannot flexibly scale in response to dynamically changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors by eavesdropping on gradient exchanges before it actually participates in the training operation. Our evaluation showed that the proposed approach reduced the scaling overhead by 13% in comparison to the conventional checkpoint-restore approach, and revealed possibilities for further improvement.
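The eavesdropping idea in the abstract can be illustrated with a toy, single-process simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes plain SGD, assumes the joining node receives one weight snapshot when it starts listening, and stands in for the gradient exchange with a randomly generated aggregated gradient. The names `sgd_step` and `simulate_join` are hypothetical.

```python
import random


def sgd_step(weights, grad, lr):
    """Apply one plain SGD update (assumed optimizer, for illustration only)."""
    return [w - lr * g for w, g in zip(weights, grad)]


def simulate_join(num_steps=20, join_step=5, dim=4, lr=0.1, seed=0):
    """Toy model of a node catching up by listening to gradient exchanges.

    Assumption: the joiner copies the cluster weights once at `join_step`,
    then applies every aggregated gradient it overhears from that step on.
    """
    rng = random.Random(seed)
    cluster_w = [rng.gauss(0, 1) for _ in range(dim)]
    joiner_w = None  # the new node has no weights until it starts listening

    for step in range(num_steps):
        # Stand-in for the aggregated gradient that active workers exchange.
        grad = [rng.gauss(0, 1) for _ in range(dim)]

        if step == join_step:
            # One-time snapshot taken before this step's update is applied.
            joiner_w = list(cluster_w)

        cluster_w = sgd_step(cluster_w, grad, lr)
        if joiner_w is not None:
            # The joiner applies the overheard gradient without contributing one.
            joiner_w = sgd_step(joiner_w, grad, lr)

    return cluster_w, joiner_w


if __name__ == "__main__":
    cluster_w, joiner_w = simulate_join()
    print("in sync:", cluster_w == joiner_w)
```

In this sketch the joiner's weights track the cluster's exactly because both apply the same updates in the same order, so the new node could begin contributing gradients at any later step without pausing training for a checkpoint and restore.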

Original language: English
Title of host publication: Proceedings - 2020 IEEE International Conference on Smart Cloud, SmartCloud 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 165-168
Number of pages: 4
ISBN (Electronic): 9781728165479
State: Published - Nov 2020
Event: 5th IEEE International Conference on Smart Cloud, SmartCloud 2020 - Washington, United States
Duration: 6 Nov 2020 to 8 Nov 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Smart Cloud, SmartCloud 2020

Conference

Conference: 5th IEEE International Conference on Smart Cloud, SmartCloud 2020
Country/Territory: United States
City: Washington
Period: 6/11/20 to 8/11/20

Keywords

  • cloud computing
  • deep neural network
  • distributed training
  • GPU computing
  • training clusters
