TY - GEN
T1 - µLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization
T2 - 14th European Conference on Computer Systems, EuroSys 2019
AU - Kim, Youngsok
AU - Kim, Joonsung
AU - Chae, Dongju
AU - Kim, Daehyun
AU - Kim, Jangwoo
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/3/25
Y1 - 2019/3/25
N2 - Emerging mobile services heavily utilize Neural Networks (NNs) to improve user experiences. Such NN-assisted services depend on fast NN execution for high responsiveness, requiring mobile devices to minimize NN execution latency by efficiently utilizing their underlying hardware resources. To better utilize the resources, existing mobile NN frameworks either employ various CPU-friendly optimizations (e.g., vectorization, quantization) or exploit data parallelism using heterogeneous processors such as GPUs and DSPs. However, their performance is still bounded by that of the single target processor, so real-time services such as voice-driven search often fail to react to user requests in time. This problem will only become more serious with the introduction of more demanding NN-assisted services. In this paper, we propose µLayer, a low-latency on-device inference runtime that significantly improves the latency of NN-assisted services. µLayer accelerates each NN layer by simultaneously utilizing the diverse heterogeneous processors on a mobile device and by performing computations using processor-friendly quantization. Two key findings motivate our work: 1) the existing frameworks are limited by single-processor performance because they execute an NN layer on only a single processor, and 2) the CPU and the GPU on the same mobile device achieve comparable computational throughput, making cooperative acceleration highly promising. First, to accelerate an NN layer using both the CPU and the GPU at the same time, µLayer employs a layer distribution mechanism that completely removes redundant computations between the processors. Next, µLayer optimizes per-processor performance by making the processors use different data types that maximize their utilization. In addition, to minimize potential latency increases due to overly aggressive workload distribution, µLayer selectively increases the distribution granularity to divergent layer paths. Our experiments using representative NNs and mobile devices show that µLayer significantly improves the speed and the energy efficiency of on-device inference by up to 69.6% and 58.1%, respectively, over the state-of-the-art NN execution mechanism.
AB - Emerging mobile services heavily utilize Neural Networks (NNs) to improve user experiences. Such NN-assisted services depend on fast NN execution for high responsiveness, requiring mobile devices to minimize NN execution latency by efficiently utilizing their underlying hardware resources. To better utilize the resources, existing mobile NN frameworks either employ various CPU-friendly optimizations (e.g., vectorization, quantization) or exploit data parallelism using heterogeneous processors such as GPUs and DSPs. However, their performance is still bounded by that of the single target processor, so real-time services such as voice-driven search often fail to react to user requests in time. This problem will only become more serious with the introduction of more demanding NN-assisted services. In this paper, we propose µLayer, a low-latency on-device inference runtime that significantly improves the latency of NN-assisted services. µLayer accelerates each NN layer by simultaneously utilizing the diverse heterogeneous processors on a mobile device and by performing computations using processor-friendly quantization. Two key findings motivate our work: 1) the existing frameworks are limited by single-processor performance because they execute an NN layer on only a single processor, and 2) the CPU and the GPU on the same mobile device achieve comparable computational throughput, making cooperative acceleration highly promising. First, to accelerate an NN layer using both the CPU and the GPU at the same time, µLayer employs a layer distribution mechanism that completely removes redundant computations between the processors. Next, µLayer optimizes per-processor performance by making the processors use different data types that maximize their utilization. In addition, to minimize potential latency increases due to overly aggressive workload distribution, µLayer selectively increases the distribution granularity to divergent layer paths. Our experiments using representative NNs and mobile devices show that µLayer significantly improves the speed and the energy efficiency of on-device inference by up to 69.6% and 58.1%, respectively, over the state-of-the-art NN execution mechanism.
UR - https://www.scopus.com/pages/publications/85063919130
U2 - 10.1145/3302424.3303950
DO - 10.1145/3302424.3303950
M3 - Conference contribution
AN - SCOPUS:85063919130
T3 - Proceedings of the 14th EuroSys Conference 2019
BT - Proceedings of the 14th EuroSys Conference 2019
PB - Association for Computing Machinery, Inc
Y2 - 25 March 2019 through 28 March 2019
ER -