TY - JOUR
T1 - Optimizing read operations of Hadoop distributed file system on heterogeneous storages
AU - Lee, Jongbaeg
AU - Lee, Jongwuk
AU - Lee, Sang Won
N1 - Publisher Copyright: © 2021 Institute of Information Science. All rights reserved.
PY - 2021/5
Y1 - 2021/5
N2 - The key challenge in big data processing frameworks such as the Hadoop distributed file system (HDFS) is to optimize the throughput of read operations. Toward this goal, several studies have sought to enhance read performance on heterogeneous storage. Although HDFS now supports several storage policies for placing data blocks on heterogeneous storage, it fails to fully exploit fast storage devices (e.g., SSDs). The primary reason for this suboptimal read performance is that, when distributing read requests, HDFS considers only the network distance between the client and the datanodes, and therefore directs more read requests to slower storage devices that hold more data (e.g., HDDs). In this paper, we propose a new data retrieval policy for distributing read requests across heterogeneous storage in HDFS. Specifically, the proposed policy considers both the characteristics of the storage devices in the datanodes and the network environment in order to distribute read requests efficiently. To balance these two factors, we develop several policies, including the proposed one: random selection, storage type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results show that the proposed method outperforms the existing policies in throughput by up to six times on extensive benchmark datasets.
AB - The key challenge in big data processing frameworks such as the Hadoop distributed file system (HDFS) is to optimize the throughput of read operations. Toward this goal, several studies have sought to enhance read performance on heterogeneous storage. Although HDFS now supports several storage policies for placing data blocks on heterogeneous storage, it fails to fully exploit fast storage devices (e.g., SSDs). The primary reason for this suboptimal read performance is that, when distributing read requests, HDFS considers only the network distance between the client and the datanodes, and therefore directs more read requests to slower storage devices that hold more data (e.g., HDDs). In this paper, we propose a new data retrieval policy for distributing read requests across heterogeneous storage in HDFS. Specifically, the proposed policy considers both the characteristics of the storage devices in the datanodes and the network environment in order to distribute read requests efficiently. To balance these two factors, we develop several policies, including the proposed one: random selection, storage type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results show that the proposed method outperforms the existing policies in throughput by up to six times on extensive benchmark datasets.
KW - Data retrieval policy
KW - Hadoop distributed file system
KW - Heterogeneous storage
KW - Load balancing
KW - MapReduce
UR - https://www.scopus.com/pages/publications/85105928127
U2 - 10.6688/JISE.20210537(3).0013
DO - 10.6688/JISE.20210537(3).0013
M3 - Article
AN - SCOPUS:85105928127
SN - 1016-2364
VL - 37
SP - 709
EP - 729
JO - Journal of Information Science and Engineering
JF - Journal of Information Science and Engineering
IS - 3
ER -