TY - GEN
T1 - BertLoc
T2 - 36th Annual ACM Symposium on Applied Computing, SAC 2021
AU - Park, Sujin
AU - Lee, Sangwon
AU - Woo, Simon S.
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/3/22
Y1 - 2021/3/22
AB - Due to a significant increase in the number of location services, as well as services that rely on location information such as real-time maps, there is an enormous need to provide accurate location information to end users. To acquire location records, users or other systems generally issue a location query to a location search engine, which returns the best matching results. However, location datasets often contain inconsistency, noise, and ambiguity. In particular, the same location is frequently recorded under different names by different data sources, which can not only confuse users but also introduce inaccurate results. Therefore, detecting duplicate location information in a large database and accurately merging the duplicates into a single location record are critical. In this work, we propose BertLoc, a novel deep learning-based architecture that detects duplicate locations represented in different ways (e.g., Cafe vs. Coffee House) and effectively merges them into a single, consistent location record. BertLoc is based on the Multilingual BERT model followed by a BiLSTM and a CNN, and determines whether given location strings refer to the same location. We evaluate BertLoc, trained on more than half a million location records used in a real service in South Korea, and compare the results with other popular baseline methods. Our experimental results show that BertLoc outperforms these baselines with an F1-score of 0.952 and shows great promise for detecting duplicate records in a large-scale location dataset.
KW - BERT
KW - big data
KW - CNN
KW - deep learning
KW - duplicate record detection
KW - location search
KW - LSTM
UR - https://www.scopus.com/pages/publications/85104994476
U2 - 10.1145/3412841.3441969
DO - 10.1145/3412841.3441969
M3 - Conference contribution
AN - SCOPUS:85104994476
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 942
EP - 951
BT - Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
PB - Association for Computing Machinery
Y2 - 22 March 2021 through 26 March 2021
ER -