
BertLoc: Duplicate location record detection in a large-scale location dataset

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Due to a significant increase in the number of location services, as well as services that rely on location information such as real-time maps, there is an enormous need to provide accurate location information to end users. To acquire location records, users or other systems generally submit a location query to a location search engine, which returns the best-matching results. However, location datasets often contain inconsistency, noise, and ambiguity. In particular, there are many cases where the same location is recorded under different names by different data sources, which can both confuse users and produce inaccurate results. Therefore, detecting duplicate location information in a large database and accurately merging it into a single location record are critical. In this work, we propose BertLoc, a novel deep learning-based architecture that detects duplicate locations represented in different ways (e.g., Cafe vs. Coffee House) and effectively merges them into a single, consistent location record. BertLoc is based on a multilingual BERT model followed by a BiLSTM and a CNN, which together compare given location strings and determine whether they refer to the same location. We evaluate BertLoc, trained on more than half a million location records used in a real service in South Korea, and compare the results with other popular baseline methods. Our experimental results show that BertLoc outperforms these baselines, achieving an F1-score of 0.952, and shows great promise in detecting duplicate records in a large-scale location dataset.
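To make the task and the reported metric concrete, the following is a minimal sketch of pairwise duplicate detection and F1 scoring. It is not BertLoc: the classifier here is a simple character-bigram Jaccard similarity, swapped in only for illustration, and the function names, the threshold of 0.5, and the example strings are all assumptions rather than anything taken from the paper.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Lower-case the text, drop whitespace, and return its set of character n-grams."""
    s = "".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def is_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in classifier: flag a pair as duplicate above a similarity threshold."""
    return jaccard(a, b) >= threshold


def f1_score(y_true: List[bool], y_pred: List[bool]) -> float:
    """F1 = harmonic mean of precision and recall over the positive (duplicate) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A surface-similarity check of this kind treats "Seoul Station" and "seoul  station" as identical but does not flag "Cafe" and "Coffee House" as a match, which is the kind of pair the abstract cites as motivation for a learned model.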

Original language: English
Title of host publication: Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
Publisher: Association for Computing Machinery
Pages: 942-951
Number of pages: 10
ISBN (Electronic): 9781450381048
State: Published - 22 Mar 2021
Event: 36th Annual ACM Symposium on Applied Computing, SAC 2021 - Virtual, Online, Korea, Republic of
Duration: 22 Mar 2021 to 26 Mar 2021

Publication series

Name: Proceedings of the ACM Symposium on Applied Computing

Conference

Conference: 36th Annual ACM Symposium on Applied Computing, SAC 2021
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 22/03/21 to 26/03/21

Keywords

  • BERT
  • big data
  • CNN
  • deep learning
  • duplicate record detection
  • location search
  • LSTM
