TY - GEN
T1 - Hyper-QKSG
T2 - 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2024
AU - Kim, Dooyoung
AU - Jang, Yoonjin
AU - Shin, Dongwook
AU - Park, Chanhoon
AU - Ko, Youngjoong
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - These days, there is an increasing necessity to provide a user with a short knowledge-snippet for a query in commercial information retrieval services such as the featured snippet of Google. In this paper, we focus on how to automatically extract the candidates of query-knowledge snippet pairs from structured HTML documents by using a new Language Model (HTML-PLM). In particular, the proposed system is powerful on extracting them from Tables and Lists, and provides a new framework for automate query generation and knowledge-snippet extraction based on a QA-pair filtering procedure including the snippet refinement and verification processes, which enhance the quality of generated query-knowledge snippet pairs. As a result, 53.8% of the generated knowledge-snippets includes complex HTML structures such as tables and lists in our experiments of a real-world environments, and 66.5% of the knowledge-snippets are evaluated as valid.
AB - These days, there is an increasing necessity to provide a user with a short knowledge-snippet for a query in commercial information retrieval services such as the featured snippet of Google. In this paper, we focus on how to automatically extract the candidates of query-knowledge snippet pairs from structured HTML documents by using a new Language Model (HTML-PLM). In particular, the proposed system is powerful on extracting them from Tables and Lists, and provides a new framework for automate query generation and knowledge-snippet extraction based on a QA-pair filtering procedure including the snippet refinement and verification processes, which enhance the quality of generated query-knowledge snippet pairs. As a result, 53.8% of the generated knowledge-snippets includes complex HTML structures such as tables and lists in our experiments of a real-world environments, and 66.5% of the knowledge-snippets are evaluated as valid.
UR - https://www.scopus.com/pages/publications/85216776547
U2 - 10.18653/v1/2024.emnlp-industry.100
DO - 10.18653/v1/2024.emnlp-industry.100
M3 - Conference contribution
AN - SCOPUS:85216776547
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
SP - 1351
EP - 1360
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
A2 - Dernoncourt, Franck
A2 - Preotiuc-Pietro, Daniel
A2 - Shimorina, Anastasia
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -