TY - JOUR
T1 - Dialogue response coherency evaluation with feature sensitive negative sample using multi list-wise ranking loss
AU - Hwang, Yeong Jun
AU - Kang, Dongjun
AU - Bak, Jin Yeong
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2025/6/15
Y1 - 2025/6/15
N2 - Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) are limited when assessing diverse and creative responses because they rely heavily on reference responses. Learnable metrics that utilize contrastive learning face two challenges: randomly selected negative samples do not reflect conversational features (i.e., topic, emotion, intention), and the assessment of response appropriateness lacks granularity. To address these limitations, we propose the Feature-sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model, which evaluates dialogue coherency in degrees while accounting for conversation-sensitive features. The approach samples feature-sensitive responses that share conversational features with the ground-truth response and uses them as hard negative samples. The model is trained with a Multi-Listwise Ranking (MListR) loss designed to learn the ranking among negative samples and identify response features. Experimental results demonstrate that FMListR exhibits stronger correlations with human judgment than other response coherency evaluation metrics. By considering conversational features and training with a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.
AB - Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) are limited when assessing diverse and creative responses because they rely heavily on reference responses. Learnable metrics that utilize contrastive learning face two challenges: randomly selected negative samples do not reflect conversational features (i.e., topic, emotion, intention), and the assessment of response appropriateness lacks granularity. To address these limitations, we propose the Feature-sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model, which evaluates dialogue coherency in degrees while accounting for conversation-sensitive features. The approach samples feature-sensitive responses that share conversational features with the ground-truth response and uses them as hard negative samples. The model is trained with a Multi-Listwise Ranking (MListR) loss designed to learn the ranking among negative samples and identify response features. Experimental results demonstrate that FMListR exhibits stronger correlations with human judgment than other response coherency evaluation metrics. By considering conversational features and training with a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.
KW - Contrastive learning
KW - Dialogue response evaluation
KW - Feature-sensitive multi-listwise ranking model
KW - Natural language generation evaluation
KW - Negative sampling
UR - https://www.scopus.com/pages/publications/105001052138
U2 - 10.1016/j.engappai.2025.110609
DO - 10.1016/j.engappai.2025.110609
M3 - Article
AN - SCOPUS:105001052138
SN - 0952-1976
VL - 150
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 110609
ER -