Dialogue response coherency evaluation with feature sensitive negative sample using multi list-wise ranking loss

Yeong Jun Hwang, Dongjun Kang, Jin Yeong Bak

Research output: Contribution to journal › Article › peer-review

Abstract

Automatic evaluation of dialogue coherency is crucial for developing high-quality dialogue systems. However, traditional evaluation metrics such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) have limitations when assessing diverse and creative responses because they rely heavily on reference responses. Learnable metrics that utilize contrastive learning also face challenges: randomly selected negative samples do not reflect conversational features (e.g., topic, emotion, intention), and the resulting scores lack granularity in assessing response appropriateness. To address these limitations, we propose the Feature sensitive Multi-Listwise Ranking (FMListR) response coherency evaluation model, which evaluates dialogue coherency in graded degrees while taking conversational features into account. The approach samples feature-sensitive responses that share conversational features with ground-truth responses and uses them as hard negative samples. The model is trained with a Multi-Listwise Ranking (MListR) loss, designed to learn the ranking among negative samples and to identify response features. Experimental results demonstrate that FMListR exhibits stronger correlations with human judgment than other response coherency evaluation metrics. By considering conversational features and training with a specialized loss function, FMListR provides a more robust and accurate evaluation of dialogue coherency.
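The listwise ranking idea described above can be illustrated with a minimal Plackett-Luce style loss over coherence scores, where the ground-truth response should outrank feature-sensitive hard negatives, which in turn should outrank random negatives. This is a hedged sketch under those assumptions, not the paper's actual MListR loss; the function name and score values are illustrative only.

```python
import math

def listwise_ranking_loss(scores):
    """Plackett-Luce (ListMLE-style) listwise loss sketch.

    `scores` are model coherence scores ordered by the desired rank:
    scores[0] = ground-truth response, followed by feature-sensitive
    hard negatives, then random negatives. The loss is minimized when
    scores[0] > scores[1] > ... (illustrative, not the FMListR loss).
    """
    loss = 0.0
    for i in range(len(scores)):
        # Probability that item i is ranked first among the remainder.
        denom = sum(math.exp(s) for s in scores[i:])
        loss -= math.log(math.exp(scores[i]) / denom)
    return loss

# A correctly ordered list (positive > hard negative > random negative)
# incurs a lower loss than the reversed ordering.
good = listwise_ranking_loss([3.0, 1.5, 0.2])
bad = listwise_ranking_loss([0.2, 1.5, 3.0])
```

Training against such an objective pushes the scorer to separate responses by degree of coherence rather than by a binary positive/negative decision, which is the granularity the abstract argues random negative sampling lacks.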

Original language: English
Article number: 110609
Journal: Engineering Applications of Artificial Intelligence
Volume: 150
DOIs
State: Published - 15 Jun 2025

Keywords

  • Contrastive learning
  • Dialogue response evaluation
  • Feature sensitive multi-listwise ranking model
  • Natural language generation evaluation
  • Negative sampling
