Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems

Youngjae Chang, Dooyoung Kim, Jinyoung Kim, Hyunmook Cha, Keunha Kim, Sooyoung Min, Youngjoong Ko, Kye Hwan Lee, Joonwoo Park

Research output: Contribution to conference › Paper › peer-review

Abstract

The Situated Interactive MultiModal Conversations (SIMMC 2.1) Challenge 2022 is hosted by the Eleventh Dialog System Technology Challenge (DSTC11). The SIMMC task is to build a shopping assistant agent that can converse with customers in a virtual store. The agent must process store scenes and product catalogs along with the customer's request, and the task can be decomposed into four steps, each forming a subtask. In this work, we investigate monolithic transformers, fusion transformers, and language transformers as three distinct multimodal modeling approaches, and we evaluate the potential of each. We also devise a retrieval-based method for acquiring the metadata of each object, which significantly improves the accuracy of predicted object characteristics. Furthermore, we identify a discrepancy that arises when pretrained language models are applied to dialog tasks and propose a simple domain-adaptation method. Our model placed third in the object coreference, dialog state tracking, and response generation subtasks.

Original language: English
Pages: 25-30
Number of pages: 6
State: Published - 2023
Event: 11th Dialog System Technology Challenge, DSTC 2023 - Prague, Czech Republic
Duration: 11 Sep 2023 → …

Conference

Conference: 11th Dialog System Technology Challenge, DSTC 2023
Country/Territory: Czech Republic
City: Prague
Period: 11/09/23 → …
