Improving neural machine translation by filtering synthetic parallel data

Guanghao Xu, Youngjoong Ko, Jungyun Seo

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise: weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that, despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points on tst2016 and tst2017, respectively.
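The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the similarity score is computed, so everything below is an assumption made for illustration: sentences are represented by averaging word vectors from a shared bilingual embedding space, pairs are scored with cosine similarity, and the top-scoring fraction is kept. The names `sentence_vector`, `score_pair`, `filter_pairs`, `keep_ratio`, and the toy embedding table are hypothetical and are not taken from the paper.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the table; None if none are found."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_pair(src_tokens, tgt_tokens, embeddings):
    """Assumed scoring rule: cosine similarity of averaged bilingual word vectors."""
    src_vec = sentence_vector(src_tokens, embeddings)
    tgt_vec = sentence_vector(tgt_tokens, embeddings)
    if src_vec is None or tgt_vec is None:
        return 0.0  # no known words on one side: treat the pair as unrelated
    return cosine(src_vec, tgt_vec)

def filter_pairs(pairs, embeddings, keep_ratio=0.5):
    """Keep the highest-scoring fraction of (source, target) pairs (at least one)."""
    scored = sorted(pairs, key=lambda p: score_pair(p[0], p[1], embeddings), reverse=True)
    keep = max(1, int(len(scored) * keep_ratio))
    return scored[:keep]

# Toy shared embedding space (hypothetical values): Korean and English words
# with similar meanings sit close together.
toy_embeddings = {
    "고양이": [1.0, 0.0], "cat": [0.9, 0.1],
    "개": [0.0, 1.0], "dog": [0.1, 0.9],
}
# One well-paired and one weakly paired synthetic sentence pair.
pairs = [(["고양이"], ["cat"]), (["고양이"], ["dog"])]
kept = filter_pairs(pairs, toy_embeddings, keep_ratio=0.5)
```

In this toy setup the mismatched pair scores far lower than the matched one, so only the matched pair survives filtering. A real pipeline would load trained bilingual embeddings and tune the selection threshold on held-out data; those details are outside what the abstract states.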

Original language: English
Article number: 1213
Journal: Entropy
Volume: 21
Issue number: 12
State: Published - 1 Dec 2019

Keywords

  • Back translation
  • Bilingual word embeddings
  • Neural machine translation
  • Synthetic data filtering
