Abstract
Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains considerable noise, such as weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that, despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points on tst2016 and tst2017, respectively.
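
The filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the similarity score is computed, so everything here is an assumption: sentence vectors are taken as the average of word vectors, the score is cosine similarity, and pairs are kept when the score meets a fixed threshold. The toy embeddings, the romanized tokens, and the names `sentence_vector`, `similarity`, and `filter_pairs` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy bilingual embeddings: both languages share one vector space.
# Romanized Korean tokens stand in for real source-side words (assumption).
SRC_EMB = {"goyangi": [0.9, 0.1], "gae": [0.1, 0.9]}
TGT_EMB = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}


def sentence_vector(tokens, emb):
    """Average the vectors of known tokens; return None if no token is known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(src_tokens, tgt_tokens):
    """Score a sentence pair; pairs with no known words score 0.0."""
    u = sentence_vector(src_tokens, SRC_EMB)
    v = sentence_vector(tgt_tokens, TGT_EMB)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


def filter_pairs(pairs, threshold):
    """Keep only the pairs whose similarity score reaches the threshold."""
    return [(s, t) for s, t in pairs if similarity(s, t) >= threshold]


# Synthetic pairs: (back-translated source tokens, original target tokens).
pairs = [
    (["goyangi"], ["cat"]),  # well matched
    (["goyangi"], ["dog"]),  # weakly paired, should be filtered out
    (["gae"], ["dog"]),      # well matched
]
kept = filter_pairs(pairs, threshold=0.5)
```

In this sketch the threshold of 0.5 is arbitrary; in practice the cutoff, or the fraction of data retained, would need to be tuned on held-out data.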
| Original language | English |
|---|---|
| Article number | 1213 |
| Journal | Entropy |
| Volume | 21 |
| Issue number | 12 |
| DOIs | |
| State | Published - 1 Dec 2019 |
Keywords
- Back translation
- Bilingual word embeddings
- Neural machine translation
- Synthetic data filtering