Combining lexical and statistical translation evidence for cross-language information retrieval

Sungho Kim, Youngjoong Ko, Douglas W. Oard

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

This article explores how best to use lexical and statistical translation evidence together for cross-language information retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine-readable dictionary, statistical translation evidence is drawn from parallel corpora, and evidence from co-occurrence in the document language provides a basis for limiting the adverse effect of translation ambiguity. Coverage statistics for NII Testbeds and Community for Information Access Research (NTCIR) queries confirm that these resources have complementary strengths. Experiments with translation evidence from a small parallel corpus indicate that even rather rough estimates of translation probabilities can yield further improvements over a strong technique for translation weighting based on using Jensen-Shannon divergence as a term-association measure. Finally, a novel approach to posttranslation query expansion using a random walk over the Wikipedia concept link graph is shown to yield further improvements over alternative techniques for posttranslation query expansion. Evaluation results on the NTCIR-5 English-Korean test collection show statistically significant improvements over strong baselines.

Original language: English
Pages (from-to): 23-39
Number of pages: 17
Journal: Journal of the Association for Information Science and Technology
Volume: 66
Issue number: 1
State: Published - 1 Jan 2015
Externally published: Yes

Keywords

  • information retrieval
