Publication
Mining parallel resources for machine translation from comparable corpora
Santanu Pal; Partha Pakray; Alexander Gelbukh; Josef van Genabith
In: Alexander Gelbukh (Ed.). Computational Linguistics and Intelligent Text Processing. Pages 534-544, Lecture Notes in Computer Science (LNCS), Vol. 9041, ISBN 978-3-319-18110-3, Springer, 4/2015.
Abstract
Good performance in Statistical Machine Translation (SMT) is usually achieved with
huge parallel bilingual training corpora, because the translations of words or phrases are
computed from bilingual data. However, for low-resource language pairs such
as English-Bengali, performance suffers from an insufficient amount of bilingual training
data. Recently, comparable corpora have come to be widely regarded as valuable resources for
machine translation. Though very few cases of sub-sentential level parallelism are found
between two comparable documents, there are still potential parallel phrases in comparable
corpora. Mining parallel data from comparable corpora is a promising approach to collecting
more parallel training data for SMT. In this paper, we propose a method for the automatic alignment of
English-Bengali comparable sentences from comparable documents.