Wikipedia as an SMT Training Corpus
Authors | Dan Tufiš, Radu Ion, Ştefan Daniel Dumitrescu, Dan C. Ştefǎnescu |
---|---|
Publication date | 2013 |
ISSN | 13138502 |
Wikipedia as an SMT Training Corpus is a scientific work about Wikipedia quality, published in 2013 and written by Dan Tufiš, Radu Ion, Ştefan Daniel Dumitrescu and Dan C. Ştefǎnescu.
Overview
This article reports on large-scale experiments supporting the idea that data extracted from strongly comparable corpora can be used successfully to build statistical machine translation (SMT) systems of reasonable translation quality for new in-domain texts. The experiments were performed for three language pairs (Spanish-English, German-English and Romanian-English), using large bilingual corpora of similar sentence pairs extracted from the complete Wikipedia dumps of June 2012. The authors' experiments, together with a comparison with similar work, show that indiscriminately adding more data to a training corpus is not necessarily beneficial in SMT.
Embed
Wikipedia Quality
Tufiš, Dan; Ion, Radu; Dumitrescu, Ştefan Daniel; Ştefǎnescu, Dan C. (2013). "[[Wikipedia as an SMT Training Corpus]]". Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370. ISSN: 13138502.
English Wikipedia
{{cite journal |last1=Tufiš |first1=Dan |last2=Ion |first2=Radu |last3=Dumitrescu |first3=Ştefan Daniel |last4=Ştefǎnescu |first4=Dan C. |title=Wikipedia as an SMT Training Corpus |date=2013 |issn=13138502 |url=https://wikipediaquality.com/wiki/Wikipedia_as_an_SMT_Training_Corpus |journal=Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370}}
HTML
Tufiš, Dan; Ion, Radu; Dumitrescu, Ştefan Daniel; Ştefǎnescu, Dan C. (2013). "<a href="https://wikipediaquality.com/wiki/Wikipedia_as_an_SMT_Training_Corpus">Wikipedia as an SMT Training Corpus</a>". Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370. ISSN: 13138502.