Wikipedia as an SMT Training Corpus

Authors
Dan Tufiš
Radu Ion
Ştefan Daniel Dumitrescu
Dan C. Ştefǎnescu
Publication date
2013
ISSN
13138502

Wikipedia as an SMT Training Corpus is a scientific work about Wikipedia quality, published in 2013 and written by Dan Tufiš, Radu Ion, Ştefan Daniel Dumitrescu and Dan C. Ştefǎnescu.

Overview

This article reports on large-scale experiments supporting the idea that data extracted from strongly comparable corpora can be used successfully to build statistical machine translation (SMT) systems of reasonable translation quality for new in-domain texts. The experiments were performed for three language pairs, Spanish-English, German-English and Romanian-English, using large bilingual corpora of similar sentence pairs extracted from the complete Wikipedia dumps as of June 2012. The experiments, together with a comparison against similar work, show that indiscriminately adding more data to a training corpus is not necessarily beneficial in SMT.

Embed

Wikipedia Quality

Tufiš, Dan; Ion, Radu; Dumitrescu, Ştefan Daniel; Ştefǎnescu, Dan C. (2013). "[[Wikipedia as an SMT Training Corpus]]". Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370. ISSN: 13138502.

English Wikipedia

{{cite journal |last1=Tufiš |first1=Dan |last2=Ion |first2=Radu |last3=Dumitrescu |first3=Ştefan Daniel |last4=Ştefǎnescu |first4=Dan C. |title=Wikipedia as an SMT Training Corpus |date=2013 |issn=13138502 |url=https://wikipediaquality.com/wiki/Wikipedia_as_an_SMT_Training_Corpus |journal=Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370}}

HTML

Tufiš, Dan; Ion, Radu; Dumitrescu, Ştefan Daniel; Ştefǎnescu, Dan C. (2013). &quot;<a href="https://wikipediaquality.com/wiki/Wikipedia_as_an_SMT_Training_Corpus">Wikipedia as an SMT Training Corpus</a>&quot;. Behaviour and Information Technology Volume 33, Issue 12, 13 December 2014, pp. 1361-1370. ISSN: 13138502.