A Factory of Comparable Corpora from Wikipedia
Authors | Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba, Lluís Màrquez |
---|---|
Publication date | 2015 |
DOI | 10.18653/v1/W15-3402 |
Links | Original |
A Factory of Comparable Corpora from Wikipedia is a scientific work related to Wikipedia quality, published in 2015 and written by Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez.
Overview
Many approaches to gathering comparable data from the Web have been developed to date. Nevertheless, producing a high-quality comparable corpus on a specific topic is not straightforward. The authors present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, they automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Their experiments on the English-Spanish pair in the domains of Computer Science, Science, and Sports show that an in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, the authors show that these corpora can help when translating out-of-domain texts.
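The step of pulling parallel sentences out of a comparable article pair can be illustrated with a minimal sketch. It is not the authors' method: it assumes a simple character trigram cosine similarity, and the function names (`char_ngrams`, `cosine`, `extract_pairs`) and the `threshold` value of 0.4 are hypothetical choices made only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_pairs(source_sentences, target_sentences, threshold=0.4):
    """Pair each source sentence with its best-scoring target sentence.

    A pair is kept only if its score reaches the threshold (an assumed,
    untuned value; a real system would calibrate it on held-out data).
    """
    target_vectors = [(t, char_ngrams(t)) for t in target_sentences]
    pairs = []
    for s in source_sentences:
        s_vec = char_ngrams(s)
        best, best_score = None, 0.0
        for t, t_vec in target_vectors:
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs
```

Character n-grams pick up some cross-lingual signal between related languages such as English and Spanish because of shared cognates and named entities, but a practical extractor would likely combine several similarity measures rather than rely on this one alone.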
Embed
Wikipedia Quality
Barrón-Cedeño, Alberto; España-Bonet, Cristina; Boldoba, Josu; Màrquez, Lluís. (2015). "[[A Factory of Comparable Corpora from Wikipedia]]". Association for Computational Linguistics. DOI: 10.18653/v1/W15-3402.
English Wikipedia
{{cite journal |last1=Barrón-Cedeño |first1=Alberto |last2=España-Bonet |first2=Cristina |last3=Boldoba |first3=Josu |last4=Màrquez |first4=Lluís |title=A Factory of Comparable Corpora from Wikipedia |date=2015 |doi=10.18653/v1/W15-3402 |url=https://wikipediaquality.com/wiki/A_Factory_of_Comparable_Corpora_from_Wikipedia |journal=Association for Computational Linguistics}}
HTML
Barrón-Cedeño, Alberto; España-Bonet, Cristina; Boldoba, Josu; Màrquez, Lluís. (2015). "<a href="https://wikipediaquality.com/wiki/A_Factory_of_Comparable_Corpora_from_Wikipedia">A Factory of Comparable Corpora from Wikipedia</a>". Association for Computational Linguistics. DOI: 10.18653/v1/W15-3402.