{{Infobox work
| title = Mining for Domain-Specific Parallel Text from Wikipedia
| date = 2013
| authors = [[Magdalena Plamada]]<br />[[Martin Volk]]
| link = http://aclweb.org/anthology/W15-4623
| plink = https://pdfs.semanticscholar.org/b479/b7e0581fdf576ff55c6cbe687d41324dc2d3.pdf
}}
 
'''Mining for Domain-Specific Parallel Text from Wikipedia''' - a scientific work related to [[Wikipedia quality]], published in 2013 and written by [[Magdalena Plamada]] and [[Martin Volk]].
 
== Overview ==
 
Previous attempts at extracting parallel data from [[Wikipedia]] were restricted by the monotonicity constraint of the alignment algorithm used for matching candidate sentences. This paper proposes a method for exploiting Wikipedia articles that does not depend on the position of the sentences within the text. The algorithm ranks the candidate sentence pairs by means of a customized metric that combines several similarity criteria. Moreover, the authors limit the search space to a specific topical domain, since the final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data is not refined enough to improve a strong in-domain SMT system. Nevertheless, it is good enough to boost the performance of an out-of-domain system trained on sizable amounts of data.
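
The sketch below illustrates the general idea of ranking candidate sentence pairs by a weighted combination of similarity criteria. It is a simplified, hypothetical example: the features (length ratio and dictionary-based lexical overlap), the weights, and the toy bilingual dictionary are assumptions made for illustration and do not reproduce the customized metric or the similarity criteria used in the paper.

<syntaxhighlight lang="python">
# Hypothetical sketch: score candidate (source, target) sentence pairs with a
# weighted combination of similarity features and rank them best-first.
# Features and weights are illustrative assumptions, not the paper's metric.

def length_ratio(src: str, tgt: str) -> float:
    """Similarity in token length: 1.0 for equal lengths, lower otherwise."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def lexical_overlap(src: str, tgt: str, bilingual_dict: dict) -> float:
    """Fraction of source tokens with at least one dictionary translation in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(
        1 for tok in src_tokens
        if any(translation in tgt_tokens for translation in bilingual_dict.get(tok, []))
    )
    return hits / len(src_tokens)


def rank_candidates(candidates, bilingual_dict, weights=(0.4, 0.6)):
    """Score every (source, target) candidate pair and return them best-first."""
    w_len, w_lex = weights
    scored = [
        (w_len * length_ratio(s, t) + w_lex * lexical_overlap(s, t, bilingual_dict), s, t)
        for s, t in candidates
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Toy bilingual dictionary and candidate pairs, purely for illustration.
    bi_dict = {"der": ["the"], "gletscher": ["glacier"], "schmilzt": ["melts"]}
    pairs = [
        ("Der Gletscher schmilzt", "The glacier melts"),
        ("Der Gletscher schmilzt", "Mountaineering requires proper training"),
    ]
    for score, src, tgt in rank_candidates(pairs, bi_dict):
        print(f"{score:.2f}  {src}  ||  {tgt}")
</syntaxhighlight>

In this toy setup the matching translation receives the highest score, while the unrelated sentence is ranked last; applying a threshold to such scores is one simple way to separate extracted pairs from noise.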
 