'''Mining for Domain-Specific Parallel Text from Wikipedia''' - scientific work related to [[Wikipedia quality]] published in 2013, written by [[Magdalena Plamada]] and [[Martin Volk]].
  
 
== Overview ==
 
Previous attempts at extracting parallel data from [[Wikipedia]] were restricted by the monotonicity constraint of the alignment algorithm used for matching candidate sentences. The paper proposes a method for exploiting Wikipedia articles that does not depend on the position of the sentences in the text. The algorithm ranks candidate sentence pairs by means of a customized metric that combines different similarity criteria. Moreover, the authors limit the search space to a specific topical domain, since the final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data is not refined enough to improve a strong in-domain SMT system. Nevertheless, it is good enough to boost the performance of an out-of-domain system trained on sizable amounts of data.
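
The core idea can be illustrated with a short sketch. The Python snippet below scores every sentence pair drawn from a comparable article pair with a weighted combination of similarity criteria and keeps the best-scoring candidates. Because every source sentence is compared against every target sentence, no positional (monotonicity) constraint is imposed, mirroring the paper's departure from earlier alignment approaches. The specific criteria, weights, threshold, and toy lexicon here are illustrative assumptions, not the customized metric proposed by Plamada and Volk.

<syntaxhighlight lang="python">
# Illustrative sketch: rank candidate sentence pairs from a comparable
# article pair by a combined similarity score. The criteria and weights
# below are assumptions for demonstration, not the paper's actual metric.
from itertools import product

def length_ratio(src, tgt):
    """Penalize pairs whose token lengths diverge strongly."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def dictionary_overlap(src, tgt, lexicon):
    """Fraction of source tokens with a translation in the target sentence."""
    tgt_tokens = set(tgt.lower().split())
    src_tokens = src.lower().split()
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_tokens)
    return hits / len(src_tokens) if src_tokens else 0.0

def shared_tokens(src, tgt):
    """Overlap of identical tokens (numbers, cognates, named entities)."""
    a = {t for t in src.lower().split() if t.isalnum()}
    b = {t for t in tgt.lower().split() if t.isalnum()}
    return len(a & b) / len(a | b) if a | b else 0.0

def score(src, tgt, lexicon, weights=(0.2, 0.6, 0.2)):
    """Weighted combination of the similarity criteria (weights assumed)."""
    w1, w2, w3 = weights
    return (w1 * length_ratio(src, tgt)
            + w2 * dictionary_overlap(src, tgt, lexicon)
            + w3 * shared_tokens(src, tgt))

def rank_candidates(src_sents, tgt_sents, lexicon, threshold=0.3):
    """Score every sentence pair regardless of position, highest first."""
    pairs = ((score(s, t, lexicon), s, t)
             for s, t in product(src_sents, tgt_sents))
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)

# Toy German-English lexicon and article sentences (hypothetical data).
lexicon = {"gletscher": {"glacier"}, "schmilzt": {"melts"}, "der": {"the"}}
de = ["Der Gletscher schmilzt seit 1990.", "Siehe auch Klimawandel."]
en = ["The glacier has been melting since 1990.", "History of the region."]
for sc, s, t in rank_candidates(de, en, lexicon):
    print(f"{sc:.2f}  {s}  ||  {t}")
</syntaxhighlight>

In this sketch the threshold filters out weakly matching pairs, so only plausible translation candidates survive; in a real extraction pipeline such filtering would trade precision against recall.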
