Revision as of 07:45, 25 June 2020
Authors | Magdalena Plamada, Martin Volk
---|---
Publication date | 2013
Links | Original, Preprint
Mining for Domain-Specific Parallel Text from Wikipedia - a scientific work related to Wikipedia quality, published in 2013 and written by Magdalena Plamada and Martin Volk.
Overview
Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used for matching candidate sentences. This paper proposes a method for exploiting Wikipedia articles without regard to the position of the sentences in the text. The algorithm ranks candidate sentence pairs using a customized metric that combines several similarity criteria. Moreover, the authors limit the search space to a specific topical domain, since the final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data are not refined enough to improve a strong in-domain SMT system. Nevertheless, they are good enough to boost the performance of an out-of-domain system trained on sizable amounts of data.
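The ranking approach described above can be illustrated with a minimal Python sketch. The component scores (length ratio, dictionary-based lexical overlap) and their weights are illustrative assumptions for the sake of the example, not the paper's actual similarity criteria:

```python
# Hypothetical sketch: rank candidate sentence pairs by a combined
# similarity metric, independent of sentence position in the articles.
# The features and weights below are illustrative, not the paper's.

def length_ratio(src, tgt):
    """Penalize pairs whose token counts diverge strongly."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt)

def lexical_overlap(src, tgt, lexicon):
    """Fraction of source tokens whose dictionary translation
    appears among the target tokens."""
    tgt_tokens = set(tgt.lower().split())
    src_tokens = src.lower().split()
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok) in tgt_tokens)
    return hits / len(src_tokens)

def combined_score(src, tgt, lexicon, weights=(0.4, 0.6)):
    """Weighted combination of the similarity criteria."""
    w_len, w_lex = weights
    return w_len * length_ratio(src, tgt) + w_lex * lexical_overlap(src, tgt, lexicon)

def rank_candidates(pairs, lexicon):
    """Rank all candidate pairs by the combined metric, best first."""
    return sorted(pairs, key=lambda p: combined_score(p[0], p[1], lexicon),
                  reverse=True)

# Toy English-German example with a tiny translation lexicon.
lexicon = {"the": "das", "house": "haus"}
pairs = [("the house", "ein apfel liegt dort"),
         ("the house", "das haus")]
ranked = rank_candidates(pairs, lexicon)
```

Here the true translation pair scores highest because both its length ratio and its lexical overlap are maximal, while the mismatched pair is penalized on both criteria.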
Embed
Wikipedia Quality
Plamada, Magdalena; Volk, Martin. (2013). "[[Mining for Domain-Specific Parallel Text from Wikipedia]]".
English Wikipedia
{{cite journal |last1=Plamada |first1=Magdalena |last2=Volk |first2=Martin |title=Mining for Domain-Specific Parallel Text from Wikipedia |date=2013 |url=https://wikipediaquality.com/wiki/Mining_for_Domain-Specific_Parallel_Text_from_Wikipedia}}
HTML
Plamada, Magdalena; Volk, Martin. (2013). "<a href="https://wikipediaquality.com/wiki/Mining_for_Domain-Specific_Parallel_Text_from_Wikipedia">Mining for Domain-Specific Parallel Text from Wikipedia</a>".