Difference between revisions of "Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia"

Latest revision as of 08:00, 16 January 2021

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia
Authors	Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Publication date	2016
DOI	10.1145/2833089
Links	Original

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia is a scientific work related to Wikipedia quality, published in 2016 and written by Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi.

Overview

Parallel corpora are crucial for statistical machine translation (SMT); however, they are scarce for most language pairs and domains. Because comparable corpora are far more widely available, many studies have extracted either parallel sentences or parallel fragments from them for SMT. In this article, the authors propose an integrated system that extracts both. The authors first apply parallel sentence extraction to identify parallel sentences among comparable sentences, and then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and a classifier for parallel sentence identification; the authors improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. The authors propose an accurate parallel fragment extraction method that uses an alignment model to locate parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese--Japanese Wikipedia indicates that the proposed methods outperform previously proposed methods and that the parallel data extracted by the system significantly improves SMT performance.
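The lexicon-based filtering step mentioned in the overview can be illustrated with a minimal sketch. This is not the authors' implementation: the coverage score, the `threshold` and `min_len` parameters, and the function names `coverage` and `filter_fragments` are all assumptions made for illustration, and the candidate fragment pairs are assumed to have been produced already by some alignment model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon type: source word -> set of acceptable target translations.
Lexicon = Dict[str, Set[str]]
FragmentPair = Tuple[List[str], List[str]]


def coverage(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Fraction of source tokens with at least one lexicon translation in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    return hits / len(src)


def filter_fragments(
    candidates: List[FragmentPair],
    lexicon: Lexicon,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    min_len: int = 2,        # assumed minimum fragment length in tokens
) -> List[FragmentPair]:
    """Keep candidate fragment pairs whose lexicon coverage reaches threshold."""
    kept = []
    for src, tgt in candidates:
        if len(src) < min_len or len(tgt) < min_len:
            continue  # very short fragments are treated as unreliable here
        if coverage(src, tgt, lexicon) >= threshold:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    # Toy Chinese -> Japanese lexicon and candidates, invented for this example.
    toy_lexicon = {"图书馆": {"図書館"}, "大学": {"大学"}, "学生": {"学生"}}
    toy_candidates = [
        (["大学", "图书馆"], ["大学", "図書館"]),
        (["学生", "昨天"], ["図書館", "明日"]),
    ]
    for src, tgt in filter_fragments(toy_candidates, toy_lexicon):
        print(" ".join(src), "|||", " ".join(tgt))
```

In this sketch a candidate survives only if enough of its source tokens have a known translation on the target side, which is one simple way a lexicon can separate truly parallel fragments from merely co-occurring ones.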

Embed

Wikipedia Quality

Chu, Chenhui; Nakazawa, Toshiaki; Kurohashi, Sadao. (2016). "[[Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia]]". DOI: 10.1145/2833089.

English Wikipedia

{{cite journal |last1=Chu |first1=Chenhui |last2=Nakazawa |first2=Toshiaki |last3=Kurohashi |first3=Sadao |title=Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia |date=2016 |doi=10.1145/2833089 |url=https://wikipediaquality.com/wiki/Integrated_Parallel_Sentence_and_Fragment_Extraction_from_Comparable_Corpora:_a_Case_Study_on_Chinese--Japanese_Wikipedia}}

HTML

Chu, Chenhui; Nakazawa, Toshiaki; Kurohashi, Sadao. (2016). "<a href="https://wikipediaquality.com/wiki/Integrated_Parallel_Sentence_and_Fragment_Extraction_from_Comparable_Corpora:_a_Case_Study_on_Chinese--Japanese_Wikipedia">Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: a Case Study on Chinese--Japanese Wikipedia</a>". DOI: 10.1145/2833089.

[[Category:Scientific works]]
[[Category:Japanese Wikipedia]]
[[Category:Chinese Wikipedia]]
