Constructing a Chinese―Japanese Parallel Corpus from Wikipedia



Authors: Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Publication date: 2014

Constructing a Chinese―Japanese Parallel Corpus from Wikipedia is a scientific work related to Wikipedia quality, published in 2014 and written by Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi.

Overview

Affiliation: Graduate School of Informatics, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan. E-mail: {chu, nakazawa}@nlp.ist.i.kyoto-u.ac.jp, kuro@i.kyoto-u.ac.jp

Abstract: Parallel corpora are crucial for statistical machine translation (SMT). However, they are scarce for most language pairs, including Chinese–Japanese. Because comparable corpora are far more widely available, many studies have tried to construct parallel corpora from comparable corpora automatically. This paper presents a robust parallel sentence extraction system for constructing a Chinese–Japanese parallel corpus from Wikipedia. The system is inspired by previous studies, whose systems mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. The authors improve the system by using common Chinese characters for filtering and two novel feature sets for classification. Experiments show that the system performs significantly better than the previous studies in both the accuracy of parallel sentence extraction and SMT performance. Using the system, the authors construct a Chinese–Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/~chu/resource/wiki_zh_ja.tgz.
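The candidate-filtering idea described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, the threshold, or the character mapping table, so `ZH_TO_JA_VARIANTS`, `common_char_ratio`, `filter_candidates` and the `0.5` threshold are all hypothetical choices made for illustration. The binary classifier stage is not shown.

```python
from itertools import product

# Hypothetical, tiny table mapping Simplified Chinese characters to their
# Japanese kanji counterparts. A real table would be far larger; this one
# exists only to make the example work.
ZH_TO_JA_VARIANTS = {"语": "語", "图": "図", "书": "書", "馆": "館"}


def is_cjk_ideograph(ch: str) -> bool:
    """Return True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def common_char_ratio(zh: str, ja: str) -> float:
    """Share of ideographs the two sentences have in common.

    Assumption: the score is the size of the shared character set divided
    by the size of the smaller set. The paper may define it differently.
    """
    zh_chars = {ZH_TO_JA_VARIANTS.get(c, c) for c in zh if is_cjk_ideograph(c)}
    ja_chars = {c for c in ja if is_cjk_ideograph(c)}
    if not zh_chars or not ja_chars:
        return 0.0
    return len(zh_chars & ja_chars) / min(len(zh_chars), len(ja_chars))


def filter_candidates(zh_sents, ja_sents, threshold=0.5):
    """Keep sentence pairs whose common-character ratio reaches the threshold."""
    return [
        (zh, ja)
        for zh, ja in product(zh_sents, ja_sents)
        if common_char_ratio(zh, ja) >= threshold
    ]
```

In a pipeline of the kind the abstract outlines, pairs that survive such a filter would then be passed to the classifier, which decides whether each pair is actually parallel.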
