Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information

Authors	Ling-Xiang Tang, Shlomo Geva, Yue Xu, Andrew Trotman
Publication date	2009

Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information is a scientific work related to Wikipedia quality, published in 2009 and written by Ling-Xiang Tang, Shlomo Geva, Yue Xu and Andrew Trotman.

Overview

In this paper, the authors propose an unsupervised segmentation approach named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. Because it is unsupervised, the approach avoids the considerable effort of preparing and maintaining manually segmented Chinese text for training, and of manually maintaining ever-expanding lexicons. Mutual information had previously been used for automated segmentation into 2-character words; NGMI extends that idea to handle longer n-character words. The authors report good results in experiments with heterogeneous documents from the Chinese Wikipedia collection.
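The 2-character baseline mentioned above can be illustrated with a minimal sketch. It assumes pointwise mutual information between adjacent characters and a fixed cut-off; it is not the paper's NGMI formula, which extends the idea to longer n-grams and is not reproduced on this page. The function names, the toy corpus and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b); -inf if unseen."""
    pair_count = bigrams[a + b]
    if pair_count == 0:
        return float("-inf")
    p_ab = pair_count / sum(bigrams.values())
    total = sum(unigrams.values())
    return math.log2(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Place a boundary between adjacent characters whose PMI is below threshold."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi(prev, cur, unigrams, bigrams) >= threshold:
            words[-1] += cur      # strongly associated: stay in the same word
        else:
            words.append(cur)     # weakly associated: start a new word
    return words


# Toy corpus (an assumption, standing in for Chinese Wikipedia statistics).
uni, bi = train_counts(["中国", "中国", "人民", "人民", "中国人民"])
print(segment("中国人民", uni, bi, threshold=2.0))  # ['中国', '人民']
```

In this toy corpus the pairs 中国 and 人民 occur more often than 国人, so only the boundary between 国 and 人 falls below the threshold. A real system would estimate the counts from a large corpus and would need a principled way to choose the cut-off.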

Embed

Wikipedia Quality

Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). "[[Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information]]". School of Information Technologies, University of Sydney.

English Wikipedia

{{cite journal |last1=Tang |first1=Ling-Xiang |last2=Geva |first2=Shlomo |last3=Xu |first3=Yue |last4=Trotman |first4=Andrew |title=Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information |date=2009 |url=https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information |journal=School of Information Technologies, University of Sydney}}

HTML

Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). "<a href="https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information">Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information</a>". School of Information Technologies, University of Sydney.
