Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information



Authors
Ling-Xiang Tang
Shlomo Geva
Yue Xu
Andrew Trotman
Publication date
2009

Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information is a scientific work related to Wikipedia quality, published in 2009 and written by Ling-Xiang Tang, Shlomo Geva, Yue Xu and Andrew Trotman.

Overview

In this paper, the authors propose an unsupervised segmentation approach named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach reduces the considerable effort otherwise required to prepare and maintain manually segmented Chinese text for training, and to manually maintain ever-expanding lexicons. Previously, mutual information had been used to automate segmentation into 2-character words; NGMI extends that technique to handle longer n-character words. The authors report good results in experiments with heterogeneous documents from the Chinese Wikipedia collection.
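
To make the underlying idea concrete, here is a minimal Python sketch of the earlier 2-character technique that the overview says NGMI builds on: score each pair of adjacent characters by pointwise mutual information and place a word boundary where the score is low. This is not the paper's NGMI formula, which the overview does not give. The function names (`train_counts`, `pmi`, `segment`), the threshold value, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(pair, unigrams, bigrams):
    """Pointwise mutual information of two adjacent characters.

    Returns -inf for a pair never seen in the corpus.
    """
    if bigrams[pair] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[pair] / n_bi
    p_x = unigrams[pair[0]] / n_uni
    p_y = unigrams[pair[1]] / n_uni
    return math.log(p_xy / (p_x * p_y))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Split text wherever adjacent characters score below the threshold.

    The threshold of 0.0 is a hypothetical default, not a value from the paper.
    """
    if not text:
        return []
    words, current = [], text[0]
    for i in range(1, len(text)):
        if pmi(text[i - 1:i + 1], unigrams, bigrams) >= threshold:
            current += text[i]   # strongly associated: same word
        else:
            words.append(current)  # weakly associated: boundary
            current = text[i]
    words.append(current)
    return words


# Toy corpus standing in for statistics drawn from Chinese Wikipedia.
unigrams, bigrams = train_counts(["中国", "人民", "中国", "人民"])
print(segment("中国人民", unigrams, bigrams))
```

On this toy corpus the pair 国人 never occurs, so the sketch splits the input between 中国 and 人民. A real system would need counts from a large corpus and some smoothing for unseen pairs; handling words longer than two characters is the part that NGMI itself addresses.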

Embed

Wikipedia Quality

Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). "[[Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information]]". School of Information Technologies, University of Sydney.

English Wikipedia

{{cite journal |last1=Tang |first1=Ling-Xiang |last2=Geva |first2=Shlomo |last3=Xu |first3=Yue |last4=Trotman |first4=Andrew |title=Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information |date=2009 |url=https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information |journal=School of Information Technologies, University of Sydney}}

HTML

Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). &quot;<a href="https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information">Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information</a>&quot;. School of Information Technologies, University of Sydney.