Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information
Authors | Ling-Xiang Tang, Shlomo Geva, Yue Xu, Andrew Trotman |
---|---|
Publication date | 2009 |
Links | Original Preprint |
Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information is a scientific work related to Wikipedia quality, published in 2009 and written by Ling-Xiang Tang, Shlomo Geva, Yue Xu and Andrew Trotman.
Overview
In this paper, the authors propose an unsupervised segmentation approach named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach reduces the considerable effort required to prepare and maintain manually segmented Chinese text for training purposes, and to manually maintain ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words; NGMI extends that approach to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
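To make the underlying idea concrete, the following is a minimal Python sketch of the earlier adjacent-character mutual information baseline mentioned above, not the authors' NGMI method. It assumes pointwise mutual information computed from raw character and character-pair counts, and a boundary rule that splits wherever the score drops below a threshold. The function names (`train_counts`, `pmi`, `segment`), the default threshold of 0.0, and the four-string toy corpus are all hypothetical choices made for illustration; the paper's actual scoring and statistics may differ.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams = Counter()
    bigrams = Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log2(p(ab) / (p(a) * p(b))).

    Returns -inf when the pair (or either character) was never observed.
    """
    pair_count = bigrams.get(a + b, 0)
    if pair_count == 0 or unigrams.get(a, 0) == 0 or unigrams.get(b, 0) == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = pair_count / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Split text wherever the PMI of two adjacent characters is below threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi(prev, cur, unigrams, bigrams) >= threshold:
            words[-1] += cur      # strong association: keep characters together
        else:
            words.append(cur)     # weak association: start a new word
    return words


# Toy corpus (assumed for illustration only).
corpus = ["中国", "中国", "人民", "人民"]
unigrams, bigrams = train_counts(corpus)
print(segment("中国人民", unigrams, bigrams))  # ['中国', '人民']
```

In this toy corpus the pair 国人 never occurs, so its score is negative infinity and a boundary is placed there, while 中国 and 人民 each score 3 bits and stay joined. A real system would estimate the counts from a large collection such as the Chinese Wikipedia corpus rather than from four short strings.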
Embed
Wikipedia Quality
Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). "[[Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information]]". School of Information Technologies, University of Sydney.
English Wikipedia
{{cite journal |last1=Tang |first1=Ling-Xiang |last2=Geva |first2=Shlomo |last3=Xu |first3=Yue |last4=Trotman |first4=Andrew |title=Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information |date=2009 |url=https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information |journal=School of Information Technologies, University of Sydney}}
HTML
Tang, Ling-Xiang; Geva, Shlomo; Xu, Yue; Trotman, Andrew. (2009). "<a href="https://wikipediaquality.com/wiki/Word_Segmentation_for_Chinese_Wikipedia_Using_N-Gram_Mutual_Information">Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information</a>". School of Information Technologies, University of Sydney.