Cross-Lingual Dataless Classification for Languages with Small Wikipedia Presence



Authors: Yangqiu Song, Stephen D. Mayhew, Dan Roth
Publication date: 2016
Links: Original Preprint

Cross-Lingual Dataless Classification for Languages with Small Wikipedia Presence is a scientific work related to Wikipedia quality, published in 2016 and written by Yangqiu Song, Stephen D. Mayhew and Dan Roth.

Overview

This paper presents an approach for classifying documents in any language into an English topical label space without any text categorization training data. The approach, Cross-Lingual Dataless Document Classification (CLDDC), relies on mapping the English labels or short category descriptions into a Wikipedia-based semantic representation, and on the use of the target-language Wikipedia. Consequently, performance can suffer when the Wikipedia in the target language is small. The authors focus on languages with small Wikipedias (small-Wikipedia languages, SWLs). They use a word-level dictionary to convert documents in an SWL into a large-Wikipedia language (LWL), and then perform CLDDC based on the LWL's Wikipedia. This approach can be applied to thousands of languages, in contrast to machine translation, which is supervision-heavy and is available for only about 100 languages. The authors also develop a ranking algorithm that uses language similarity metrics to automatically select a good LWL, and show that this significantly improves classification of SWL documents, performing comparably to the best possible bridge language.
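The bridging idea can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary, the two-concept "Wikipedia", the label descriptions and all function names below are hypothetical toy data chosen for illustration, and simple word-overlap counts stand in for the Wikipedia-based semantic representation. The sketch translates an SWL document word by word into an LWL, maps it onto concepts through the LWL article texts, maps the English label descriptions onto the same concepts through the English article texts, and picks the label with the highest cosine similarity.

```python
import math
from collections import Counter

# Hypothetical word-level dictionary from an SWL to an LWL (toy tokens).
SWL_TO_LWL = {"pilota": "balon", "porta": "gol", "diru": "dinero", "banku": "banco"}

# Toy stand-in for Wikipedia: each concept has an English text and an LWL text,
# as if the two articles were joined by a cross-language link (assumption).
CONCEPTS = {
    "Football": {"en": "football ball goal team sports", "lwl": "futbol balon gol equipo"},
    "Bank": {"en": "bank money finance loan", "lwl": "banco dinero finanzas prestamo"},
}

# Hypothetical English label space with short category descriptions.
LABELS = {"sports": "sports football", "finance": "finance money"}


def translate(doc, dictionary):
    """Word-by-word conversion; words missing from the dictionary are kept as-is."""
    return [dictionary.get(tok, tok) for tok in doc.lower().split()]


def concept_vector(tokens, lang):
    """Score each concept by how many tokens occur in its article text for `lang`."""
    counts = Counter(tokens)
    vec = {}
    for concept, texts in CONCEPTS.items():
        score = sum(counts[word] for word in texts[lang].split())
        if score:
            vec[concept] = float(score)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, labels):
    """Return the best English label for an SWL document, plus all label scores."""
    doc_vec = concept_vector(translate(doc, SWL_TO_LWL), "lwl")
    scores = {
        label: cosine(doc_vec, concept_vector(desc.lower().split(), "en"))
        for label, desc in labels.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(classify("pilota porta pilota", LABELS))
    print(classify("diru banku", LABELS))
```

No labeled training documents appear anywhere in the sketch, which is what makes the setting dataless: the only resources are the dictionary, the linked article texts and the label descriptions. Selecting which LWL to bridge through, which the paper handles with a ranking over language similarity metrics, is outside the scope of this example.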
