Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections

Authors: Xuan Hieu Phan, Nguyen Minh E. Le Nguyen, Susumu Horiguchi
Publication date: 2008
ISBN: 978-160558085-2
DOI: 10.1145/1367497.1367510

Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections is a scientific work about Wikipedia quality, published in 2008 and written by Xuan Hieu Phan, Nguyen Minh E. Le Nguyen and Susumu Horiguchi.

Overview

This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy because of data sparseness. The authors therefore propose gaining external knowledge to make the data more related and to expand the coverage of classifiers so that they handle future data better. The underlying idea of the framework is that, for each classification task, the authors collect a large-scale external data collection called the "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text. The authors carried out a careful evaluation on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks, "Web search domain disambiguation" and "disease categorization for medical text", and report a significant improvement in classification quality.
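The framework described above can be illustrated with a minimal sketch of the feature-augmentation step. The overview does not name the topic model or the classifier, so everything here is an assumption made for illustration: an LDA-style model trained with collapsed Gibbs sampling stands in for "hidden topics", and the function names `train_topics`, `infer_topics` and `augment`, the hyperparameters, the `topic:k` pseudo-word format and the toy data are all hypothetical.

```python
import random
from collections import Counter

def train_topics(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style model on tokenized docs (the "universal dataset")
    with collapsed Gibbs sampling; return per-topic word probabilities."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    assign = []
    for d, doc in enumerate(docs):                        # random initial assignment
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                          # remove current assignment
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights)[0]
                assign[d][i] = k                          # record resampled topic
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + v * beta) for w in vocab}
        for k in range(num_topics)
    ]

def infer_topics(tokens, phi, alpha=0.1, iters=20):
    """Estimate a topic distribution for a short snippet by iterative
    fold-in against fixed topics; words unseen in training are ignored."""
    k_range = range(len(phi))
    known = [w for w in tokens if w in phi[0]]
    theta = [1.0 / len(phi)] * len(phi)
    for _ in range(iters):
        counts = [alpha] * len(phi)
        for w in known:
            post = [theta[k] * phi[k][w] for k in k_range]
            total = sum(post)
            for k in k_range:
                counts[k] += post[k] / total
        norm = sum(counts)
        theta = [c / norm for c in counts]
    return theta

def augment(tokens, phi, scale=10):
    """Append topic pseudo-words in proportion to the inferred topic weights
    (an assumed way of combining words and hidden topics)."""
    theta = infer_topics(tokens, phi)
    extra = []
    for k, p in enumerate(theta):
        extra.extend([f"topic:{k}"] * round(p * scale))
    return list(tokens) + extra

# Toy "universal dataset" and one short snippet (made-up data).
universal = [
    ["gene", "protein", "cell", "gene", "cell"],
    ["protein", "cell", "disease", "gene"],
    ["stock", "market", "trade", "stock", "price"],
    ["market", "price", "trade", "stock"],
]
phi = train_topics(universal, num_topics=2)
snippet = ["gene", "disease"]
augmented = augment(snippet, phi)
```

The augmented token list can then be passed to any bag-of-words classifier trained on the small labeled set, so that two snippets sharing no words can still share topic features.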

Embed

Wikipedia Quality

Phan, Xuan Hieu; Le Nguyen, Nguyen Minh E.; Horiguchi, Susumu. (2008). "[[Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections]]". Proceedings of the National Conference on Artificial Intelligence Volume 2, 2008, pp. 1132-1137. ISBN: 978-160558085-2. DOI: 10.1145/1367497.1367510.

English Wikipedia

{{cite journal |last1=Phan |first1=Xuan Hieu |last2=Le Nguyen |first2=Nguyen Minh E. |last3=Horiguchi |first3=Susumu |title=Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections |date=2008 |isbn=978-160558085-2 |doi=10.1145/1367497.1367510 |url=https://wikipediaquality.com/wiki/Learning_to_Classify_Short_and_Sparse_Text_&_Web_with_Hidden_Topics_from_Large-Scale_Data_Collections |journal=Proceedings of the National Conference on Artificial Intelligence Volume 2, 2008, pp. 1132-1137}}

HTML

Phan, Xuan Hieu; Le Nguyen, Nguyen Minh E.; Horiguchi, Susumu. (2008). &quot;<a href="https://wikipediaquality.com/wiki/Learning_to_Classify_Short_and_Sparse_Text_&_Web_with_Hidden_Topics_from_Large-Scale_Data_Collections">Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections</a>&quot;. Proceedings of the National Conference on Artificial Intelligence Volume 2, 2008, pp. 1132-1137. ISBN: 978-160558085-2. DOI: 10.1145/1367497.1367510.