Clustering Documents with Active Learning Using Wikipedia

From Wikipedia Quality
Revision as of 09:53, 2 November 2020 by Grace (talk | contribs) (+ Infobox work)

Clustering Documents with Active Learning Using Wikipedia
Authors: Anna-Lan Huang, David N. Milne, Eibe Frank, Ian H. Witten
Publication date: 2008
DOI: 10.1109/ICDM.2008.80
Links: Original

Clustering Documents with Active Learning Using Wikipedia is a scientific work related to Wikipedia quality, published in 2008 and written by Anna-Lan Huang, David N. Milne, Eibe Frank and Ian H. Witten.

Overview

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to use it for document clustering. In this paper, the authors propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. The authors first use Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. They then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding the clustering in the direction indicated by the constraints. The authors test the approach on three standard text document datasets. Empirical results show that the basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
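
The idea of turning concept relatedness into pair-wise instance-level constraints can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy documents, the relatedness scores, the thresholds and the function names (concept_relatedness, doc_relatedness, derive_constraints, violations) are all assumptions made for illustration. Each document is a weighted bag of Wikipedia concepts, document pairs whose concepts are strongly related become must-link constraints, clearly unrelated pairs become cannot-link constraints, and a candidate clustering is scored by how many constraints it breaks.

```python
from itertools import combinations

# Toy concept-based representation: each document maps Wikipedia
# article titles (concepts) to weights. All values are made up.
docs = {
    "d1": {"Cluster analysis": 1.0, "Machine learning": 0.5},
    "d2": {"Machine learning": 1.0, "Active learning": 0.5},
    "d3": {"Rugby union": 1.0, "New Zealand": 0.5},
}

# Hypothetical symmetric relatedness scores in [0, 1] between concepts.
relatedness = {
    frozenset(["Cluster analysis", "Machine learning"]): 0.8,
    frozenset(["Cluster analysis", "Active learning"]): 0.6,
    frozenset(["Machine learning", "Active learning"]): 0.9,
    frozenset(["Rugby union", "New Zealand"]): 0.7,
}


def concept_relatedness(a, b):
    """Look up relatedness; identical concepts score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return relatedness.get(frozenset([a, b]), 0.0)


def doc_relatedness(x, y):
    """Weighted mean relatedness over all concept pairs of two documents."""
    total = sum(wx * wy * concept_relatedness(cx, cy)
                for cx, wx in x.items() for cy, wy in y.items())
    norm = sum(wx * wy for wx in x.values() for wy in y.values())
    return total / norm if norm else 0.0


def derive_constraints(docs, must=0.6, cannot=0.1):
    """Assumed thresholds: score >= must gives a must-link pair,
    score <= cannot gives a cannot-link pair, anything else is left free."""
    must_link, cannot_link = [], []
    for a, b in combinations(sorted(docs), 2):
        score = doc_relatedness(docs[a], docs[b])
        if score >= must:
            must_link.append((a, b))
        elif score <= cannot:
            cannot_link.append((a, b))
    return must_link, cannot_link


def violations(labels, must_link, cannot_link):
    """Count the constraints that a cluster assignment breaks."""
    broken = sum(labels[a] != labels[b] for a, b in must_link)
    broken += sum(labels[a] == labels[b] for a, b in cannot_link)
    return broken


must_link, cannot_link = derive_constraints(docs)
print(must_link, cannot_link)
print(violations({"d1": 0, "d2": 0, "d3": 1}, must_link, cannot_link))
```

In a constrained clustering setting, a count like the one returned by violations could be used as a penalty term when assigning documents to clusters. In an active learning setting, the pairs left unconstrained by the thresholds would be natural candidates to query next. Both uses are suggestions rather than details taken from the paper.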