Clustering Documents with Active Learning Using Wikipedia


Clustering Documents with Active Learning Using Wikipedia is a scientific work related to Wikipedia quality, published in 2008 and written by Anna-Lan Huang, David N. Milne, Eibe Frank and Ian H. Witten.

Overview

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to use it for document clustering. In this paper, the authors propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. The authors first use Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. They then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering in the direction indicated by the constraints. The authors test the approach on three standard text document datasets. Empirical results show that the basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
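
The idea of turning concept relatedness into pair-wise instance-level constraints can be illustrated with a minimal sketch. It is not the authors' implementation: the relatedness scores, the document similarity measure, the two thresholds, and all function names (`relatedness`, `doc_similarity`, `derive_constraints`) are assumptions made up for illustration, and the active learning step is not modelled. Each document is assumed to be already mapped to a set of Wikipedia concepts.

```python
from itertools import combinations

# Hypothetical relatedness scores in [0, 1] between Wikipedia concepts.
# A real system would compute these from Wikipedia; here they are toy values.
RELATEDNESS = {
    frozenset({"Jaguar (car)", "Automobile"}): 0.8,
    frozenset({"Jaguar (animal)", "Big cat"}): 0.9,
    frozenset({"Automobile", "Big cat"}): 0.05,
    frozenset({"Jaguar (car)", "Big cat"}): 0.1,
    frozenset({"Jaguar (animal)", "Automobile"}): 0.05,
    frozenset({"Jaguar (car)", "Jaguar (animal)"}): 0.1,
}


def relatedness(a, b):
    """Look up the toy relatedness of two concepts (1.0 if identical)."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def doc_similarity(concepts_a, concepts_b):
    """Symmetric average of each concept's best match in the other document."""
    def directed(src, dst):
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(concepts_a, concepts_b) + directed(concepts_b, concepts_a))


def derive_constraints(docs, must_threshold=0.7, cannot_threshold=0.2):
    """Return (must_link, cannot_link) document pairs; thresholds are assumed."""
    must_link, cannot_link = [], []
    for (i, a), (j, b) in combinations(sorted(docs.items()), 2):
        score = doc_similarity(a, b)
        if score >= must_threshold:
            must_link.append((i, j))
        elif score <= cannot_threshold:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Toy documents, each already represented as a set of Wikipedia concepts.
docs = {
    "d1": {"Jaguar (car)", "Automobile"},
    "d2": {"Automobile"},
    "d3": {"Jaguar (animal)", "Big cat"},
}
must_link, cannot_link = derive_constraints(docs)
print("must-link:", must_link)
print("cannot-link:", cannot_link)
```

A constrained clustering algorithm could then take the must-link and cannot-link pairs as guidance when assigning documents to clusters; how the constraints are selected and enforced in the paper itself is not specified in this overview.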