Difference between revisions of "Clustering Documents with Active Learning Using Wikipedia"


Revision as of 08:26, 22 October 2020

Clustering Documents with Active Learning Using Wikipedia is a scientific work related to Wikipedia quality, published in 2008 and written by Anna-Lan Huang, David N. Milne, Eibe Frank and Ian H. Witten.

Overview

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to use it for document clustering. In this paper, the authors propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. The authors first use Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. They then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding the clustering in the direction indicated by the constraints. The authors test their approach on three standard text document datasets. Empirical results show that the basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
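
The idea of deriving pair-wise instance-level constraints from concept relatedness can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the concept sets, the relatedness scores, the best-match averaging in `doc_relatedness`, and both thresholds are hypothetical values chosen for illustration, and the active learning step that selects which pairs to constrain is not modelled.

```python
from itertools import combinations

# Hypothetical concept-based representation: each document is the set of
# Wikipedia article titles (concepts) assumed to have been found in its text.
DOCS = {
    "d1": {"Jaguar", "Big cat"},
    "d2": {"Leopard", "Big cat"},
    "d3": {"Car", "Engine"},
}

# Hypothetical symmetric relatedness scores in [0, 1] between concepts.
# Pairs that are not listed are treated as unrelated (score 0.0).
RELATEDNESS = {
    frozenset({"Jaguar", "Leopard"}): 0.8,
    frozenset({"Jaguar", "Big cat"}): 0.9,
    frozenset({"Leopard", "Big cat"}): 0.9,
    frozenset({"Car", "Engine"}): 0.9,
    frozenset({"Jaguar", "Car"}): 0.3,
}


def concept_relatedness(a, b):
    """Look up the relatedness of two concepts; identical concepts score 1.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def doc_relatedness(x, y):
    """Symmetric score: each concept is matched to its most related concept
    in the other document, and the best-match scores are averaged."""
    def directed(src, dst):
        return sum(max(concept_relatedness(c, d) for d in dst) for c in src) / len(src)
    return (directed(x, y) + directed(y, x)) / 2


def derive_constraints(docs, must_threshold=0.7, cannot_threshold=0.2):
    """Turn document relatedness into must-link / cannot-link pairs.
    Both thresholds are illustrative assumptions."""
    must_link, cannot_link = [], []
    for a, b in combinations(sorted(docs), 2):
        score = doc_relatedness(docs[a], docs[b])
        if score >= must_threshold:
            must_link.append((a, b))
        elif score <= cannot_threshold:
            cannot_link.append((a, b))
    return must_link, cannot_link


if __name__ == "__main__":
    must_link, cannot_link = derive_constraints(DOCS)
    print("must-link:", must_link)
    print("cannot-link:", cannot_link)
```

A constrained clustering algorithm would then try to keep must-link pairs in the same cluster and cannot-link pairs in different clusters. Pairs whose score falls between the two thresholds produce no constraint in this sketch.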