An Ensemble Approach for Text Document Clustering Using Wikipedia Concepts

An Ensemble Approach for Text Document Clustering Using Wikipedia Concepts is a scientific work related to Wikipedia quality, published in 2014 and written by Seyednaser Nourashrafeddin, Evangelos E. Milios and Dirk V. Arnold.

Overview

Most text clustering algorithms represent a corpus as a document-term matrix in the bag-of-words model. The feature values are computed from term frequencies in documents, and no semantic relatedness between terms is considered. As a result, two semantically similar documents may end up in different clusters if they do not share any terms. One solution to this problem is to enrich the document representation using an external resource such as Wikipedia. In this work, the authors propose a new way to integrate Wikipedia concepts into partitional text document clustering. A text corpus is first represented as both a document-term matrix and a document-concept matrix. The terms that occur in the corpus are then clustered based on the document-term representation. Given the term clusters, the authors propose two methods for finding two sets of seed documents: one based on the document-term representation and the other based on the document-concept representation. The two sets are then used in a text clustering algorithm, in an ensemble approach, to cluster the documents. The experimental results show that even though the document-concept representations do not produce good document clusters on their own, integrating them in an ensemble approach improves the quality of the document clusters significantly.
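The motivation described above, that documents with no shared terms look unrelated in a bag-of-words representation but can look similar in a concept representation, can be illustrated with a minimal Python sketch. It is not the authors' method: the toy corpus, the `TERM_TO_CONCEPT` mapping (a hand-written stand-in for a real Wikipedia concept annotator), and the helper names are all assumptions made for illustration, and the term clustering, seed selection, and ensemble steps are not covered.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): documents 0 and 1 cover the same topic
# but do not share a single term.
docs = [
    "car engine repair",
    "automobile motor maintenance",
    "banana fruit smoothie",
]

# Hypothetical term-to-concept mapping. In practice this would come from
# a Wikipedia-based annotator; here it is written by hand.
TERM_TO_CONCEPT = {
    "car": "Car", "automobile": "Car",
    "engine": "Engine", "motor": "Engine",
    "repair": "Maintenance", "maintenance": "Maintenance",
    "banana": "Banana", "fruit": "Fruit", "smoothie": "Smoothie",
}


def term_vector(doc):
    """One row of a document-term matrix: raw term frequencies."""
    return Counter(doc.split())


def concept_vector(doc):
    """One row of a document-concept matrix: concept frequencies."""
    return Counter(TERM_TO_CONCEPT[t] for t in doc.split() if t in TERM_TO_CONCEPT)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


term_vecs = [term_vector(d) for d in docs]
concept_vecs = [concept_vector(d) for d in docs]

# Similarity of documents 0 and 1 under each representation.
term_sim = cosine(term_vecs[0], term_vecs[1])
concept_sim = cosine(concept_vecs[0], concept_vecs[1])

print("term-based similarity:   ", term_sim)
print("concept-based similarity:", concept_sim)
```

In this toy setting the term-based similarity of the first two documents is zero because their vocabularies are disjoint, while the concept-based similarity is close to one because both map to the same three concepts. The hand-built mapping is deliberately clean; the abstract reports that document-concept representations alone do not yield good clusters and only help when combined with the term-based view.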
