Using Mahout for Clustering Wikipedia's Latest Articles: a Comparison Between K-Means and Fuzzy C-Means in the Cloud


Using Mahout for Clustering Wikipedia's Latest Articles: a Comparison Between K-Means and Fuzzy C-Means in the Cloud is a scientific work related to Wikipedia quality, published in 2011 and written by Rui Máximo Esteves and Chunming Rong.

Overview

This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. The authors made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, these two algorithms were applied only to small datasets, so earlier studies relied on artificial datasets that do not represent a real document clustering situation. In this ongoing research, the authors found that on a noisy dataset fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. The authors also found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. In their experience, the use of Apache Mahout is premature.
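To make the difference between the two algorithms concrete, the following is a minimal sketch in plain Python. It is not the Mahout implementation used in the paper; the function names, the toy two-dimensional points, the initial centers, and the fuzziness parameter m = 2 are all assumptions chosen for illustration. K-means assigns each point to exactly one cluster, while fuzzy c-means gives each point a degree of membership in every cluster.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to p (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, centers, iters=20):
    """Hard clustering: every point belongs to exactly one center."""
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else list(old)
            )
        centers = updated
    return centers, [nearest(p, centers) for p in points]


def memberships(points, centers, m=2.0, eps=1e-12):
    """Soft assignment: each row holds one point's memberships, summing to 1."""
    power = 1.0 / (m - 1.0)  # exponent applied to inverse squared distances
    rows = []
    for p in points:
        inv = [(1.0 / (sq_dist(p, c) + eps)) ** power for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_cmeans(points, centers, m=2.0, iters=20):
    """Soft clustering: centers are means weighted by membership ** m."""
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        updated = []
        for j in range(len(centers)):
            w = [row[j] ** m for row in u]
            total = sum(w)
            updated.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        centers = updated
    return centers, memberships(points, centers, m)


# Hypothetical toy data: two tight groups plus one point halfway between them.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 9], [9, 10], [5, 5]]
start = [[0.0, 0.0], [10.0, 10.0]]

_, hard_labels = kmeans(points, start)
_, soft_rows = fuzzy_cmeans(points, start)
print("k-means labels:", hard_labels)
print("fuzzy memberships of the middle point:", soft_rows[-1])
```

In this toy setup, k-means has to place the ambiguous middle point in a single cluster, whereas fuzzy c-means splits its membership between the two. The sketch only illustrates hard versus soft assignment; it says nothing about the cluster quality or convergence results reported in the paper.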