Difference between revisions of "Using Mahout for Clustering Wikipedia's Latest Articles: a Comparison Between K-Means and Fuzzy C-Means in the Cloud"

From Wikipedia Quality

Revision as of 08:14, 2 May 2020

Using Mahout for Clustering Wikipedia's Latest Articles: a Comparison Between K-Means and Fuzzy C-Means in the Cloud - a scientific work related to Wikipedia quality, published in 2011 and written by Rui Máximo Esteves and Chunming Rong.

Overview

This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. The authors made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research, the authors found that on a noisy dataset fuzzy c-means can lead to worse cluster quality than k-means. Nor does k-means always converge faster than fuzzy c-means. The authors also found that Mahout is a promising clustering technology, but that its preprocessing tools are not yet developed enough for efficient dimensionality reduction. From their experience, the use of Apache Mahout is still premature.
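
To make the contrast between the two algorithms concrete, the following is a minimal sketch of their assignment steps only, not the paper's code and not Mahout's API. The function names, the Euclidean distance, and the fuzzifier value m = 2 are assumptions chosen for illustration. k-means assigns each point to exactly one cluster, while fuzzy c-means gives each point a membership degree in every cluster, computed with the standard formula u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_assign(point, centers):
    """Hard assignment (k-means): index of the nearest center."""
    distances = [euclidean(point, c) for c in centers]
    return distances.index(min(distances))


def fcm_memberships(point, centers, m=2.0):
    """Soft assignment (fuzzy c-means): one membership degree per center.

    The degrees sum to 1. `m` is the fuzzifier (assumed 2.0 here); values
    closer to 1 make the memberships behave more like a hard assignment.
    """
    distances = [euclidean(point, c) for c in centers]
    # If the point coincides with one or more centers, share full
    # membership among those centers to avoid dividing by zero.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

For a point at distance 1 from one center and 3 from the other, with m = 2, `fcm_memberships` yields degrees of about 0.9 and 0.1, whereas `kmeans_assign` simply returns the index of the nearer center. In a noisy dataset, points far from every center still receive non-trivial membership in several clusters, which is one plausible reading of why the soft assignment could blur cluster quality.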