Difference between revisions of "Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters"

From Wikipedia Quality
(New study: Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters)
 
(Links)
Line 1: Line 1:
'''Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters''' - scientific work related to Wikipedia quality published in 2014, written by Richi Nayak, Rachel Mills, Christopher De-Vries and Shlomo Geva.
+
'''Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters''' - scientific work related to [[Wikipedia quality]] published in 2014, written by [[Richi Nayak]], [[Rachel Mills]], [[Christopher De-Vries]] and [[Shlomo Geva]].
  
 
== Overview ==
 
Clustering is an important technique in organising and categorising web scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution where the Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped on to those clusters. This solution is based on the assumption that the Wikipedia contains such a wide range of diverse topics that it represents a small scale web. Authors found that it was possible to perform the web scale document clustering and labeling process on one desktop computer under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer granularity clusters such as 10,000 or 50,000. These results were evaluated using a set of external data.
+
Clustering is an important technique in organising and categorising web scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution where the [[Wikipedia]] is clustered and hundreds of millions of web documents in ClueWeb12 are mapped on to those clusters. This solution is based on the assumption that the Wikipedia contains such a wide range of diverse topics that it represents a small scale web. Authors found that it was possible to perform the web scale document clustering and labeling process on one desktop computer under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer granularity clusters such as 10,000 or 50,000. These results were evaluated using a set of external data.

Revision as of 06:54, 17 July 2019

Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters is a scientific work related to Wikipedia quality, published in 2014 and written by Richi Nayak, Rachel Mills, Christopher De-Vries and Shlomo Geva.

Overview

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison with a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution rests on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. The authors found that the web-scale clustering and labeling process could be performed on a single desktop computer in under a couple of days when the Wikipedia clustering solution contained about 1000 clusters. Solutions with finer granularity, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
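
The mapping step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain term-frequency vectors, cosine similarity, and nearest-centroid assignment, none of which the summary specifies. The function names (`vectorize`, `cosine`, `centroid`, `assign_to_cluster`), the cluster labels and the toy texts are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of sparse vectors into one centroid vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def assign_to_cluster(doc_text, centroids):
    """Return the label of the centroid most similar to the document."""
    doc = vectorize(doc_text)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Assumption: the reference collection is already clustered and labeled.
# These two tiny "clusters" stand in for a Wikipedia clustering solution.
reference_clusters = {
    "astronomy": ["planet orbit star telescope", "galaxy star nebula telescope"],
    "cooking": ["recipe oven bake flour", "flour sugar recipe butter"],
}
centroids = {
    label: centroid([vectorize(t) for t in texts])
    for label, texts in reference_clusters.items()
}

# Unlabeled "web" documents inherit the label of their nearest cluster.
web_docs = [
    "new telescope images of a distant galaxy",
    "how to bake bread with flour and yeast",
]
labels = [assign_to_cluster(doc, centroids) for doc in web_docs]
```

In this sketch only the small reference collection is clustered, and each web document needs one comparison per centroid, so the assignment cost grows with the number of clusters. That is consistent with the observation that solutions with 10,000 or 50,000 clusters take longer to run.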