Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters



Authors: Richi Nayak, Rachel Mills, Christopher De-Vries, Shlomo Geva
Publication date: 2014
DOI: 10.1145/2663792.2663803

Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters is a scientific work related to Wikipedia quality, published in 2014 and written by Richi Nayak, Rachel Mills, Christopher De-Vries and Shlomo Geva.

Overview

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is practically impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison with a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution rests on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. The authors found that, for a Wikipedia clustering solution containing about 1,000 clusters, the web-scale clustering and labeling process could be performed on one desktop computer in under a couple of days. Solutions with finer-granularity clusters, such as 10,000 or 50,000 clusters, take longer to execute. The results were evaluated using a set of external data.
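The two-stage idea described above (cluster a small reference collection first, then map a much larger collection onto the resulting clusters) can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not name the clustering algorithm or document representation, so a simple cosine-based k-means over bag-of-words vectors is used here as a stand-in. The sample texts, the seed choice, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into an L2-normalised bag-of-words vector (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum the vectors and renormalise to unit length to get a centroid."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def nearest(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def cluster(vectors, seeds, iterations=10):
    """Stand-in k-means: seeds are indices of the initial centroid documents."""
    centroids = [vectors[i] for i in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_vector(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def label(centroid, top=2):
    """Label a cluster with its highest-weighted terms (ties broken alphabetically)."""
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:top]


# Stage 1: cluster the small, well-organised reference collection
# (hypothetical stand-in for Wikipedia articles).
reference_docs = [
    "guitar music band album",
    "music album song band",
    "football match goal team",
    "team goal league football",
]
centroids = cluster([vectorize(d) for d in reference_docs], seeds=[0, 2])

# Stage 2: map the large collection onto the fixed clusters
# (hypothetical stand-in for ClueWeb12 pages). Each document needs only
# one pass against the centroids, with no further re-clustering.
web_docs = ["new album from the band", "late goal wins the match"]
assignments = [nearest(vectorize(d), centroids) for d in web_docs]

for doc, idx in zip(web_docs, assignments):
    print(doc, "->", label(centroids[idx]))
```

The point of the design is that the expensive iterative step runs only on the small reference collection, while the large collection is handled in a single assignment pass, which is consistent with the single-desktop feasibility reported in the overview. Cluster labels here come from the reference clusters' top terms, which is one simple option among many.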

Embed

Wikipedia Quality

Nayak, Richi; Mills, Rachel; De-Vries, Christopher; Geva, Shlomo. (2014). "[[Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters]]". DOI: 10.1145/2663792.2663803.

English Wikipedia

{{cite journal |last1=Nayak |first1=Richi |last2=Mills |first2=Rachel |last3=De-Vries |first3=Christopher |last4=Geva |first4=Shlomo |title=Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters |date=2014 |doi=10.1145/2663792.2663803 |url=https://wikipediaquality.com/wiki/Clustering_and_Labeling_a_Web_Scale_Document_Collection_Using_Wikipedia_Clusters}}

HTML

Nayak, Richi; Mills, Rachel; De-Vries, Christopher; Geva, Shlomo. (2014). &quot;<a href="https://wikipediaquality.com/wiki/Clustering_and_Labeling_a_Web_Scale_Document_Collection_Using_Wikipedia_Clusters">Clustering and Labeling a Web Scale Document Collection Using Wikipedia Clusters</a>&quot;. DOI: 10.1145/2663792.2663803.