Extracting Knowledge from Wikipedia Articles Through Distributed Semantic Analysis

Extracting Knowledge from Wikipedia Articles Through Distributed Semantic Analysis - scientific work related to Wikipedia quality published in 2013, written by Nguyen Trung Hieu, Mario Di Francesco and Antti Ylä-Jääski.

Overview

Computing semantic word similarity and relatedness requires access to a vast semantic space for effective analysis. As a consequence, extracting useful information from such a large amount of data is time-consuming on a single workstation. In this paper, the authors propose a system called Distributed Semantic Analysis (DSA) that integrates a distributed computing approach with semantic analysis. DSA builds a list of concept vectors associated with each word by exploiting the knowledge provided by Wikipedia articles. Based on such lists, DSA calculates the degree of semantic relatedness between two words through the cosine measure. The proposed solution is built on top of the Hadoop MapReduce framework and the Mahout machine learning library. Experimental results show two major improvements over the state of the art, with particular reference to the Explicit Semantic Analysis method. First, the distributed approach significantly reduces the computation time needed to build the concept vectors, thus enabling the use of larger inputs, which is the basis for more accurate results. Second, DSA obtains a very high correlation between the computed relatedness and reference benchmarks derived from human judgements. Moreover, its accuracy is higher than that of solutions reported in the literature over multiple benchmarks.
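The core relatedness computation can be illustrated with a minimal sketch. This is not the authors' distributed implementation: the word-to-concept weights below are hypothetical placeholders, whereas in DSA such vectors would be built at scale from Wikipedia articles using Hadoop MapReduce and Mahout. The sketch only shows the final step, taking the cosine of two concept vectors.

```python
import math

# Toy concept vectors: each word maps to weights over Wikipedia concepts
# (article titles). The words, concepts, and weights here are illustrative
# placeholders, not values from the paper.
concept_vectors = {
    "car":   {"Automobile": 0.91, "Engine": 0.47, "Road": 0.30},
    "train": {"Rail transport": 0.88, "Engine": 0.41, "Road": 0.12},
}

def cosine_relatedness(word_a, word_b, vectors):
    """Semantic relatedness as the cosine of the two words' concept vectors."""
    va, vb = vectors[word_a], vectors[word_b]
    shared = set(va) & set(vb)
    dot = sum(va[c] * vb[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_relatedness("car", "train", concept_vectors))
```

Because each word's concept vector can be computed independently, the expensive vector-building stage parallelizes naturally across a MapReduce cluster, which is what yields the reduced computation time reported above.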