Semantic Similarity Measurements for Multi-Lingual Short Texts Using Wikipedia


Semantic Similarity Measurements for Multi-Lingual Short Texts Using Wikipedia is a scientific work related to Wikipedia quality, published in 2014 and written by Tatsuya Nakamura, Masumi Shirakawa, Takahiro Hara and Shojiro Nishio.

Overview

In this paper, the authors propose two methods for measuring the semantic similarity of multi-lingual short texts by using Wikipedia. In recent years, people around the world have been continuously generating information about their local areas in their own languages on social networking services. Measuring the similarity between these texts is challenging because they are often short and written in various languages. The authors' methods address this problem by incorporating the inter-language links of Wikipedia into extended naive Bayes (ENB), a probabilistic method of semantic similarity measurement for short texts. The proposed methods represent a multi-lingual short text as a vector of English Wikipedia articles (entities). The authors conducted an experiment on clustering tweets written in four languages (English, Spanish, Japanese and Arabic). From the experimental results, the authors confirmed that their methods outperformed cross-lingual explicit semantic analysis (CL-ESA), a method for measuring the similarity between texts written in two different languages. Moreover, the methods were competitive with ENB applied to texts that had been translated into English using Google Translate. The authors' methods thus enable similarity measurements for multi-lingual short texts without the cost of machine translation.
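
The idea of representing texts in different languages as vectors over English Wikipedia entities can be illustrated with a minimal sketch. This is not the authors' ENB-based method: it replaces the probabilistic entity weighting with raw mention counts, and the use of cosine similarity is an assumption. The dictionaries MENTION_TO_ARTICLE and INTERLANGUAGE_LINKS are hypothetical toy stand-ins for data that would be derived from Wikipedia, and the function names are illustrative only.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: surface term -> article title in the same-language Wikipedia.
MENTION_TO_ARTICLE = {
    "en": {"earthquake": "Earthquake", "tokyo": "Tokyo"},
    "es": {"terremoto": "Terremoto", "tokio": "Tokio"},
    "ja": {"地震": "地震", "東京": "東京都"},
}

# Hypothetical inter-language links: (language, local article) -> English article.
INTERLANGUAGE_LINKS = {
    ("es", "Terremoto"): "Earthquake",
    ("es", "Tokio"): "Tokyo",
    ("ja", "地震"): "Earthquake",
    ("ja", "東京都"): "Tokyo",
}


def to_english_entities(tokens, lang):
    """Map pre-tokenized text to a count vector over English Wikipedia articles."""
    vector = Counter()
    for token in tokens:
        article = MENTION_TO_ARTICLE.get(lang, {}).get(token.lower())
        if article is None:
            continue  # token does not mention any known article
        if lang == "en":
            english = article
        else:
            # Follow the inter-language link; skip articles without one.
            english = INTERLANGUAGE_LINKS.get((lang, article))
        if english is not None:
            vector[english] += 1.0
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


spanish = to_english_entities(["terremoto", "en", "tokio"], "es")
japanese = to_english_entities(["東京", "で", "地震"], "ja")
english = to_english_entities(["earthquake"], "en")

print(cosine(spanish, japanese))
print(cosine(spanish, english))
```

Because both the Spanish and the Japanese text are mapped into the same English entity space, they can be compared directly without translating either one. A text that mentions only one of the two shared entities receives a lower score.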