Semantic Similarity Measurements for Multi-Lingual Short Texts Using Wikipedia

Semantic Similarity Measurements for Multi-Lingual Short Texts Using Wikipedia is a scientific work related to Wikipedia quality, published in 2014 and written by Tatsuya Nakamura, Masumi Shirakawa, Takahiro Hara and Shojiro Nishio.

Overview

In this paper, the authors propose two methods for measuring the semantic similarity of multi-lingual short texts using Wikipedia. In recent years, people around the world have been continuously generating information about their local areas in their own languages on social networking services. Measuring the similarity between such texts is challenging because they are often short and written in various languages. The authors' methods address this problem by incorporating the inter-language links of Wikipedia into extended naive Bayes (ENB), a probabilistic method of semantic similarity measurement for short texts. The proposed methods represent a multi-lingual short text as a vector of articles (entities) from the English version of Wikipedia. The authors conducted an experiment on clustering tweets written in four languages (English, Spanish, Japanese and Arabic). The experimental results confirmed that the proposed methods outperformed cross-lingual explicit semantic analysis (CL-ESA), a method for measuring the similarity between texts written in two different languages. Moreover, the methods were competitive with ENB applied to texts that had been translated into English using Google Translate. The proposed methods thus enable similarity measurement for multi-lingual short texts without the cost of machine translation.
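The core idea of representing texts in different languages as vectors over English Wikipedia entities can be illustrated with a minimal sketch. This is not the authors' ENB-based method: it replaces the probabilistic weighting with plain entity counts and cosine similarity, and every name and data item in it (`ANCHORS`, `INTERLANGUAGE_LINKS`, `to_english_entities`, `cosine`) is a hypothetical toy stand-in for data that would really be derived from Wikipedia.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: surface term -> title of the local-language article.
ANCHORS = {
    "en": {"earthquake": "Earthquake", "tokyo": "Tokyo"},
    "es": {"terremoto": "Terremoto", "tokio": "Tokio"},
    "ja": {"地震": "地震", "東京": "東京"},
}

# Hypothetical inter-language links: (language, local title) -> English title.
INTERLANGUAGE_LINKS = {
    ("es", "Terremoto"): "Earthquake",
    ("es", "Tokio"): "Tokyo",
    ("ja", "地震"): "Earthquake",
    ("ja", "東京"): "Tokyo",
}


def to_english_entities(text, lang):
    """Map a short text to a count vector over English Wikipedia entities."""
    vector = Counter()
    lowered = text.lower()
    for term, title in ANCHORS[lang].items():
        # Substring matching, so unsegmented scripts such as Japanese also work.
        count = lowered.count(term)
        if count == 0:
            continue
        # Follow the inter-language link unless the text is already English.
        english = title if lang == "en" else INTERLANGUAGE_LINKS.get((lang, title))
        if english is not None:
            vector[english] += count
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


en_vec = to_english_entities("Earthquake near Tokyo", "en")
ja_vec = to_english_entities("東京で地震", "ja")
es_vec = to_english_entities("Terremoto esta mañana", "es")
similarity_en_ja = cosine(en_vec, ja_vec)
similarity_en_es = cosine(en_vec, es_vec)
```

Because all three texts end up in the same English entity space, they can be compared directly without translating the texts themselves, which is the property the paper exploits.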
