Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes



Authors: Masumi Shirakawa, Kotaro Nakayama, Takahiro Hara, Shojiro Nishio
Publication date: 2015
DOI: 10.1109/TETC.2015.2418716

Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes is a scientific work related to Wikipedia quality, published in 2015 and written by Masumi Shirakawa, Kotaro Nakayama, Takahiro Hara and Shojiro Nishio.

Overview

This paper proposes a Wikipedia-based semantic similarity measurement method intended for real-world noisy short texts. The authors' method is a kind of explicit semantic analysis (ESA): it adds a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector of entities to compute semantic similarity. Adding related entities to a whole text, rather than to a single word or phrase, is a challenging practical problem because it usually consists of several subproblems, e.g., key term extraction from the text, related entity finding for each key term, and weight aggregation of the related entities. The proposed method solves the aggregation problem using extended naive Bayes, a probabilistic weighting mechanism based on Bayes' theorem. The method is especially effective when short texts are semantically noisy, i.e., when they contain terms that are meaningless or misleading for estimating their main topic. Experimental results on Twitter message and Web snippet clustering showed that the method outperformed ESA for noisy short texts. The authors also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected performance while decreasing the vector size, and hence the storage space and the processing time for computing the cosine similarity.
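
The pipeline described in the overview (related entities per key term, weight aggregation, dimension reduction, cosine similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the aggregation step uses a plain naive Bayes combination with a uniform prior over candidate entities and additive smoothing, not the paper's extended naive Bayes, whose exact formula is not given on this page. The table TERM_TO_ENTITIES, its probabilities, the smoothing value, and the function names aggregate_entities, top_k and cosine are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Iterable

# Hypothetical toy table of P(entity | key term); the values are invented.
TERM_TO_ENTITIES: Dict[str, Dict[str, float]] = {
    "apple": {"Apple Inc.": 0.6, "Apple (fruit)": 0.4},
    "iphone": {"IPhone": 0.7, "Apple Inc.": 0.3},
    "pie": {"Pie": 0.8, "Apple (fruit)": 0.2},
}


def aggregate_entities(terms: Iterable[str], smoothing: float = 1e-3) -> Dict[str, float]:
    """Combine per-term entity probabilities into one weighted entity vector.

    Assumption: plain naive Bayes with a uniform entity prior, so the score
    of an entity is proportional to the product of P(entity | term) over the
    known key terms. Smoothing keeps unseen entity/term pairs above zero.
    """
    known = [t for t in terms if t in TERM_TO_ENTITIES]
    candidates = {e for t in known for e in TERM_TO_ENTITIES[t]}
    if not candidates:
        return {}
    # Work in log space to avoid underflow when many terms are multiplied.
    log_scores = {
        e: sum(math.log(TERM_TO_ENTITIES[t].get(e, 0.0) + smoothing) for t in known)
        for e in candidates
    }
    peak = max(log_scores.values())
    unnormalized = {e: math.exp(s - peak) for e, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {e: v / total for e, v in unnormalized.items()}


def top_k(vector: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep only the k highest-weighted entities (dimension reduction)."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse entity vectors."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    tech = aggregate_entities(["apple", "iphone"])
    food = aggregate_entities(["apple", "pie"])
    print(top_k(tech, 2))
    print(top_k(food, 2))
    print(cosine(top_k(tech, 2), top_k(food, 2)))
```

In this toy setting the ambiguous term "apple" is pulled toward different entities depending on the other key term in the text, which is the intuition behind aggregating evidence across terms instead of scoring each term separately. Truncating each vector with top_k mirrors the reported idea of keeping only representative entities to reduce storage and the cost of the cosine computation.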