Leveraging Wikipedia Knowledge to Cross-Language Classify Textual News

Leveraging Wikipedia Knowledge to Cross-Language Classify Textual News is a scientific work related to Wikipedia quality, published in 2017 and written by Marcos Mouriño-García, Roberto Pérez-Rodríguez and Luis E. Anido-Rifón.

Overview

This paper presents a first attempt at leveraging Wikipedia knowledge to represent textual news stories as vectors of Wikipedia concepts, and analyzes the suitability of that representation for building a cross-language classifier of news stories written in Spanish when it is trained only on English ones. The authors describe two approaches. The first (WikiBoC-CLCM) represents the news stories using Wikipedia concepts alone. The second (Hybrid-WikiBoC) combines the WikiBoC-CLCM classifier with the state-of-the-art approach, which is based on the bag-of-words model together with machine translation techniques (BoW-MT). To evaluate the proposed approaches, the authors present a dataset composed of news written in English and Spanish, extracted from several online newspapers and news agencies such as Reuters and Europa Press. The results show that the purely concept-based WikiBoC-CLCM approach offers the highest classification performance, achieving increases of up to 55.07% over the state-of-the-art BoW-MT approach. The Hybrid-WikiBoC approach also outperforms the BoW-MT model, achieving performance increases of up to 2.34%. The authors conclude that leveraging Wikipedia knowledge is of great advantage in the cross-language classification of textual news stories.
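The idea of training on English and classifying Spanish through a shared concept space can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `CONCEPTS` lookup table is a hypothetical stand-in for a real Wikipedia concept annotator, the concept IDs are invented, and the nearest-centroid classifier with cosine similarity is a simple substitute chosen here, not the classifier used in the paper.

```python
from collections import Counter
import math

# Hypothetical toy annotator: maps surface words in either language to
# language-independent concept IDs (stand-ins for Wikipedia articles).
CONCEPTS = {
    "football": "C:Football", "fútbol": "C:Football",
    "goal": "C:Goal", "gol": "C:Goal",
    "election": "C:Election", "elecciones": "C:Election",
    "parliament": "C:Parliament", "parlamento": "C:Parliament",
}

def bag_of_concepts(text):
    """Represent a text as a count vector over concept IDs."""
    tokens = text.lower().split()
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_docs):
    """Sum the concept vectors of each class into one centroid per label."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(bag_of_concepts(text))
    return centroids

def classify(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vec = bag_of_concepts(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Train only on English stories, then classify a Spanish one.
english_train = [
    ("the football match ended with a late goal", "sports"),
    ("parliament called an early election", "politics"),
]
model = train(english_train)
prediction = classify(model, "el parlamento convoca elecciones")
```

Because both languages map to the same concept IDs, the Spanish story is compared directly against centroids built from English text, with no machine translation step in between.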
