Wikipedia-Based Hybrid Document Representation for Textual News Classification

Authors	Marcos Mouriño García, Roberto Pérez Rodríguez, Manuel Vilares Ferro, Luis Anido Rifón
Publication date	2016
DOI	10.1007/s00500-018-3101-5

Wikipedia-Based Hybrid Document Representation for Textual News Classification is a scientific work related to Wikipedia quality, published in 2016 and written by Marcos Mouriño García, Roberto Pérez Rodríguez, Manuel Vilares Ferro and Luis Anido Rifón.

Overview

Automatic classification of news articles is a relevant problem because of the large volume of news generated every day; classifying these articles lets users access information of interest quickly and effectively. On the one hand, traditional classification systems represent documents as a bag-of-words (BoW), which is oblivious to two problems of language: synonymy and polysemy. On the other hand, several authors propose a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy. This paper shows the benefits of using a hybrid representation of documents for the classification of textual news, leveraging the advantages of both approaches: the traditional BoW representation and a BoC approach based on Wikipedia knowledge. To evaluate the proposal, the authors used three of the most relevant algorithms in the state of the art (SVM, Random Forest and Naive Bayes) and two corpora: the Reuters-21578 corpus and a purpose-built corpus, Reuters-27000. The results show that the performance of the classification algorithm depends on the dataset used, and also demonstrate that enriching the BoW representation with the concepts extracted from documents by a semantic annotator adds useful information to the classifier and improves its performance. The experiments show performance increases of up to 4.12% when classifying the Reuters-21578 corpus with the SVM algorithm and up to 49.35% when classifying the Reuters-27000 corpus with the Random Forest algorithm.
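
As a rough illustration of the hybrid idea described above, the following Python sketch combines word counts and concept counts into a single feature set. It is not the authors' implementation: the lookup table `TOY_CONCEPTS` and the functions `tokenize`, `bag_of_words`, `bag_of_concepts` and `hybrid_representation` are hypothetical names introduced here, and the table stands in for a real Wikipedia-based semantic annotator, which would also disambiguate terms by context.

```python
import re
from collections import Counter

# Hypothetical toy annotator: maps surface forms to Wikipedia-style concept
# labels. This is an assumption for illustration only; a real semantic
# annotator would use context to resolve polysemy, which a table cannot do.
TOY_CONCEPTS = {
    "car": "Automobile",
    "automobile": "Automobile",
    "oil": "Petroleum",
    "crude": "Petroleum",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs (BoW)."""
    return Counter(tokenize(text))


def bag_of_concepts(text):
    """Count how often each known concept is mentioned (BoC)."""
    return Counter(TOY_CONCEPTS[t] for t in tokenize(text) if t in TOY_CONCEPTS)


def hybrid_representation(text):
    """Merge BoW and BoC counts, prefixing keys to keep the two spaces apart."""
    features = Counter()
    for word, count in bag_of_words(text).items():
        features["w:" + word] += count
    for concept, count in bag_of_concepts(text).items():
        features["c:" + concept] += count
    return features


doc_a = hybrid_representation("Car sales rise")
doc_b = hybrid_representation("Automobile sales rise")
shared = set(doc_a) & set(doc_b)
```

In this toy case the two documents use different words ("car" and "automobile"), so their BoW features differ on that term, but both receive the same concept feature. That shared feature is the kind of extra signal the hybrid representation is meant to give a classifier when synonyms are present.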
