Difference between revisions of "A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords"

(Overview: A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords)
 
(+ wikilinks)

Revision as of 13:21, 19 October 2019

A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords is a scientific work related to Wikipedia quality, published in 2014 and written by Robert P. Biuk-Aghai and Ka Kit Ng.

Overview

The pace of knowledge creation, for example in academic research, has accelerated rapidly in recent years, producing an ever-growing number of research publications. This makes it difficult to keep abreast of new developments or to know which new publications are relevant to a given research area. The authors have developed a method for analysing and automatically classifying publications. The method makes use of the Wikipedia category hierarchy and the content of the Wikipedia articles associated with Wikipedia categories. The authors first pre-process and simplify the Wikipedia category hierarchy, resulting in a rooted directed graph. Wikipedia articles are then analysed, and a set of keywords per Wikipedia category is extracted using a modified tf-idf (term frequency-inverse document frequency) model proposed in the paper. To classify a given input document, tf-idf weights are used to extract relevant keywords from the document, which are then matched to the keywords previously extracted from Wikipedia. The closest matching top-level categories are identified from among all categories containing the document's keywords. A cosine similarity metric is then applied to select the closest matching sub-category, recursing down the category hierarchy until the best matching categories are identified. The final result is a set of categories matching the input document, each with a matching percentage. This result can be used to identify new documents that are relevant to a specific research area, or to classify a whole set of documents into topic areas with sub-topics, main keywords, and associated weights. The authors present an experimental study using data from English Wikipedia.
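
The classification pipeline described above can be illustrated with a small sketch. It is not the authors' implementation: it uses the standard tf-idf formula rather than the modified model proposed in the paper (whose details are not given here), and the category tree, the per-category token lists, and the names `idf_table`, `tfidf`, `cosine` and `classify` are all hypothetical, chosen only for illustration. The sketch assumes each category already has a bag of tokens drawn from its articles, weights them with tf-idf, and then walks down the hierarchy, at each level picking the sub-category whose keyword vector has the highest cosine similarity to the document.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (rooted directed graph) and per-category tokens.
CHILDREN = {"Root": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
CATEGORY_TEXT = {
    "Science": "experiment theory data quantum cell energy gene".split(),
    "Arts": "painting music sculpture theatre canvas".split(),
    "Physics": "quantum energy particle field theory".split(),
    "Biology": "cell gene protein organism species".split(),
}


def idf_table(docs):
    """Standard inverse document frequency (plus 1) over a dict of token lists."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    return {t: math.log(n / d) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    """Sparse tf-idf vector; tokens unknown to the idf table are dropped."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc_vec, node, children, keywords):
    """Recurse down the hierarchy, keeping the best-matching child per level.

    Returns a list of (category, matching percentage) pairs from the top
    level downwards. Descent stops at a leaf or when nothing matches.
    """
    kids = children.get(node, [])
    if not kids:
        return []
    score, best = max((cosine(doc_vec, keywords[c]), c) for c in kids)
    if score == 0.0:
        return []
    rest = classify(doc_vec, best, children, keywords)
    return [(best, round(score * 100, 1))] + rest


IDF = idf_table(CATEGORY_TEXT)
KEYWORDS = {cat: tfidf(tokens, IDF) for cat, tokens in CATEGORY_TEXT.items()}

document = "quantum particle energy experiment quantum field".split()
doc_vec = tfidf(document, IDF)

for category, percent in classify(doc_vec, "Root", CHILDREN, KEYWORDS):
    print(f"{category}: {percent}%")
```

In this toy setup the document shares no keywords with the hypothetical Arts or Biology categories, so the descent follows Science and then Physics. A fuller version would presumably keep several candidate categories per level rather than a single best match, since the work reports a set of matching categories.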