A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords


Authors: Robert P. Biuk-Aghai, Ka Kit Ng
Publication date: 2014
DOI: 10.1109/ICODSE.2014.7062484

A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords is a scientific work related to Wikipedia quality, published in 2014 and written by Robert P. Biuk-Aghai and Ka Kit Ng.

Overview

The pace of knowledge creation, such as in academic research, has accelerated rapidly in recent years, resulting in an ever-growing number of research publications. This makes it difficult to keep abreast of new developments or to know which new publications are relevant to a given research area. The authors have developed a method for analysing and automatically classifying publications. Their method makes use of the Wikipedia category hierarchy and the content of the Wikipedia articles associated with each category. First, the authors pre-process and simplify the Wikipedia category hierarchy, resulting in a rooted directed graph. Wikipedia articles are then analysed, and a set of keywords per Wikipedia category is extracted using a modified tf-idf (term frequency-inverse document frequency) model proposed in the paper. To classify a given input document, tf-idf weights are used to extract relevant keywords from the document, which are then matched to the keywords previously extracted from Wikipedia. The closest matching top-level categories are identified from all categories containing the document's keywords. A cosine similarity metric is then applied to select the closest matching sub-category, recursing down the category hierarchy until the best matching categories are identified. The final result is a set of categories matching the input document, each with a matching percentage. This result can be used to identify new documents that are relevant to a specific research area, or to classify a whole set of documents into different topic areas, with sub-topics, main keywords, and associated weights. The authors present an experimental study using data from English Wikipedia.
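
The pipeline described above (per-category keyword weights, cosine matching, descent through the category graph) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes standard tf-idf in place of the paper's modified model, whose details are not given here. It also assumes plain term counts for the input document, cosine similarity at every level including the top one, and the cosine score reported as the matching percentage. The category names, token lists, and the functions `tf_idf`, `cosine`, and `classify` are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Standard tf-idf (a stand-in for the paper's modified model).

    docs maps a category name to the list of tokens from its articles.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: one count per category
    return {
        name: {t: (c / len(tokens)) * math.log(n / df[t])
               for t, c in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc_vec, root, children, keywords):
    """Greedy descent: at each level follow the best-matching child."""
    path, node = [], root
    while children.get(node):
        best = max(children[node], key=lambda c: cosine(doc_vec, keywords[c]))
        score = cosine(doc_vec, keywords[best])
        if score == 0.0:  # no keyword overlap below this node, stop here
            break
        path.append((best, score))
        node = best
    return path


# Hypothetical toy hierarchy (a rooted directed graph) and category texts.
children = {"Root": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
category_tokens = {
    "Science": "physics biology experiment theory cell energy".split(),
    "Arts": "painting music sculpture melody canvas".split(),
    "Physics": "energy quantum particle theory".split(),
    "Biology": "cell gene protein organism".split(),
}
keywords = tf_idf(category_tokens)

# Assumption: the input document is represented by raw term counts.
doc_vec = Counter("quantum particle energy energy".split())
path = classify(doc_vec, "Root", children, keywords)

for category, score in path:
    print(f"{category}: {score:.0%} match")
```

In this toy run the document descends from the root through Science to Physics, with one score per level. A fuller version would keep several candidate branches instead of a single greedy path, since the overview describes a result containing a set of matching categories.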

Embed

Wikipedia Quality

Biuk-Aghai, Robert P.; Ng, Ka Kit. (2014). "[[A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords]]". DOI: 10.1109/ICODSE.2014.7062484.

English Wikipedia

{{cite journal |last1=Biuk-Aghai |first1=Robert P. |last2=Ng |first2=Ka Kit |title=A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords |date=2014 |doi=10.1109/ICODSE.2014.7062484 |url=https://wikipediaquality.com/wiki/A_Method_for_Automated_Document_Classification_Using_Wikipedia-Derived_Weighted_Keywords}}

HTML

Biuk-Aghai, Robert P.; Ng, Ka Kit. (2014). &quot;<a href="https://wikipediaquality.com/wiki/A_Method_for_Automated_Document_Classification_Using_Wikipedia-Derived_Weighted_Keywords">A Method for Automated Document Classification Using Wikipedia-Derived Weighted Keywords</a>&quot;. DOI: 10.1109/ICODSE.2014.7062484.