Difference between revisions of "Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier"

From Wikipedia Quality
(Creating a new page - Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier)
 
(Links)
Line 1: Line 1:
'''Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier''' - scientific work related to Wikipedia quality published in 2016, written by Muhammad Shulhan and Dwi H. Widyantoro.
+
'''Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier''' - scientific work related to [[Wikipedia quality]] published in 2016, written by [[Muhammad Shulhan]] and [[Dwi H. Widyantoro]].
  
 
== Overview ==
 
== Overview ==
Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles in Wikipedia grow rapidly and be corrected later, but it also makes them prone to vandalism in the form of invalid information, deletion, ads, or meaningless content. This paper proposes a framework for detecting vandalism on English Wikipedia using machine learning: a Cascaded Random Forest (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after the dataset has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with Random Forest (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) as the resampling method. Results for classifiers tested on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increased the true-positive rate (TPR) more than SMOTE did for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result, with a TPR of 0.9904. In terms of training time, CRF was 1.6 times faster than RF on the resampled dataset.
+
Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles in [[Wikipedia]] grow rapidly and be corrected later, but it also makes them prone to vandalism in the form of invalid information, deletion, ads, or meaningless content. This paper proposes a framework for detecting vandalism on [[English Wikipedia]] using machine learning: a Cascaded [[Random Forest]] (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after the dataset has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with [[Random Forest]] (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) as the resampling method. Results for classifiers tested on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increased the true-positive rate (TPR) more than SMOTE did for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result, with a TPR of 0.9904. In terms of training time, CRF was 1.6 times faster than RF on the resampled dataset.

Revision as of 06:42, 14 June 2019

Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier - scientific work related to Wikipedia quality published in 2016, written by Muhammad Shulhan and Dwi H. Widyantoro.

Overview

Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles in Wikipedia grow rapidly and be corrected later, but it also makes them prone to vandalism in the form of invalid information, deletion, ads, or meaningless content. This paper proposes a framework for detecting vandalism on English Wikipedia using machine learning: a Cascaded Random Forest (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after the dataset has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with Random Forest (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) as the resampling method. Results for classifiers tested on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increased the true-positive rate (TPR) more than SMOTE did for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result, with a TPR of 0.9904. In terms of training time, CRF was 1.6 times faster than RF on the resampled dataset.
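
To make the two ideas in the overview more concrete, the following is a minimal sketch of plain SMOTE-style oversampling together with a true-positive rate calculation. It is not the authors' implementation and it does not implement LNSMOTE or the cascaded classifier. The function names `smote` and `true_positive_rate`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Illustrative plain SMOTE only (hypothetical helper, not LNSMOTE).
    `minority` is a list of equal-length numeric feature lists (at least 2).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def true_positive_rate(y_true, y_pred, positive=1):
    """TPR = TP / (TP + FN): the share of actual positives that were detected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Toy minority class (e.g. vandalism edits) with two made-up features
    vandalism = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    extra = smote(vandalism, n_synthetic=6, k=2)
    print(len(vandalism) + len(extra), "minority samples after oversampling")

    # Label 1 marks vandalism; three true positives, one of them missed
    print(true_positive_rate([1, 1, 0, 1], [1, 0, 0, 1]))
```

In this sketch every synthetic point lies on a line segment between two existing minority samples, which is the core idea of SMOTE. A high TPR such as the 0.9904 reported above means that nearly all vandalism edits in the test set were flagged; TPR alone says nothing about how many regular edits were flagged incorrectly.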