Difference between revisions of "Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier"

From Wikipedia Quality
{{Infobox work
| title = Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier
| date = 2016
| authors = [[Muhammad Shulhan]]<br />[[Dwi H. Widyantoro]]
| doi = 10.1109/ICAICTA.2016.7803106
| link = http://ieeexplore.ieee.org/document/7803106/
}}
 
'''Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier''' is a scientific work related to [[Wikipedia quality]], published in 2016 and written by [[Muhammad Shulhan]] and [[Dwi H. Widyantoro]].
 
 
== Overview ==
 
Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles in [[Wikipedia]] grow rapidly and be corrected later, but it also makes them prone to vandalism in the form of invalid information, deletion, ads, or meaningless content. This paper proposes a framework for detecting vandalism on [[English Wikipedia]] using machine learning: a Cascaded [[Random Forest]] (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after it has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with [[Random Forest]] (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) for resampling. Tests on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increased the true-positive rate (TPR) more than SMOTE did for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result of all configurations, with a TPR of 0.9904. In training time, CRF was 1.6 times faster than RF on the resampled dataset.
 

Revision as of 11:42, 20 June 2019


Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier
Authors
Muhammad Shulhan
Dwi H. Widyantoro
Publication date
2016
DOI
10.1109/ICAICTA.2016.7803106
Links
Original

Detecting Vandalism on English Wikipedia Using Lnsmote Resampling and Cascaded Random Forest Classifier is a scientific work related to Wikipedia quality, published in 2016 and written by Muhammad Shulhan and Dwi H. Widyantoro.

Overview

Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles in Wikipedia grow rapidly and be corrected later, but it also makes them prone to vandalism in the form of invalid information, deletion, ads, or meaningless content. This paper proposes a framework for detecting vandalism on English Wikipedia using machine learning: a Cascaded Random Forest (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after it has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with Random Forest (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) for resampling. Tests on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increased the true-positive rate (TPR) more than SMOTE did for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result of all configurations, with a TPR of 0.9904. In training time, CRF was 1.6 times faster than RF on the resampled dataset.
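
The overview relies on two ideas that a short example can make concrete: oversampling the minority (vandalism) class with synthetic samples, and measuring the true-positive rate. The following is a minimal sketch in plain Python, not the authors' implementation. It shows plain SMOTE only, since the summary does not describe how LNSMOTE selects neighbours. The function names, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import random


def true_positive_rate(y_true, y_pred, positive=1):
    """TPR = TP / (TP + FN), computed over the positive (vandalism) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def smote(minority, n_synthetic, k=2, seed=0):
    """Plain SMOTE sketch: each synthetic point lies on the segment between a
    random minority sample and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two feature tuples of equal length."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical toy data: three minority-class feature vectors.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_synthetic=5)
```

Because every synthetic point is an interpolation between two existing minority samples, it stays inside the region those samples already span. The resampled set would then be passed to a classifier such as a random forest, and `true_positive_rate` applied to its predictions on held-out edits.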