Difference between revisions of "Detecting Vandalism on Wikipedia Across Multiple Languages"

Latest revision as of 02:10, 24 May 2020

Detecting Vandalism on Wikipedia Across Multiple Languages is a scientific work related to Wikipedia quality, published in 2015 and written by Khoi-Nguyen Dao Tran.

Overview

Vandalism, the malicious modification or editing of articles, is a serious problem for free and open-access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller, manually inspected sets of revisions used in research show that vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research applying machine learning techniques to vandalism detection has shown significant improvements in detection rates across a wider variety of vandalism. However, this research often focuses only on the English Wikipedia, which led the author to develop a novel research area: cross-language vandalism detection (CLVD). CLVD addresses vandalism detection across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism in languages that may have too few identified cases to build learning models of their own. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to multiple languages, and (2) extending vandalism detection models trained in one language to other languages without significant loss in detection rate. Other important challenges of vandalism detection are (3) achieving a high detection rate across a variety of known vandalism types, (4) scaling to the number of revisions on Wikipedia, and (5) incorporating and generating multiple types of data that characterise vandalism.
In this thesis, the author presents research into CLVD on Wikipedia and identifies gaps and problems in existing vandalism detection techniques. The thesis begins by introducing the problem of vandalism on Wikipedia with motivating examples, followed by a review of the literature. From this review, the author identifies and addresses the following research gaps. First, the author proposes techniques for summarising the user activity of articles and comparing the knowledge coverage of articles across languages. Second, the author investigates CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, the author proposes new text features that are more suitable for CLVD than text features from the literature. Fourth, the author proposes a novel context-aware vandalism detection technique for sneaky types of vandalism that may not be detectable through feature construction. Finally, to show that techniques for detecting malicious activities are not limited to Wikipedia, the author applies the feature sets to detecting malicious attachments and URLs in spam emails. Overall, the ultimate aim is to build the next generation of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia.
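The idea of training a model on language-independent revision metadata in one language and applying it to another can be illustrated with a small sketch. This is not the method used in the thesis: the `Revision` fields, the three features, the nearest-centroid classifier and all sample revisions are assumptions invented here purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Revision:
    """Hypothetical revision metadata record; field names are assumptions."""
    lang: str
    is_anonymous: bool
    size_before: int  # article size in bytes before the edit
    size_after: int   # article size in bytes after the edit
    comment: str      # edit summary; only whether it is empty is used


def features(rev: Revision) -> List[float]:
    """Map a revision to features that do not depend on the article language."""
    relative_change = (rev.size_after - rev.size_before) / max(rev.size_before, 1)
    return [
        1.0 if rev.is_anonymous else 0.0,
        relative_change,
        1.0 if rev.comment.strip() == "" else 0.0,
    ]


def centroid(rows: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(labelled: List[Tuple[Revision, bool]]) -> Tuple[List[float], List[float]]:
    """Return (vandalism centroid, regular centroid) from labelled revisions."""
    vandal = [features(r) for r, is_vandalism in labelled if is_vandalism]
    regular = [features(r) for r, is_vandalism in labelled if not is_vandalism]
    return centroid(vandal), centroid(regular)


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model: Tuple[List[float], List[float]], rev: Revision) -> bool:
    """Flag a revision as vandalism if it is closer to the vandalism centroid."""
    vandal_centroid, regular_centroid = model
    f = features(rev)
    return sq_dist(f, vandal_centroid) < sq_dist(f, regular_centroid)


# Invented toy training data for one language ("en"); not real revisions.
EN_TRAIN = [
    (Revision("en", True, 10000, 500, ""), True),
    (Revision("en", True, 8000, 8300, ""), True),
    (Revision("en", False, 10000, 10200, "fix typo"), False),
    (Revision("en", False, 5000, 5600, "add reference"), False),
]

model = train(EN_TRAIN)

# Invented revisions from another language ("de"), classified by the "en" model.
de_blanking = Revision("de", True, 12000, 300, "")
de_minor_fix = Revision("de", False, 7000, 7100, "Tippfehler korrigiert")
```

Because none of the features inspect the article text, the model trained on the English examples can be applied to the German ones unchanged, for example `predict(model, de_blanking)`. A realistic system would need far more data, richer features and a stronger classifier; the sketch only shows the shape of the cross-language transfer.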