Learning to Detect Vandalism in Social Content Systems: a Study on Wikipedia

From Wikipedia Quality
Revision as of 09:46, 5 July 2019 by Paisley (talk | contribs) (Creating a new page - Learning to Detect Vandalism in Social Content Systems: a Study on Wikipedia)

Learning to Detect Vandalism in Social Content Systems: a Study on Wikipedia is a scientific work related to Wikipedia quality, published in 2013 and written by Sara Javanmardi, David W. McDonald, Rich Caruana, Sholeh Forouzan and Cristina Videira Lopes.

Overview

A challenge facing user-generated content systems is vandalism, i.e. edits that damage content quality. The high visibility of social networks and the ease of access to them make them popular targets for vandals. Detecting and removing vandalism is critical for these user-generated content systems. Because vandalism can take many forms, many different kinds of features are potentially useful for detecting it. The complex nature of vandalism and the large number of potential features make vandalism detection difficult and time-consuming for human editors. Machine learning techniques hold promise for developing accurate, tunable, and maintainable models that can be incorporated into vandalism detection tools. The authors describe a method for training vandalism detection classifiers that are more accurate on the PAN 2010 corpus than previously developed ones. Because of the high turnaround in social network systems, it is important for vandalism detection tools to run in real time. To this end, the authors use feature selection to find the minimal set of features consistent with high accuracy. In addition, because some features are more costly to compute than others, the authors use cost-sensitive feature selection to reduce the total computational cost of executing the models. In addition to the features previously used for spam detection, the authors introduce new features based on user action histories. The user history features contribute significantly to classifier performance. The approach is general and can easily be applied to other user-generated content systems.
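The idea of cost-sensitive feature selection can be illustrated with a small sketch. The code below is not the authors' procedure; it is a generic greedy forward selection that, at each step, adds the feature with the largest score gain per unit of computation cost and stops when no remaining feature improves the score by at least `min_gain`. The feature names, costs, score contributions, and the `evaluate` callback are all hypothetical placeholders; in practice `evaluate` would train and validate a classifier on the chosen subset.

```python
from typing import Callable, Dict, FrozenSet, List


def cost_sensitive_forward_selection(
    costs: Dict[str, float],
    evaluate: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.005,
) -> List[str]:
    """Greedily add the feature with the best score gain per unit cost."""
    selected: List[str] = []
    remaining = set(costs)
    best_score = evaluate(frozenset())
    while remaining:
        best_feature, best_ratio, best_new_score = None, 0.0, best_score
        for name in sorted(remaining):  # sorted for deterministic ties
            score = evaluate(frozenset(selected + [name]))
            gain = score - best_score
            if gain < min_gain:
                continue  # not worth adding, whatever it costs
            ratio = gain / costs[name]
            if ratio > best_ratio:
                best_feature, best_ratio, best_new_score = name, ratio, score
        if best_feature is None:
            break  # no remaining feature helps enough
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = best_new_score
    return selected


# Hypothetical per-feature computation costs (arbitrary units).
costs = {
    "user_revert_ratio": 1.0,
    "profanity_count": 2.0,
    "text_diff_entropy": 8.0,
    "upper_case_ratio": 1.0,
}

# Hypothetical score contributions standing in for a real validation metric.
toy_contribution = {
    "user_revert_ratio": 0.20,
    "profanity_count": 0.10,
    "text_diff_entropy": 0.12,
    "upper_case_ratio": 0.001,
}


def evaluate(subset: FrozenSet[str]) -> float:
    # Toy stand-in: a real evaluator would cross-validate a classifier.
    return 0.5 + sum(toy_contribution[name] for name in subset)


selected = cost_sensitive_forward_selection(costs, evaluate)
print(selected)
```

With these toy numbers, the cheap and informative features are picked first, the expensive one is added last, and the feature with a negligible gain is left out. A real system might instead impose a total cost budget or fold the cost into the objective; the gain-per-cost ratio used here is only one simple choice.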