Toktrack: a Complete Token Provenance and Change Tracking Dataset for the English Wikipedia



Authors: Fabian Flöck, Kenan Erdogan, Maribel Acosta
Publication date: 2017

Toktrack: a Complete Token Provenance and Change Tracking Dataset for the English Wikipedia is a scientific work related to Wikipedia quality, published in 2017 and written by Fabian Flöck, Kenan Erdogan and Maribel Acosta.

Overview

The authors present a dataset that contains every instance of all tokens (roughly, words) ever written in undeleted, non-redirect English Wikipedia articles up to October 2016, 13,545,349,787 instances in total. Each token is annotated with (i) the article revision in which it was originally created, and (ii) lists of all the revisions in which the token was deleted and (potentially) re-added and re-deleted from its article, which enables complete and straightforward tracking of its history. An average potential user would find this data exceedingly hard to create, because (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. By adapting a state-of-the-art algorithm, the authors have produced a dataset that allows a range of analyses and metrics, both those already popular in research and new ones, to be generated at the scale of the complete Wikipedia. The dataset ensures quality and lets researchers forgo the expensive text-comparison computation that has so far hindered scalable usage. The authors show how this data enables, at token level, the computation of provenance, the measurement of content survival over time, very detailed conflict metrics, and the analysis of fine-grained editor interactions such as partial reverts and re-additions, gaining several novel insights in the process.
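
To make the per-token annotation concrete, the following is a minimal Python sketch of how such records could be used. It is not the authors' code or the dataset's actual file format: the field names (`origin_rev`, `out_revs`, `in_revs`), the helper functions, the revision IDs, and the editor names are all hypothetical and chosen only for illustration. The sketch assumes each token stores its origin revision plus the lists of revisions in which it was deleted and re-added, as the overview describes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TokenRecord:
    """One token instance (hypothetical layout, not the real dataset schema)."""
    text: str
    origin_rev: int                                     # revision that created the token
    out_revs: List[int] = field(default_factory=list)  # revisions that deleted it
    in_revs: List[int] = field(default_factory=list)   # revisions that re-added it


def is_present(tok: TokenRecord) -> bool:
    # A token is in the latest text if every deletion was followed by a re-addition.
    return len(tok.out_revs) == len(tok.in_revs)


def provenance(tokens: List[TokenRecord], rev_to_editor: Dict[int, str]) -> Counter:
    # Count surviving tokens per original author.
    return Counter(rev_to_editor[t.origin_rev] for t in tokens if is_present(t))


def survived_revisions(tok: TokenRecord, rev_order: List[int]) -> int:
    # Number of revisions between creation and first deletion
    # (or the end of the history if the token was never deleted).
    start = rev_order.index(tok.origin_rev)
    end = rev_order.index(tok.out_revs[0]) if tok.out_revs else len(rev_order)
    return end - start - 1


def deletion_pairs(tokens: List[TokenRecord],
                   rev_to_editor: Dict[int, str]) -> Counter:
    # Count (deleting editor, original author) pairs, a simple interaction signal.
    pairs: List[Tuple[str, str]] = []
    for t in tokens:
        author = rev_to_editor[t.origin_rev]
        pairs.extend((rev_to_editor[r], author) for r in t.out_revs)
    return Counter(pairs)


# Made-up article history: four revisions in chronological order.
rev_order = [100, 101, 102, 103]
rev_to_editor = {100: "alice", 101: "bob", 102: "carol", 103: "bob"}
tokens = [
    TokenRecord("wiki", 100),
    TokenRecord("is", 100, out_revs=[101], in_revs=[102]),  # deleted, then restored
    TokenRecord("great", 101, out_revs=[103]),              # deleted, still absent
    TokenRecord("free", 102),
]

authorship = provenance(tokens, rev_to_editor)
interactions = deletion_pairs(tokens, rev_to_editor)
```

In this toy history, `authorship` credits two surviving tokens to alice and one to carol, while `interactions` records that bob deleted one token written by alice and one of his own. No text comparison between revisions is needed, because every answer is read off the per-token lists.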
