The Wikipedia Corpus

From Wikipedia Quality
Revision as of 23:11, 19 May 2019 by Expert (talk | contribs) (New work - The Wikipedia Corpus)


Authors: Jeff Pasternack, Dan Roth
Publication date: 2008
Links: Original

The Wikipedia Corpus is a scientific work related to Wikipedia quality, published in 2008 and written by Jeff Pasternack and Dan Roth.

Overview

Wikipedia, the popular online encyclopedia, has in just six years grown from an adjunct to the now-defunct Nupedia to over 31 million pages and 429 million revisions in 256 languages, and has spawned sister projects such as Wiktionary and Wikisource. Available under the GNU Free Documentation License, it is an extraordinarily large corpus with broad scope and constant updates. Its articles are largely consistent in structure and organized into category hierarchies. However, the wiki method of collaborative editing creates challenges that must be addressed. Wikipedia’s accuracy is frequently questioned, and systemic bias means that quality and coverage are uneven. Even the variety of English dialects juxtaposed in its text can trip up the unwary with differences in semantics, diction and spelling. This paper examines Wikipedia from a research perspective, providing basic background knowledge and an understanding of its strengths and weaknesses. The authors also address the technical challenge posed by the sheer volume of text made available (1.04 TB for the English version), using a simple, easily implemented dictionary compression algorithm that permits time-efficient random access to the data with a twenty-eight-fold reduction in size.
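The overview does not describe the authors' algorithm in detail, so the following is only a rough sketch of the general idea of dictionary compression with random access, not the method from the paper. It assumes a swapped-in technique: each record is compressed on its own with Python's standard zlib module against one shared preset dictionary, and an offset index locates each record. The names build_store, read_record and shared_dict are hypothetical and chosen for illustration.

```python
import zlib


def build_store(records, shared_dict):
    """Compress each record independently against one shared dictionary.

    Assumption: this is NOT the paper's algorithm, only an illustration.
    Returns the concatenated compressed bytes and an (offset, length) index.
    """
    blob = bytearray()
    index = []  # position of each compressed record inside blob
    for text in records:
        comp = zlib.compressobj(level=9, zdict=shared_dict)
        chunk = comp.compress(text.encode("utf-8")) + comp.flush()
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_record(blob, index, i, shared_dict):
    """Decompress only record i; no other record has to be touched."""
    offset, length = index[i]
    decomp = zlib.decompressobj(zdict=shared_dict)
    data = decomp.decompress(blob[offset:offset + length]) + decomp.flush()
    return data.decode("utf-8")


# Hypothetical sample data: a dictionary of byte strings expected to recur.
shared_dict = b"Wikipedia is a free online encyclopedia revision article"
records = [
    "Wikipedia is a free online encyclopedia.",
    "Each article has a revision history.",
    "Wiktionary is a sister project of Wikipedia.",
]
blob, index = build_store(records, shared_dict)
third = read_record(blob, index, 2, shared_dict)
```

Because every record is compressed separately, fetching one revision costs a single seek plus one small decompression, which is the property the overview calls time-efficient random access. The compression ratio actually achieved depends entirely on how well the shared dictionary matches the text.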
