Identifying Duplicate and Contradictory Information in Wikipedia

Authors: Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy Lin
Publication date: 2015
ISSN: 15525996
ISBN: 978-145033594-2
DOI: 10.1145/2756406.2756947

Identifying Duplicate and Contradictory Information in Wikipedia is a scientific work about Wikipedia quality, published in 2015 and written by Sarah Weissman, Samet Ayhan, Joshua Bradley and Jimmy Lin.

Overview

In this paper, the authors identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. They use a MapReduce implementation of minhash to find sentences with high Jaccard similarity, then run a second pass that groups those sentences into clusters. Based on manual examination, the authors found that these clusters fall into six types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
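
The general idea of minhash followed by clustering can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: the word 3-gram shingles, the 128 seeded MD5 hash functions, the 0.5 similarity threshold, the all-pairs comparison, and every function name below are assumptions chosen only for illustration.

```python
import hashlib
from itertools import combinations


def shingles(sentence, n=3):
    """Return the set of word n-grams (shingles) in a sentence."""
    words = sentence.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def cluster(sentences, threshold=0.5, num_hashes=128):
    """Group sentence indices whose estimated similarity meets the threshold."""
    sigs = [minhash_signature(shingles(s), num_hashes) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: acceptable for a toy example, not for a full corpus.
    for i, j in combinations(range(len(sentences)), 2):
        if estimate_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster(list_of_sentences)` returns lists of sentence indices, one list per cluster. Comparing every pair of sentences is quadratic in the number of sentences, so this form only suits small inputs; at Wikipedia scale, some way of distributing or limiting the comparisons is needed, which is the role the MapReduce implementation plays in the paper.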

Embed

Wikipedia Quality

Weissman, Sarah; Ayhan, Samet; Bradley, Joshua; Lin, Jimmy. (2015). "[[Identifying Duplicate and Contradictory Information in Wikipedia]]". Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61. ISBN: 978-145033594-2. ISSN: 15525996. DOI: 10.1145/2756406.2756947.

English Wikipedia

{{cite journal |last1=Weissman |first1=Sarah |last2=Ayhan |first2=Samet |last3=Bradley |first3=Joshua |last4=Lin |first4=Jimmy |title=Identifying Duplicate and Contradictory Information in Wikipedia |date=2015 |isbn=978-145033594-2 |issn=15525996 |doi=10.1145/2756406.2756947 |url=https://wikipediaquality.com/wiki/Identifying_Duplicate_and_Contradictory_Information_in_Wikipedia |journal=Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61}}

HTML

Weissman, Sarah; Ayhan, Samet; Bradley, Joshua; Lin, Jimmy. (2015). &quot;<a href="https://wikipediaquality.com/wiki/Identifying_Duplicate_and_Contradictory_Information_in_Wikipedia">Identifying Duplicate and Contradictory Information in Wikipedia</a>&quot;. Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61. ISBN: 978-145033594-2. ISSN: 15525996. DOI: 10.1145/2756406.2756947.