Identifying Duplicate and Contradictory Information in Wikipedia
Authors | Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy Lin |
---|---|
Publication date | 2015 |
ISSN | 15525996 |
ISBN | 978-145033594-2 |
DOI | 10.1145/2756406.2756947 |
Identifying Duplicate and Contradictory Information in Wikipedia is a scientific work about Wikipedia quality, published in 2015 and written by Sarah Weissman, Samet Ayhan, Joshua Bradley and Jimmy Lin.
Overview
In this paper, the authors identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash, which identifies sentences with high Jaccard similarity, followed by a pass that groups those sentences into clusters. Based on manual examination, the authors found that these clusters fall into six types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
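To make the pipeline described above concrete, the following is a minimal single-machine sketch of minhash-based sentence clustering. It is not the authors' MapReduce implementation: the word 3-gram shingling, the 64 seeded hash functions, the 0.5 threshold, the all-pairs comparison and every function name are assumptions chosen for illustration only.

```python
import hashlib
from itertools import combinations


def shingles(sentence, n=3):
    """Return the set of lowercased word n-grams in a sentence (assumed shingling)."""
    words = sentence.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A and B| / |A or B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """64-bit hash of an item under a given seed (stands in for a random permutation)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """Signature made of the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def cluster(sentences, threshold=0.5, num_hashes=64):
    """Group sentence indices whose estimated similarity meets the threshold.

    Uses union-find over all pairs; a large-scale system would avoid the
    all-pairs comparison, for example by bucketing signatures.
    """
    sigs = [minhash_signature(shingles(s), num_hashes) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster` on a list of sentences returns lists of indices, one list per cluster. Categories such as factual drift would still have to be assigned by inspecting each cluster, as the authors did manually.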
Embed
Wikipedia Quality
Weissman, Sarah; Ayhan, Samet; Bradley, Joshua; Lin, Jimmy. (2015). "[[Identifying Duplicate and Contradictory Information in Wikipedia]]". Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61. ISBN: 978-145033594-2. ISSN: 15525996. DOI: 10.1145/2756406.2756947.
English Wikipedia
{{cite journal |last1=Weissman |first1=Sarah |last2=Ayhan |first2=Samet |last3=Bradley |first3=Joshua |last4=Lin |first4=Jimmy |title=Identifying Duplicate and Contradictory Information in Wikipedia |date=2015 |isbn=978-145033594-2 |issn=15525996 |doi=10.1145/2756406.2756947 |url=https://wikipediaquality.com/wiki/Identifying_Duplicate_and_Contradictory_Information_in_Wikipedia |journal=Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61}}
HTML
Weissman, Sarah; Ayhan, Samet; Bradley, Joshua; Lin, Jimmy. (2015). "<a href="https://wikipediaquality.com/wiki/Identifying_Duplicate_and_Contradictory_Information_in_Wikipedia">Identifying Duplicate and Contradictory Information in Wikipedia</a>". Handbook of Research on Innovations in Information Retrieval, Analysis, and Management September 01, 2015, pp. 41-61. ISBN: 978-145033594-2. ISSN: 15525996. DOI: 10.1145/2756406.2756947.