Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search

Authors: Martin Potthast
Publication date: 2007
DOI: 10.1145/1277741.1277977

Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search is a scientific work related to Wikipedia quality, published in 2007 and written by Martin Potthast.

Overview

The author develops and implements a new indexing technology that allows complete (and possibly very large) documents to be used as queries, with retrieval performance comparable to that of a standard term query. The approach targets retrieval tasks such as near-duplicate detection and high similarity search. To demonstrate the technology's performance, the author compiled the search index "Wikipedia in the Pocket", which contains about 2 million English and German Wikipedia articles. This index, along with a search interface, fits on a conventional CD (0.7 gigabyte). The ingredients of the indexing technology are similarity hashing and minimal perfect hashing.
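The overview names the two ingredients but does not say which concrete algorithms the work uses. The following is a minimal sketch of how the two ideas can fit together, under stated assumptions: the similarity hash is a SimHash-style bit-vote fingerprint (a swapped-in technique, not necessarily the one used in the paper), and the minimal perfect hash is found by a toy brute-force seed search that only works for tiny key sets. All function names (`stable_hash`, `simhash`, `hamming`, `find_mph_seed`, `slot`) and the sample documents are hypothetical.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def stable_hash(key: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a string, varied by an integer seed."""
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def simhash(text: str) -> int:
    """SimHash-style fingerprint: each term votes on every bit position."""
    votes = [0] * BITS
    for term, count in Counter(re.findall(r"\w+", text.lower())).items():
        h = stable_hash(term)
        for bit in range(BITS):
            votes[bit] += count if (h >> bit) & 1 else -count
    return sum(1 << bit for bit, vote in enumerate(votes) if vote > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def slot(key: int, seed: int, n: int) -> int:
    """Table slot of a fingerprint under the given seed."""
    return stable_hash(str(key), seed) % n


def find_mph_seed(keys, max_seed: int = 100_000) -> int:
    """Toy minimal perfect hash: try seeds until n keys fill slots 0..n-1."""
    n = len(keys)
    for seed in range(max_seed):
        if len({slot(k, seed, n) for k in keys}) == n:
            return seed
    raise ValueError("no seed found; large key sets need a real MPH construction")


# Hypothetical sample documents; "a" and "b" contain the same words.
docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "Over the lazy dog jumps the quick brown fox!",
    "c": "Minimal perfect hashing maps n keys to n slots",
}
fingerprints = {doc_id: simhash(text) for doc_id, text in docs.items()}

# Index: one compact table slot per distinct fingerprint.
keys = sorted(set(fingerprints.values()))
seed = find_mph_seed(keys)
table = [[] for _ in keys]
for doc_id, fp in fingerprints.items():
    table[slot(fp, seed, len(keys))].append(doc_id)

# Query with a whole document: one fingerprint, one table lookup.
# (A real index would also store the fingerprint in each slot to reject
# queries whose fingerprint was never indexed.)
query_fp = simhash("the quick brown fox jumps over the lazy dog")
candidates = table[slot(query_fp, seed, len(keys))]
```

In this sketch, a query of any length is reduced to a single fingerprint, so looking up candidate duplicates costs one hash computation and one table access. Because a minimal perfect hash leaves no empty slots, the table needs exactly one entry per distinct fingerprint. The `hamming` helper is included to compare fingerprints of documents that are similar but not word-for-word identical.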

Embed

Wikipedia Quality

Potthast, Martin. (2007). "[[Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search]]". DOI: 10.1145/1277741.1277977.

English Wikipedia

{{cite journal |last1=Potthast |first1=Martin |title=Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search |date=2007 |doi=10.1145/1277741.1277977 |url=https://wikipediaquality.com/wiki/Wikipedia_in_the_Pocket:_Indexing_Technology_for_Near-Duplicate_Detection_and_High_Similarity_Search}}

HTML

Potthast, Martin. (2007). &quot;<a href="https://wikipediaquality.com/wiki/Wikipedia_in_the_Pocket:_Indexing_Technology_for_Near-Duplicate_Detection_and_High_Similarity_Search">Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search</a>&quot;. DOI: 10.1145/1277741.1277977.