Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search

From Wikipedia Quality
Revision as of 09:52, 4 June 2019 by Aurora (talk | contribs) (Basic information on Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search)

Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search is a scientific work related to Wikipedia quality, published in 2007 and written by Martin Potthast.

Overview

The authors develop and implement a new indexing technology that allows complete (and possibly very large) documents to be used as queries, with retrieval performance comparable to that of a standard term query. The approach targets retrieval tasks such as near-duplicate detection and high similarity search. To demonstrate the performance of the technology, the authors compiled the search index "Wikipedia in the Pocket", which contains about 2 million English and German Wikipedia articles. This index, together with a search interface, fits on a conventional CD (0.7 gigabytes). The ingredients of the indexing technology are similarity hashing and minimal perfect hashing.
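To make the idea of similarity hashing more concrete, the following is a minimal sketch, not the method used in the work itself. The summary above does not say which similarity hash function is used, so the sketch substitutes SimHash, a well-known similarity hash, purely for illustration. All function names (`tokens`, `simhash`, `hamming`, `build_index`, `lookup`) and the sample documents are hypothetical. The point it shows is that a whole document can serve as the query: it is reduced to one fingerprint, and retrieval becomes a single hash-table lookup.

```python
import hashlib
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Compute a SimHash fingerprint (an assumed stand-in similarity hash).

    Documents that share most of their tokens tend to receive
    fingerprints that differ in only a few bit positions.
    """
    counts = [0] * bits
    for tok in tokens(text):
        digest = hashlib.sha1(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def build_index(docs):
    """Map each fingerprint to the ids of the documents that produced it."""
    index = {}
    for doc_id, text in docs.items():
        index.setdefault(simhash(text), []).append(doc_id)
    return index


def lookup(index, query_text):
    """Use a complete document as the query: one hash, one table lookup."""
    return index.get(simhash(query_text), [])


# Hypothetical sample collection.
docs = {
    "doc1": "The quick brown fox jumps over the lazy dog",
    "doc2": "Minimal perfect hashing maps n keys to n distinct slots",
}
index = build_index(docs)

# A reordered, re-cased copy of doc1 has the same tokens, hence the same fingerprint.
print(lookup(index, "jumps over the lazy dog THE QUICK brown fox"))
```

This sketch only retrieves documents whose fingerprint matches exactly. A practical near-duplicate search would presumably also probe fingerprints within a small Hamming distance of the query, and a minimal perfect hash function could replace the Python dictionary to store the fingerprint table compactly; neither is shown here.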