Difference between revisions of "Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search"
(wikilinks) |
(Adding infobox) |
Line 1:
+ {{Infobox work
+ | title = Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search
+ | date = 2007
+ | authors = [[Martin Potthast]]
+ | doi = 10.1145/1277741.1277977
+ | link = http://dl.acm.org/citation.cfm?doid=1277741.1277977
+ }}
Revision as of 06:11, 28 August 2019
Authors | Martin Potthast |
---|---|
Publication date | 2007 |
DOI | 10.1145/1277741.1277977 |
Links | Original (http://dl.acm.org/citation.cfm?doid=1277741.1277977) |
Wikipedia in the Pocket: Indexing Technology for Near-Duplicate Detection and High Similarity Search is a scientific work related to Wikipedia quality, written by Martin Potthast and published in 2007.
Overview
The author develops and implements a new indexing technology that allows complete (and possibly very large) documents to be used as queries, with retrieval performance comparable to that of a standard term query. The approach targets retrieval tasks such as near-duplicate detection and high similarity search. To demonstrate the technology's performance, the author compiled the search index "Wikipedia in the Pocket", which contains about 2 million English and German Wikipedia articles. This index, along with a search interface, fits on a conventional CD (0.7 gigabyte). The ingredients of the indexing technology are similarity hashing and minimal perfect hashing.
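The overview names similarity hashing without describing the concrete hash function, so the following is only a minimal sketch of the general idea: a whole document is reduced to one fingerprint, and retrieval becomes a single key lookup instead of a term-by-term comparison. It substitutes a SimHash-style fingerprint, which is an assumption and not necessarily the method used in the work. The names `tokens`, `simhash`, `hamming`, `build_index` and `candidates` are hypothetical, and a plain Python `dict` stands in for the minimal perfect hash table.

```python
import hashlib
import re
from collections import defaultdict

BITS = 64  # fingerprint width (an illustrative choice, not taken from the work)


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=BITS):
    """SimHash-style fingerprint of a whole document.

    Every token votes +1 or -1 on each bit position; the sign of the
    vote total decides the bit. Documents that share most tokens tend
    to agree on most bits.
    """
    votes = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def build_index(docs):
    """Map fingerprint -> list of document ids (dict stands in for an MPH table)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        index[simhash(text)].append(doc_id)
    return index


def candidates(index, query_text):
    """Use a complete document as the query: one hash, one lookup."""
    return index.get(simhash(query_text), [])


if __name__ == "__main__":
    docs = {
        "a": "Similarity hashing maps documents to keys.",
        "b": "Minimal perfect hashing stores keys compactly.",
    }
    index = build_index(docs)
    # Same words, different case and spacing: identical fingerprint.
    print(candidates(index, "similarity   HASHING maps documents to keys"))
    print(hamming(simhash(docs["a"]), simhash(docs["b"])))
```

In this sketch only exact fingerprint matches are returned; a practical near-duplicate search would also have to cover fingerprints within a small Hamming distance, for example by indexing several sub-keys per document. Because the key set of a finished index is fixed, a minimal perfect hash function could replace the `dict` and store each key without collisions or empty slots, which is one plausible reason such an index can be kept small.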