Gpx@Inex2007: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia

From Wikipedia Quality
Revision as of 22:45, 17 June 2019 by Arianna (talk | contribs) (Adding wikilinks)

Gpx@Inex2007: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia is a scientific work related to Wikipedia quality, published in 2007 and written by Shlomo Geva.

Overview

The INEX 2007 evaluation was based on the Wikipedia collection in XML format. In this paper, the authors describe some modifications to the GPX search engine and the approach taken in the Ad-hoc and Link-the-Wiki tracks. The GPX retrieval strategy is based on the construction of a collection sub-tree consisting of all nodes that contain one or more of the search terms. Nodes containing search terms are assigned a score using the GPX ranking scheme, which incorporates an extended TF-IDF variant. In earlier versions of GPX, scores were recursively propagated from text-containing nodes, through their ancestors, all the way to the document root of the XML tree. The authors describe a simplification whereby the score of each node is computed directly, doing away with the score propagation mechanism. Preliminary results indicate improved performance. The GPX search engine was used in the Link-the-Wiki track to identify prospective incoming links to new Wikipedia pages. The authors also describe a simple and efficient approach to identifying prospective outgoing links in new Wikipedia pages, and present preliminary evaluation results.

1. The GPX Search Engine

For the sake of completeness, the authors provide a very brief description of GPX; the reader is referred to earlier papers on GPX in previous INEX proceedings for a more complete description. The search engine is based on XPath inverted lists. For each term in the collection, the authors maintain an inverted list of XPath specifications. Each entry includes the file name, the absolute XPath identifying a specific XML element, and the term position within the element. The actual data structure is designed for efficient storage and retrieval of these inverted lists, which are considerably less concise than basic text retrieval inverted lists. The authors briefly describe the data structure, then the node scoring calculation, and finally present the results.
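The XPath inverted list and the collection sub-tree described above can be illustrated with a minimal Python sketch. It is not the GPX implementation. The names Posting, build_index, ancestors_and_self and collection_subtree are hypothetical, terms are produced by simple whitespace splitting, and XPaths are assumed to use only positional steps such as /article[1]/body[1]/p[2]. The sketch stops at per-node term counts and does not attempt to reproduce the GPX ranking scheme, whose formula is not given here.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One inverted-list entry (hypothetical layout, not GPX's storage format)."""
    file_name: str  # document the term occurs in
    xpath: str      # absolute XPath of the element holding the text
    position: int   # term offset within that element


def build_index(elements):
    """Map each term to its postings.

    `elements` is an iterable of (file_name, xpath, text) triples.
    Tokenization by whitespace is an assumption made for brevity.
    """
    index = defaultdict(list)
    for file_name, xpath, text in elements:
        for position, term in enumerate(text.lower().split()):
            index[term].append(Posting(file_name, xpath, position))
    return index


def ancestors_and_self(xpath):
    """Yield the XPath itself, then each ancestor up to the document root."""
    steps = xpath.strip("/").split("/")
    for depth in range(len(steps), 0, -1):
        yield "/" + "/".join(steps[:depth])


def collection_subtree(index, query_terms):
    """Return {(file_name, xpath): {term: count}} for nodes containing a query term.

    Each posting adds its count straight to the element and to every ancestor,
    so per-node statistics are available in one pass over the postings,
    without a separate recursive propagation step.
    """
    subtree = defaultdict(lambda: defaultdict(int))
    for term in query_terms:
        term = term.lower()
        for posting in index.get(term, []):
            for node in ancestors_and_self(posting.xpath):
                subtree[(posting.file_name, node)][term] += 1
    return subtree


# Toy collection: two documents, three text-bearing elements.
elements = [
    ("a.xml", "/article[1]/body[1]/p[1]", "XML retrieval with inverted lists"),
    ("a.xml", "/article[1]/body[1]/p[2]", "ranking XML elements"),
    ("b.xml", "/article[1]/title[1]", "link discovery"),
]
index = build_index(elements)
subtree = collection_subtree(index, ["xml", "ranking"])
```

In this toy collection the sub-tree contains only nodes from a.xml, because b.xml holds neither query term. A ranking function, such as a TF-IDF variant, would then be applied to the per-node counts.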