Difference between revisions of "Phrase Detection in the Wikipedia"

From Wikipedia Quality

Revision as of 08:49, 5 December 2019

Phrase Detection in the Wikipedia is a scientific work related to Wikipedia quality, published in 2008 and written by Miro Lehtonen and Antoine Doucet.

Overview

The Wikipedia XML collection turned out to be rich in marked-up phrases when the authors carried out their INEX 2007 experiments. Assuming that a phrase occurs at the inline level of the markup, the authors were able to identify over 18 million phrase occurrences, most of which were either the anchor text of a hyperlink or a passage of text with added emphasis. As the IR system, EXTIRP, indexed the documents, the detected inline-level elements were duplicated in the markup, with two direct consequences: 1) the frequency of the phrase terms increased, and 2) the word sequences changed. Because the markup was manipulated before word sequences were computed for a phrase index, the actual multi-word phrases became easier to detect. The effect of duplicating the inline-level elements was tested by producing two run submissions that were similar except for the duplication. According to the official INEX 2007 metric, the positive effect of duplicated phrases was clear.
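
The two consequences of duplication can be illustrated with a minimal sketch. This is not the EXTIRP implementation, which the summary does not describe in detail. The tag names in INLINE_TAGS, the helper functions, and the sample sentence are all hypothetical and chosen only for illustration. The sketch copies each inline element in place and then compares term counts and adjacent word pairs before and after.

```python
import copy
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inline-level tag names; the real collection's tags may differ.
INLINE_TAGS = {"link", "em"}


def duplicate_inline(root):
    """Insert a copy of every inline element directly after the original."""
    # Snapshot the elements first so newly inserted copies are not revisited.
    for parent in list(root.iter()):
        # Walk children from last to first so insert positions stay valid.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag in INLINE_TAGS:
                clone = copy.deepcopy(child)
                clone.tail = child.tail  # trailing text now follows the copy
                child.tail = " "         # keep the two copies as separate words
                parent.insert(index + 1, clone)
    return root


def tokens(root):
    """Lower-cased word tokens of the element's full text content."""
    return re.findall(r"\w+", "".join(root.itertext()).lower())


def bigrams(words):
    """Counts of adjacent word pairs, a stand-in for a phrase index."""
    return Counter(zip(words, words[1:]))


doc = ET.fromstring(
    "<p>The <link>information retrieval</link> system indexed it.</p>"
)
before = tokens(doc)
after = tokens(duplicate_inline(doc))

print(Counter(before)["retrieval"], Counter(after)["retrieval"])
print(bigrams(before)[("information", "retrieval")],
      bigrams(after)[("information", "retrieval")])
```

In this toy document the terms inside the marked-up phrase occur more often after duplication, and the word pair formed by the phrase is counted more than once, so it stands out more against incidental word pairs. The sketch does not treat nested inline elements specially; whether and how the original system handled them is not stated here.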