Mining Wikipedia's Article Revision History for Training Computational Linguistics Algorithms

'''Mining Wikipedia's Article Revision History for Training Computational Linguistics Algorithms''' - a scientific work related to [[Wikipedia quality]], published in 2008 and written by [[Rani Nelken]] and [[Elif Yamangil]].
  
 
== Overview ==
 
The authors present a novel paradigm for obtaining large amounts of training data for computational linguistics tasks by mining [[Wikipedia]]'s article revision history. By comparing adjacent versions of the same article, they extract voluminous training data for tasks for which data is usually scarce or costly to obtain. They illustrate this paradigm by applying it to three text processing tasks at different levels of linguistic granularity. First, they apply the approach to collecting textual errors and their corrections, focusing on the specific type of lexical error known as an “eggcorn”. Second, moving up to the sentential level, they show how to mine Wikipedia revisions for training sentence compression algorithms; by dramatically increasing the size of the available training data, they build more discerning lexicalized models that yield improved compression results. Finally, moving up to the document level, they present preliminary ideas on how to use the Wikipedia data to bootstrap text summarization systems, proposing a sentence's persistence throughout a document's evolution as an indicator of its fitness for inclusion in an extractive summary.
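
The core idea, comparing adjacent revisions of the same article to harvest naturally occurring edit pairs, can be illustrated with a short sketch. The following Python snippet is only an illustration of that idea, not the authors' actual pipeline: the naive sentence splitter, the similarity thresholds, and the helper names are assumptions chosen for the example.

<syntaxhighlight lang="python">
# Illustrative sketch (not the paper's implementation): given two adjacent
# revisions of an article as plain text, align their sentences and keep the
# pairs in which a sentence was edited. Such pairs are candidate training
# data for tasks like error correction or sentence compression.
import difflib
import re


def split_sentences(text):
    """Very naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def edited_sentence_pairs(old_text, new_text, min_ratio=0.6, max_ratio=0.95):
    """Return (old_sentence, new_sentence) pairs that differ between two revisions."""
    old_sents = split_sentences(old_text)
    new_sents = split_sentences(new_text)
    pairs = []
    # Align the two sentence lists and look only at replaced regions.
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old_sents, b=new_sents).get_opcodes():
        if tag != "replace":
            continue
        # Greedily pair replaced sentences; keep pairs that are similar enough
        # to be the "same" sentence but not identical, i.e. a genuine edit.
        for old_s, new_s in zip(old_sents[i1:i2], new_sents[j1:j2]):
            ratio = difflib.SequenceMatcher(a=old_s, b=new_s).ratio()
            if min_ratio <= ratio <= max_ratio:
                pairs.append((old_s, new_s))
    return pairs


if __name__ == "__main__":
    rev_old = "The team have went to the stadium. It was a sunny day."
    rev_new = "The team went to the stadium. It was a sunny day."
    for old_s, new_s in edited_sentence_pairs(rev_old, rev_new):
        print(old_s, "->", new_s)
</syntaxhighlight>

In practice, pairs harvested this way would still need task-specific filtering (for example, to discard vandalism and its reverts) before being used as training data.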
