Improving Classification Accuracy Using Automatically Extracted Training Data

Authors	Ariel D. Fuxman, Anitha Kannan, Andrew B. Goldberg, Rakesh Agrawal, Panayiotis Tsaparas, John C. Shafer
Publication date	2009
ISBN	978-160558495-9
DOI	10.1145/1557019.1557143

Improving Classification Accuracy Using Automatically Extracted Training Data is a scientific work about Wikipedia quality, published in 2009 and written by Ariel D. Fuxman, Anitha Kannan, Andrew B. Goldberg, Rakesh Agrawal, Panayiotis Tsaparas and John C. Shafer.

Overview

Classification is a core task in knowledge discovery and data mining (KDD), and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training data sets running into billions of words have reportedly been used. The authors explore how the lessons from the NLP community can be applied to KDD tasks. Specifically, they investigate how to identify data sources that can yield training data at low cost, and they study whether the quantity of the automatically extracted training data can compensate for its lower quality. They carry out this investigation for the specific task of inferring whether a search query has commercial intent. They mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and noncommercial (e.g., Wikipedia), and compare the accuracy obtained using such training data against the accuracy obtained using manually labeled training data. Their results show that large accuracy gains can be obtained using automatically extracted training data at much lower cost.
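The labeling idea described in the overview can be illustrated with a minimal sketch. It is not the authors' implementation: the log format (pairs of query and clicked URL), the seed site lists, the helper names and the choice of a word-level naive Bayes classifier are all assumptions made for illustration. The sketch labels a query as commercial or noncommercial according to the site it was issued on, then trains a simple classifier on the resulting noisy labels.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed lists; the overview names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url):
    """Return the last two host labels, e.g. 'en.wikipedia.org' -> 'wikipedia.org'.

    This is a simplification: it mishandles suffixes such as 'co.uk'.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def auto_label(click_log):
    """Turn (query, url) pairs into (query, label) pairs; 1 = commercial, 0 = noncommercial.

    Queries from sites on neither seed list are dropped.
    """
    labeled = []
    for query, url in click_log:
        site = site_of(url)
        if site in COMMERCIAL_SITES:
            labeled.append((query, 1))
        elif site in NONCOMMERCIAL_SITES:
            labeled.append((query, 0))
    return labeled


def train_naive_bayes(labeled):
    """Count words per class; assumed classifier, not the one used in the paper."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for query, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(query.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, query):
    """Return 1 if the query scores higher as commercial, else 0."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Add-one smoothing on the prior and on word likelihoods avoids log(0).
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for word in query.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0
```

In this sketch the labels cost nothing to produce but are noisy, since not every query issued on a commercial site has commercial intent. The question studied in the paper is whether a large volume of such labels can make up for that noise when compared with a smaller, manually labeled set.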

Embed

Wikipedia Quality

Fuxman, Ariel D.; Kannan, Anitha; Goldberg, Andrew B.; Agrawal, Rakesh; Tsaparas, Panayiotis; Shafer, John C. (2009). "[[Improving Classification Accuracy Using Automatically Extracted Training Data]]". Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009, pp. 1145-1153. ISBN: 978-160558495-9. DOI: 10.1145/1557019.1557143.

English Wikipedia

{{cite journal |last1=Fuxman |first1=Ariel D. |last2=Kannan |first2=Anitha |last3=Goldberg |first3=Andrew B. |last4=Agrawal |first4=Rakesh |last5=Tsaparas |first5=Panayiotis |last6=Shafer |first6=John C. |title=Improving Classification Accuracy Using Automatically Extracted Training Data |date=2009 |isbn=978-160558495-9 |doi=10.1145/1557019.1557143 |url=https://wikipediaquality.com/wiki/Improving_Classification_Accuracy_Using_Automatically_Extracted_Training_Data |journal=Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009, pp. 1145-1153}}

HTML

Fuxman, Ariel D.; Kannan, Anitha; Goldberg, Andrew B.; Agrawal, Rakesh; Tsaparas, Panayiotis; Shafer, John C. (2009). "<a href="https://wikipediaquality.com/wiki/Improving_Classification_Accuracy_Using_Automatically_Extracted_Training_Data">Improving Classification Accuracy Using Automatically Extracted Training Data</a>". Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009, pp. 1145-1153. ISBN: 978-160558495-9. DOI: 10.1145/1557019.1557143.
