Unsupervised Construction of a Word List on Tourism from Wikipedia
Unsupervised Construction of a Word List on Tourism from Wikipedia is a scientific work related to Wikipedia quality, published in 2015 and written by Dittaya Wanvarie, Sansanee Ek-atchariya and Thanakon Kaewwipat.
Overview
The demand for word lists in a specialized domain is increasing in language learning. Authors propose an unsupervised framework to extract a word list from Wikipedia data for a language learning class specialized on tourism. Authors extract topics in Wikipedia articles using non-negative matrix factorization. Each topic is classified as tourism related or not using articles in WikiVoyage. Authors choose paragraphs in Wikipedia that are classified as in-domain and rank words in such paragraphs by their frequencies. The proposed framework retrieves more than 90% of words in the gold list, but the extracted list still includes a large number of general terms.