A Comparable Wikipedia Corpus: from Wiki Syntax to Pos Tagged Xml

From Wikipedia Quality
Revision as of 21:12, 15 July 2019 by Mila (talk | contribs) (A Comparable Wikipedia Corpus: from Wiki Syntax to Pos Tagged Xml - new page)

A Comparable Wikipedia Corpus: from Wiki Syntax to Pos Tagged Xml is a scientific work related to Wikipedia quality, published in 2011 and written by Noah Bubenhofer, Stefanie Haupt and Horst Schwinn.

Overview

To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar research, the authors used a set of XSLT stylesheets to transform the MediaWiki annotations into XML. The data was then annotated with word class (part-of-speech) information using different taggers. The outcome is a corpus with rich metadata and linguistic annotation that can be used for multilingual research on various linguistic topics.
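
To make the idea of going from wiki syntax to POS-tagged XML more concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the paper uses XSLT stylesheets and real taggers, whereas this sketch swaps in regular expressions and a toy lookup table. The element names (`article`, `heading`, `p`, `w`), the `pos` attribute, the tag labels and the function names are all assumptions made for illustration, not the corpus schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon standing in for a real tagger (assumption).
TOY_LEXICON = {"the": "DET", "corpus": "NOUN", "is": "VERB", "large": "ADJ"}

# "== Title ==" style headings; the number of "=" gives the level.
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1$")
# "[[target|label]]" or "[[target]]" links; group 1 is the visible text.
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def strip_inline_markup(text):
    """Replace links with their visible text and drop bold/italic quotes."""
    text = LINK.sub(r"\1", text)
    return re.sub(r"'{2,}", "", text)


def tag_tokens(parent, text):
    """Append one <w pos="..."> child per token, using the toy lexicon."""
    for token in re.findall(r"\w+|[^\w\s]", text):
        word = ET.SubElement(parent, "w")
        if token[0].isalnum():
            word.set("pos", TOY_LEXICON.get(token.lower(), "UNK"))
        else:
            word.set("pos", "PUNCT")
        word.text = token


def wiki_to_tagged_xml(title, lang, wikitext):
    """Build an <article> element with metadata attributes and tagged tokens."""
    article = ET.Element("article", {"title": title, "lang": lang})
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = str(len(match.group(1)))
            element = ET.SubElement(article, "heading", {"level": level})
            tag_tokens(element, strip_inline_markup(match.group(2)))
        else:
            paragraph = ET.SubElement(article, "p")
            tag_tokens(paragraph, strip_inline_markup(line))
    return article


sample = "== Overview ==\nThe [[Text corpus|corpus]] is '''large'''."
root = wiki_to_tagged_xml("Demo", "en", sample)
```

Calling `ET.tostring(root, encoding="unicode")` serializes the result. Keeping the language and title as attributes on the root element is one simple way to carry metadata alongside the token-level annotation. A real conversion would need to handle templates, tables and nested markup, which this sketch ignores.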