Building Linguistic Corpora from Wikipedia Articles and Discussions
Authors | Eliza Margaretha, Harald Lüngen |
---|---|
Publication date | 2014 |
Links | Original Preprint |
Building Linguistic Corpora from Wikipedia Articles and Discussions is a scientific work related to Wikipedia quality, published in 2014 and written by Eliza Margaretha and Harald Lüngen.
Overview
Wikipedia is a valuable resource that can serve as a linguistic corpus or as a dataset for many kinds of research. The authors built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Their approach is a two-stage conversion: parsing with the Sweble parser, followed by transformation with XSLT stylesheets. The authors report that this approach generates rich and valid corpora regardless of language. They also introduce a method for segmenting user contributions on talk pages into postings.
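The summary above does not say how the authors segment talk pages, so the following is only a minimal illustrative sketch of one common heuristic, not the method from the paper. It assumes English Wikipedia conventions: a posting ends at a line carrying a user link followed by a UTC timestamp, and leading colons give the reply depth. The names `Posting`, `SIGNATURE` and `segment_postings` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: default English Wikipedia signatures, i.e. a [[User:...]] link
# followed on the same line by a timestamp such as "10:15, 3 May 2014 (UTC)".
SIGNATURE = re.compile(
    r"\[\[User(?: talk)?:(?P<user>[^|\]#/]+)[^\]]*\]\].*?"
    r"(?P<time>\d{2}:\d{2}, \d{1,2} \w+ \d{4} \(UTC\))"
)
INDENT = re.compile(r"^:*")  # leading colons mark the reply depth


@dataclass
class Posting:
    author: str
    timestamp: str
    level: int  # 0 = top-level comment, 1 = reply, and so on
    text: str


def segment_postings(wikitext: str) -> List[Posting]:
    """Split talk page wikitext into postings, one per signed contribution."""
    postings: List[Posting] = []
    buffer: List[str] = []
    for line in wikitext.splitlines():
        if not line.strip():
            continue
        if line.startswith("=="):
            # A heading opens a new thread; pending unsigned text is dropped.
            buffer = []
            continue
        match = SIGNATURE.search(line)
        if match is None:
            buffer.append(line)  # still inside the current posting
            continue
        buffer.append(line[: match.start()])  # keep the text before the signature
        level = len(INDENT.match(buffer[0]).group())
        text = " ".join(part.lstrip(":").strip() for part in buffer).rstrip(" -")
        postings.append(
            Posting(match.group("user").strip(), match.group("time"), level, text)
        )
        buffer = []
    return postings
```

Calling `segment_postings` on the wikitext of a talk page returns the signed postings in page order. Unsigned contributions, customised signatures and non-English timestamp formats are not handled by this heuristic and would need additional rules.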
Embed
Wikipedia Quality
Margaretha, Eliza; Lüngen, Harald. (2014). "[[Building Linguistic Corpora from Wikipedia Articles and Discussions]]".
English Wikipedia
{{cite journal |last1=Margaretha |first1=Eliza |last2=Lüngen |first2=Harald |title=Building Linguistic Corpora from Wikipedia Articles and Discussions |date=2014 |url=https://wikipediaquality.com/wiki/Building_Linguistic_Corpora_from_Wikipedia_Articles_and_Discussions}}
HTML
Margaretha, Eliza; Lüngen, Harald. (2014). "<a href="https://wikipediaquality.com/wiki/Building_Linguistic_Corpora_from_Wikipedia_Articles_and_Discussions">Building Linguistic Corpora from Wikipedia Articles and Discussions</a>".