Building Linguistic Corpora from Wikipedia Articles and Discussions

Revision as of 09:10, 19 July 2019 by Liliana


Authors
Eliza Margaretha
Harald Lüngen
Publication date
2014
Links
Original Preprint

Building Linguistic Corpora from Wikipedia Articles and Discussions is a scientific work related to Wikipedia quality, published in 2014 and written by Eliza Margaretha and Harald Lüngen.

Overview

Wikipedia is a valuable resource, useful as a linguistic corpus or as a dataset for many kinds of research. The authors built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). The authors' approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion approach successfully generates rich and valid corpora regardless of language. The authors also introduce a method to segment user contributions in talk pages into postings.
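This page does not describe how the authors' segmentation method works. The idea of splitting a talk page into postings can still be illustrated with a rough heuristic. The sketch below is not the authors' method and uses neither Sweble nor XSLT. It assumes English Wikipedia conventions: each posting ends with a signature timestamp such as "09:10, 19 July 2019 (UTC)", and reply depth is marked by leading colons. The names `Posting`, `segment_postings`, `TIMESTAMP`, and `USER_LINK` are hypothetical and introduced only for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed signature timestamp format of English Wikipedia talk pages,
# e.g. "09:10, 19 July 2019 (UTC)". Other language editions differ.
TIMESTAMP = re.compile(r"\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")
# Assumed form of a user page link inside a signature: [[User:Name|...]]
USER_LINK = re.compile(r"\[\[User:([^|\]]+)")


@dataclass
class Posting:
    author: Optional[str]     # last user link found in the posting, if any
    timestamp: Optional[str]  # signature timestamp, if any
    level: int                # reply depth, counted from leading colons
    text: str                 # posting text with the indentation removed


def _make_posting(lines: List[str], timestamp: Optional[str]) -> Posting:
    first = lines[0]
    level = len(first) - len(first.lstrip(":"))
    text = "\n".join(line.lstrip(":").strip() for line in lines)
    users = USER_LINK.findall(text)
    return Posting(users[-1] if users else None, timestamp, level, text)


def segment_postings(wikitext: str) -> List[Posting]:
    """Heuristic: a posting ends at the first line containing a timestamp."""
    postings: List[Posting] = []
    buffer: List[str] = []
    for line in wikitext.splitlines():
        if not line.strip():
            continue  # skip blank lines between postings
        buffer.append(line)
        match = TIMESTAMP.search(line)
        if match:
            postings.append(_make_posting(buffer, match.group(0)))
            buffer = []
    if buffer:
        # Trailing unsigned text is kept as a posting without a timestamp.
        postings.append(_make_posting(buffer, None))
    return postings


sample = (
    "I think the lead needs a source. [[User:Alice|Alice]] "
    "([[User talk:Alice|talk]]) 09:10, 19 July 2019 (UTC)\n"
    ":Agreed, I added one. [[User:Bob|Bob]] 10:02, 19 July 2019 (UTC)\n"
)
posts = segment_postings(sample)
```

A heuristic like this misses unsigned contributions and non-standard signatures, which is one reason a parser-based approach like the one the authors describe may be preferable on real data.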

Embed

Wikipedia Quality

Margaretha, Eliza; Lüngen, Harald. (2014). "[[Building Linguistic Corpora from Wikipedia Articles and Discussions]]".

English Wikipedia

{{cite journal |last1=Margaretha |first1=Eliza |last2=Lüngen |first2=Harald |title=Building Linguistic Corpora from Wikipedia Articles and Discussions |date=2014 |url=https://wikipediaquality.com/wiki/Building_Linguistic_Corpora_from_Wikipedia_Articles_and_Discussions}}

HTML

Margaretha, Eliza; Lüngen, Harald. (2014). &quot;<a href="https://wikipediaquality.com/wiki/Building_Linguistic_Corpora_from_Wikipedia_Articles_and_Discussions">Building Linguistic Corpora from Wikipedia Articles and Discussions</a>&quot;.