Difference between revisions of "Mining the Spoken Wikipedia for Speech Data and Beyond"
+ | {{Infobox work | ||
+ | | title = Mining the Spoken Wikipedia for Speech Data and Beyond | ||
+ | | date = 2016 | ||
+ | | authors = [[Arne Köhn]]<br />[[Florian Stegen]]<br />[[Timo Baumann]] | ||
+ | | link = http://edoc.sub.uni-hamburg.de/informatik/volltexte/2016/220/pdf/koehn_spoken_wikipedia.pdf | ||
+ | }} | ||
'''Mining the Spoken Wikipedia for Speech Data and Beyond''' is a scientific work related to [[Wikipedia quality]], published in 2016 and written by [[Arne Köhn]], [[Florian Stegen]] and [[Timo Baumann]].
== Overview ==
The authors present a corpus of time-aligned spoken data from [[Wikipedia]] articles, together with a pipeline that can generate such corpora for many languages. Initiatives to create and sustain spoken Wikipedia versions exist in many languages, so the data is freely available, grows over time, and can be used for automatic corpus creation. The authors' pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which the authors align 71h in full sentences and another 86h in sentences with some missing words. The English corpus consists of 287h, of which the authors align 27h in full sentences and 157h in sentences with some missing words. The results are publicly available.
+ | |||
+ | == Embed == | ||
+ | === Wikipedia Quality === | ||
+ | <code> | ||
+ | <nowiki> | ||
+ | Köhn, Arne; Stegen, Florian; Baumann, Timo. (2016). "[[Mining the Spoken Wikipedia for Speech Data and Beyond]]". Fachbereich Informatik. | ||
+ | </nowiki> | ||
+ | </code> | ||
+ | |||
+ | === English Wikipedia === | ||
+ | <code> | ||
+ | <nowiki> | ||
+ | {{cite journal |last1=Köhn |first1=Arne |last2=Stegen |first2=Florian |last3=Baumann |first3=Timo |title=Mining the Spoken Wikipedia for Speech Data and Beyond |date=2016 |url=https://wikipediaquality.com/wiki/Mining_the_Spoken_Wikipedia_for_Speech_Data_and_Beyond |journal=Fachbereich Informatik}} | ||
+ | </nowiki> | ||
+ | </code> | ||
+ | |||
+ | === HTML === | ||
+ | <code> | ||
+ | <nowiki> | ||
+ | Köhn, Arne; Stegen, Florian; Baumann, Timo. (2016). &quot;<a href="https://wikipediaquality.com/wiki/Mining_the_Spoken_Wikipedia_for_Speech_Data_and_Beyond">Mining the Spoken Wikipedia for Speech Data and Beyond</a>&quot;. Fachbereich Informatik. | ||
+ | </nowiki> | ||
+ | </code> | ||
+ | |||
+ | |||
+ | |||
+ | [[Category:Scientific works]] | ||
+ | [[Category:English Wikipedia]] | ||
+ | [[Category:German Wikipedia]] |
Latest revision as of 08:53, 18 November 2020
Authors | Arne Köhn, Florian Stegen, Timo Baumann |
---|---|
Publication date | 2016 |
Links | Original: http://edoc.sub.uni-hamburg.de/informatik/volltexte/2016/220/pdf/koehn_spoken_wikipedia.pdf |
Mining the Spoken Wikipedia for Speech Data and Beyond is a scientific work related to Wikipedia quality, published in 2016 and written by Arne Köhn, Florian Stegen and Timo Baumann.
Overview
The authors present a corpus of time-aligned spoken data from Wikipedia articles, together with a pipeline that can generate such corpora for many languages. Initiatives to create and sustain spoken Wikipedia versions exist in many languages, so the data is freely available, grows over time, and can be used for automatic corpus creation. The authors' pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which the authors align 71h in full sentences and another 86h in sentences with some missing words. The English corpus consists of 287h, of which the authors align 27h in full sentences and 157h in sentences with some missing words. The results are publicly available.
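To put the reported hours in proportion, the following minimal sketch computes what share of each corpus is aligned. The hour figures come from the overview; the names `CORPORA` and `coverage` are illustrative and are not part of the authors' pipeline. It assumes that the fully aligned and partially aligned hours are disjoint subsets of the total, which the wording "another 86h" suggests but does not state outright.

```python
# Hours of audio as reported in the overview. The dictionary layout and all
# names here are hypothetical, chosen only for this illustration.
CORPORA = {
    "German": {"total": 293, "full": 71, "partial": 86},
    "English": {"total": 287, "full": 27, "partial": 157},
}


def coverage(stats):
    """Return (full, partial, combined) aligned fractions of the total audio.

    Assumes the "full" and "partial" hours do not overlap.
    """
    total = stats["total"]
    full = stats["full"] / total
    partial = stats["partial"] / total
    combined = (stats["full"] + stats["partial"]) / total
    return full, partial, combined


if __name__ == "__main__":
    for language, stats in CORPORA.items():
        full, partial, combined = coverage(stats)
        print(
            f"{language}: {full:.1%} full sentences, "
            f"{partial:.1%} with missing words, {combined:.1%} combined"
        )
```

Under that assumption, a little over half of the German audio and a little under two thirds of the English audio is aligned in some form, while the share of fully aligned sentences is considerably higher for German than for English.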
Embed
Wikipedia Quality
Köhn, Arne; Stegen, Florian; Baumann, Timo. (2016). "[[Mining the Spoken Wikipedia for Speech Data and Beyond]]". Fachbereich Informatik.
English Wikipedia
{{cite journal |last1=Köhn |first1=Arne |last2=Stegen |first2=Florian |last3=Baumann |first3=Timo |title=Mining the Spoken Wikipedia for Speech Data and Beyond |date=2016 |url=https://wikipediaquality.com/wiki/Mining_the_Spoken_Wikipedia_for_Speech_Data_and_Beyond |journal=Fachbereich Informatik}}
HTML
Köhn, Arne; Stegen, Florian; Baumann, Timo. (2016). "<a href="https://wikipediaquality.com/wiki/Mining_the_Spoken_Wikipedia_for_Speech_Data_and_Beyond">Mining the Spoken Wikipedia for Speech Data and Beyond</a>". Fachbereich Informatik.