The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening
The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening is a scientific work related to Wikipedia quality, published in 2018 and written by Timo Baumann, Arne Köhn and Felix Hennig.
Spoken corpora are important for speech research, but they are expensive to create and do not necessarily reflect (read or spontaneous) speech ‘in the wild’. The authors report on the conversion of the preexisting and freely available Spoken Wikipedia into a speech resource. The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. There are initiatives to create and sustain Spoken Wikipedia versions in many languages, and hence the available data grows over time. Thousands of spoken articles are available to users who prefer a spoken version over the written one. The authors turn these semi-structured collections into structured and time-aligned corpora, keeping the exact correspondence with the original hypertext as well as all available metadata. In this way, they make the Spoken Wikipedia accessible for sustainable research. The authors present an open-source software pipeline that downloads, extracts and normalizes the Spoken Wikipedia and aligns its text with the speech. Additional language versions can be exploited by adapting configuration files or, where language peculiarities require it, by extending the software. The authors also present and analyze the resulting corpora for German, English, and Dutch, which presently total 1005 h and grow at an estimated 87 h per year. The corpora, together with the software, are available via http://islrn.org/resources/684-927-624-257-3/. As a prototype use of the time-aligned corpus, the authors describe an experiment on the preferred modalities for interacting with information-rich read-out hypertext. They find that alignments help improve user experience and access to factual information by enabling targeted interaction.
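To illustrate how a time-aligned corpus can enable targeted interaction, the following is a minimal sketch in Python. It is not the authors' software or data format: the `AlignedToken` record, its fields, and the helper functions `token_at` and `next_link_start` are hypothetical names introduced here, assuming only that each token of the article text carries a start time, an end time and, optionally, the hyperlink it belongs to.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlignedToken:
    """Hypothetical record for one aligned token (not the corpus format)."""
    text: str                   # token as it appears in the article text
    start: float                # start time in the recording, in seconds
    end: float                  # end time in the recording, in seconds
    link: Optional[str] = None  # link target if the token is part of a hyperlink


def token_at(tokens: List[AlignedToken], t: float) -> Optional[AlignedToken]:
    """Return the token spoken at time t, or None if t falls in a gap.

    Assumes tokens are sorted by start time and do not overlap.
    """
    starts = [tok.start for tok in tokens]
    i = bisect_right(starts, t) - 1  # last token starting at or before t
    if i >= 0 and tokens[i].start <= t < tokens[i].end:
        return tokens[i]
    return None


def next_link_start(tokens: List[AlignedToken], t: float) -> Optional[float]:
    """Return the start time of the first linked token at or after time t."""
    for tok in tokens:
        if tok.link is not None and tok.start >= t:
            return tok.start
    return None


# Made-up example data, for illustration only.
tokens = [
    AlignedToken("Berlin", 0.0, 0.5, link="Berlin"),
    AlignedToken("is", 0.5, 0.7),
    AlignedToken("the", 0.7, 0.8),
    AlignedToken("capital", 0.9, 1.4, link="Capital city"),
]
```

With such a structure, a player could map the current playback position back to the hypertext with `token_at(tokens, position)` or skip ahead to the next spoken hyperlink with `next_link_start(tokens, position)`. Both operations depend on the exact correspondence between audio and text that the alignment provides.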