Latest revision as of 12:43, 26 March 2021
Authors | Marcus Klang, Pierre Nugues |
---|---|
Publication date | 2016 |
Links | Original |
Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format is a scientific work related to Wikipedia quality, published in 2016 and written by Marcus Klang and Pierre Nugues.
Overview
Wikipedia has become one of the most popular resources in natural language processing and is used in a large number of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs a specific parser for each language: English, French, Italian, etc. In addition, the different Wikipedia resources (main article text, categories, Wikidata, infoboxes) are scattered within the article document or across different files, which makes it difficult to have a global view of this outstanding resource. In this paper, the authors describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries to extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extensible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.
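The kind of query mentioned above (articles on persons that exist in French, English, and Spanish) can be sketched in plain SQL. The example below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL over Parquet files, and the table name `wiki`, the columns `entity`, `lang`, `predicate`, `value`, and the sample rows are all hypothetical. The actual WikiParq schema and predicate names may differ.

```python
import sqlite3

# Hypothetical, simplified tabulated layout: one row per
# (entity, language, predicate, value). Not the real WikiParq schema.
ROWS = [
    ("Victor_Hugo", "fr", "instance_of", "person"),
    ("Victor_Hugo", "en", "instance_of", "person"),
    ("Victor_Hugo", "es", "instance_of", "person"),
    ("Marie_Curie", "fr", "instance_of", "person"),
    ("Marie_Curie", "en", "instance_of", "person"),
    ("Paris", "fr", "instance_of", "city"),
    ("Paris", "en", "instance_of", "city"),
    ("Paris", "es", "instance_of", "city"),
]


def build_table(rows):
    """Load the sample rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki (entity TEXT, lang TEXT, predicate TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO wiki VALUES (?, ?, ?, ?)", rows)
    return conn


def persons_in_all_languages(conn, langs):
    """Return entities tagged as persons in every language of `langs`."""
    placeholders = ", ".join("?" for _ in langs)
    query = f"""
        SELECT entity
        FROM wiki
        WHERE predicate = 'instance_of'
          AND value = 'person'
          AND lang IN ({placeholders})
        GROUP BY entity
        HAVING COUNT(DISTINCT lang) = ?
        ORDER BY entity
    """
    return [row[0] for row in conn.execute(query, (*langs, len(langs)))]


if __name__ == "__main__":
    conn = build_table(ROWS)
    print(persons_in_all_languages(conn, ["fr", "en", "es"]))
```

With the sample rows, only `Victor_Hugo` is returned for the three languages, since `Marie_Curie` has no Spanish row and `Paris` is not tagged as a person. In a Spark setting, the same `GROUP BY ... HAVING` pattern would presumably run against a table registered from the Parquet files rather than an in-memory SQLite table.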
Embed
Wikipedia Quality
Klang, Marcus; Nugues, Pierre. (2016). "[[Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format]]". European Language Resources Association (ELRA).
English Wikipedia
{{cite journal |last1=Klang |first1=Marcus |last2=Nugues |first2=Pierre |title=Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format |date=2016 |url=https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format |journal=European Language Resources Association (ELRA)}}
HTML
Klang, Marcus; Nugues, Pierre. (2016). "<a href="https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format">Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format</a>". European Language Resources Association (ELRA).