Difference between revisions of "Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format"

From Wikipedia Quality
(Infobox work)
(+ Embed)
 
== Overview ==
 
Wikipedia has become one of the most popular resources in [[natural language processing]] and is used in a large number of applications. However, [[Wikipedia]] requires substantial preprocessing before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs a specific parser for each language: English, French, Italian, etc. In addition, the different Wikipedia resources (main article text, [[categories]], Wikidata, [[infoboxes]]) are scattered within the article document or across different files, which makes it difficult to get a global view of this outstanding resource. In this paper, the authors describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries that extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six [[language versions]] and is potentially extensible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from http://semantica.cs.lth.se/wikiparq/.
 

== Embed ==
=== Wikipedia Quality ===
<code>
<nowiki>
Klang, Marcus; Nugues, Pierre. (2016). "[[Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format]]". European Language Resources Association (ELRA).
</nowiki>
</code>

=== English Wikipedia ===
<code>
<nowiki>
{{cite journal |last1=Klang |first1=Marcus |last2=Nugues |first2=Pierre |title=Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format |date=2016 |url=https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format |journal=European Language Resources Association (ELRA)}}
</nowiki>
</code>

=== HTML ===
<code>
<nowiki>
Klang, Marcus; Nugues, Pierre. (2016). &amp;quot;<a href="https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format">Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format</a>&amp;quot;. European Language Resources Association (ELRA).
</nowiki>
</code>

Revision as of 21:07, 22 September 2020


Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format
Authors: Marcus Klang, Pierre Nugues
Publication date: 2016
Links: Original

Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format is a scientific work related to Wikipedia quality, published in 2016 and written by Marcus Klang and Pierre Nugues.

Overview

Wikipedia has become one of the most popular resources in natural language processing and is used in a large number of applications. However, Wikipedia requires substantial preprocessing before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and needs a specific parser for each language: English, French, Italian, etc. In addition, the different Wikipedia resources (main article text, categories, Wikidata, infoboxes) are scattered within the article document or across different files, which makes it difficult to get a global view of this outstanding resource. In this paper, the authors describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries that extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extensible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from http://semantica.cs.lth.se/wikiparq/.
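
The three example queries mentioned in the overview can be illustrated with a small sketch. This is not the WikiParq schema or the authors' code: the overview does not give column names, so the table facts(entity, lang, predicate, value), the predicate names first_paragraph and instance_of, and the sample rows are all hypothetical. To keep the example runnable without a cluster, it also swaps Parquet and Spark for an in-memory SQLite database; only the idea of querying a tabulated Wikipedia with SQL is carried over.

```python
import sqlite3

# Hypothetical tabulated layout: one row per fact about an entity.
# lang is None for language-independent facts (assumed, not from the paper).
SAMPLE_ROWS = [
    ("Q1", "fr", "first_paragraph", "Paris est la capitale de la France."),
    ("Q1", "en", "first_paragraph", "Paris is the capital of France."),
    ("Q2", "fr", "first_paragraph", "Victor Hugo est un écrivain français."),
    ("Q2", "en", "first_paragraph", "Victor Hugo was a French writer."),
    ("Q2", "es", "first_paragraph", "Victor Hugo fue un escritor francés."),
    ("Q2", None, "instance_of", "person"),
    ("Q3", "es", "first_paragraph", "Miguel de Cervantes fue un novelista español."),
    ("Q3", None, "instance_of", "person"),
]


def build_demo_db():
    """Create an in-memory table holding the hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts (entity TEXT, lang TEXT, predicate TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def french_first_paragraphs(conn):
    """All first paragraphs of the articles in French."""
    return conn.execute(
        "SELECT entity, value FROM facts "
        "WHERE lang = 'fr' AND predicate = 'first_paragraph' "
        "ORDER BY entity"
    ).fetchall()


def persons_in_spanish(conn):
    """All articles on persons that exist in Spanish."""
    return conn.execute(
        "SELECT t.entity, t.value FROM facts t "
        "JOIN facts p ON p.entity = t.entity "
        "WHERE p.predicate = 'instance_of' AND p.value = 'person' "
        "AND t.lang = 'es' AND t.predicate = 'first_paragraph' "
        "ORDER BY t.entity"
    ).fetchall()


def trilingual_persons(conn):
    """Persons that have versions in French, English, and Spanish."""
    rows = conn.execute(
        "SELECT t.entity FROM facts t "
        "JOIN facts p ON p.entity = t.entity "
        "WHERE p.predicate = 'instance_of' AND p.value = 'person' "
        "AND t.predicate = 'first_paragraph' "
        "AND t.lang IN ('fr', 'en', 'es') "
        "GROUP BY t.entity "
        "HAVING COUNT(DISTINCT t.lang) = 3 "
        "ORDER BY t.entity"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(french_first_paragraphs(conn))
    print(persons_in_spanish(conn))
    print(trilingual_persons(conn))
```

With the real resource, the same kind of SQL would presumably be run through Spark over the downloaded Parquet files rather than through SQLite, with table and column names taken from the WikiParq documentation.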

Embed

Wikipedia Quality

Klang, Marcus; Nugues, Pierre. (2016). "[[Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format]]". European Language Resources Association (ELRA).

English Wikipedia

{{cite journal |last1=Klang |first1=Marcus |last2=Nugues |first2=Pierre |title=Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format |date=2016 |url=https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format |journal=European Language Resources Association (ELRA)}}

HTML

Klang, Marcus; Nugues, Pierre. (2016). &quot;<a href="https://wikipediaquality.com/wiki/Wikiparq:_a_Tabulated_Wikipedia_Resource_Using_the_Parquet_Format">Wikiparq: a Tabulated Wikipedia Resource Using the Parquet Format</a>&quot;. European Language Resources Association (ELRA).