Difference between revisions of "Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia"

(Infobox)
(+ embed code)
Line 10: Line 10:
 
== Overview ==
 
Parallel corpora are essential resources for certain [[Natural Language Processing]] tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire [[Wikipedia]] collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish. The authors' work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The existing text was stripped of the specific mark-up, cleaned of non-textual entries like images or tables, and sentence-split. Then, corresponding documents for the above-mentioned pairs of languages were identified using the [[cross-lingual]] Wikipedia links embedded within the documents themselves. Considering them comparable documents, the authors further employed a publicly available tool named LEXACC, developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns a score to each extracted pair, which is a measure of the degree of parallelism between the two sentences in the pair. These scores allow researchers to select only those sentences having a certain degree of parallelism suited for their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki
 
+
+ == Embed ==
+ === Wikipedia Quality ===
+ <code>
+ <nowiki>
+ Ştefănescu, Dan; Ion, Radu. (2013). "[[Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia]]".
+ </nowiki>
+ </code>
+
+ === English Wikipedia ===
+ <code>
+ <nowiki>
+ {{cite journal |last1=Ştefănescu |first1=Dan |last2=Ion |first2=Radu |title=Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia |date=2013 |url=https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia}}
+ </nowiki>
+ </code>
+
+ === HTML ===
+ <code>
+ <nowiki>
+ Ştefănescu, Dan; Ion, Radu. (2013). &amp;quot;<a href="https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia">Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia</a>&amp;quot;.
+ </nowiki>
+ </code>

Revision as of 09:24, 2 May 2020


Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia
Authors
Dan Ştefănescu
Radu Ion
Publication date
2013
Links
Original Preprint

Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia is a scientific work related to Wikipedia quality, published in 2013 and written by Dan Ştefănescu and Radu Ion.

Overview

Parallel corpora are essential resources for certain Natural Language Processing tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish. The authors' work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The existing text was stripped of the specific mark-up, cleaned of non-textual entries like images or tables, and sentence-split. Then, corresponding documents for the above-mentioned pairs of languages were identified using the cross-lingual Wikipedia links embedded within the documents themselves. Considering them comparable documents, the authors further employed a publicly available tool named LEXACC, developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns a score to each extracted pair, which is a measure of the degree of parallelism between the two sentences in the pair. These scores allow researchers to select only those sentences having a certain degree of parallelism suited for their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki
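
The score-based selection described above can be illustrated with a minimal sketch. It is not the authors' code and does not call LEXACC. It assumes, purely for illustration, that the scored pairs are stored one per line as `source<TAB>target<TAB>score` and that a higher score means a more parallel pair; the actual file layout and score range of the released resource are not stated here. The names `ScoredPair`, `parse_pairs` and `select_pairs`, the threshold value and the sample sentences are all hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ScoredPair(NamedTuple):
    """One candidate sentence pair with its parallelism score (hypothetical layout)."""
    source: str
    target: str
    score: float


def parse_pairs(lines: Iterable[str]) -> Iterator[ScoredPair]:
    """Parse assumed 'source<TAB>target<TAB>score' lines, skipping malformed ones."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not exactly three tab-separated fields
        try:
            score = float(parts[2])
        except ValueError:
            continue  # score field is not a number
        yield ScoredPair(parts[0], parts[1], score)


def select_pairs(pairs: Iterable[ScoredPair], min_score: float) -> List[ScoredPair]:
    """Keep only the pairs whose score is at least min_score."""
    return [pair for pair in pairs if pair.score >= min_score]


# Made-up sample data, not taken from the actual resource.
sample = [
    "The cat sleeps.\tDie Katze schläft.\t0.91\n",
    "It was founded in 1850.\tDie Stadt liegt im Norden.\t0.22\n",
    "malformed line\n",
]

# The threshold is an arbitrary illustrative value.
selected = select_pairs(parse_pairs(sample), min_score=0.5)
for pair in selected:
    print(f"{pair.score:.2f}\t{pair.source}\t{pair.target}")
```

A stricter threshold keeps fewer but more reliably parallel pairs, while a looser one keeps more pairs at the cost of noise; which trade-off is appropriate depends on the intended use.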

Embed

Wikipedia Quality

Ştefănescu, Dan; Ion, Radu. (2013). "[[Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia]]".

English Wikipedia

{{cite journal |last1=Ştefănescu |first1=Dan |last2=Ion |first2=Radu |title=Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia |date=2013 |url=https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia}}

HTML

Ştefănescu, Dan; Ion, Radu. (2013). &quot;<a href="https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia">Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia</a>&quot;.