Difference between revisions of "Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia"

Revision as of 09:24, 2 May 2020

Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia
Authors	Dan Ştefănescu, Radu Ion
Publication date	2013
Links	Original Preprint

Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia is a scientific work related to Wikipedia quality, published in 2013 and written by Dan Ştefănescu and Radu Ion.

Overview

Parallel corpora are essential resources for certain Natural Language Processing tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish. The authors' work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The text was stripped of wiki mark-up, cleaned of non-textual entries such as images or tables, and split into sentences. Then, corresponding documents for the above-mentioned language pairs were identified using the cross-lingual Wikipedia links embedded within the documents themselves. Treating these as comparable documents, the authors then used a publicly available tool named LEXACC, developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns a score to each extracted pair, which measures the degree of parallelism between the two sentences in the pair. These scores allow researchers to select only those sentences whose degree of parallelism suits their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki
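The score-based selection described above might look roughly like the following sketch. It assumes, purely for illustration, that the scored pairs are stored one per line as tab-separated `source`, `target` and `score` fields; the actual file layout of the Parallel-Wiki resource and of LEXACC output is not described here, so the format and the names `ScoredPair`, `parse_scored_pairs` and `filter_by_score` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ScoredPair(NamedTuple):
    """One candidate sentence pair with its parallelism score."""
    source: str
    target: str
    score: float


def parse_scored_pairs(lines: Iterable[str]) -> Iterator[ScoredPair]:
    """Parse lines of the ASSUMED form 'source<TAB>target<TAB>score'.

    Blank or malformed lines are skipped rather than raising an error.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not exactly three fields: skip
        source, target, raw_score = parts
        try:
            score = float(raw_score)
        except ValueError:
            continue  # score field is not a number: skip
        yield ScoredPair(source, target, score)


def filter_by_score(pairs: Iterable[ScoredPair], min_score: float) -> List[ScoredPair]:
    """Keep only the pairs whose score is at least min_score."""
    return [pair for pair in pairs if pair.score >= min_score]


if __name__ == "__main__":
    # Made-up sample lines, not taken from the actual resource.
    sample = [
        "The cat sleeps.\tDie Katze schläft.\t0.82\n",
        "Hello.\tEin ganz anderer Satz.\t0.15\n",
    ]
    for pair in filter_by_score(parse_scored_pairs(sample), min_score=0.5):
        print(f"{pair.score:.2f}\t{pair.source}\t{pair.target}")
```

A higher `min_score` would favour precision (fewer but cleaner pairs), while a lower one would favour coverage; the right threshold depends on the intended use and is not specified in the abstract.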

Embed

Wikipedia Quality

Ştefănescu, Dan; Ion, Radu. (2013). "[[Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia]]".

English Wikipedia

{{cite journal |last1=Ştefănescu |first1=Dan |last2=Ion |first2=Radu |title=Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia |date=2013 |url=https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia}}

HTML

Ştefănescu, Dan; Ion, Radu. (2013). "<a href="https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia">Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia</a>".

@@ Line 10: / Line 10: @@
 == Overview ==
 Parallel corpora are essential resources for certain [[Natural Language Processing]] tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents a collection of parallel sentences extracted from the entire [[Wikipedia]] collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish. The authors' work began with the processing of the publicly available Wikipedia static dumps for the three languages involved. The text was stripped of wiki mark-up, cleaned of non-textual entries such as images or tables, and split into sentences. Then, corresponding documents for the above-mentioned language pairs were identified using the [[cross-lingual]] Wikipedia links embedded within the documents themselves. Treating these as comparable documents, the authors then used a publicly available tool named LEXACC, developed during the ACCURAT project, to extract parallel sentences from the preprocessed data. LEXACC assigns a score to each extracted pair, which measures the degree of parallelism between the two sentences in the pair. These scores allow researchers to select only those sentences whose degree of parallelism suits their intended purposes. This resource is publicly available at: http://ws.racai.ro:9191/repository/search/?q=Parallel+Wiki
+== Embed ==
+=== Wikipedia Quality ===
+<code>
+<nowiki>
+Ştefănescu, Dan; Ion, Radu. (2013). "[[Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia]]".
+</nowiki>
+</code>
+=== English Wikipedia ===
+<code>
+<nowiki>
+{{cite journal |last1=Ştefănescu |first1=Dan |last2=Ion |first2=Radu |title=Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia |date=2013 |url=https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia}}
+</nowiki>
+</code>
+=== HTML ===
+<code>
+<nowiki>
+Ştefănescu, Dan; Ion, Radu. (2013). &quot;<a href="https://wikipediaquality.com/wiki/Parallel-Wiki:_a_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia">Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia</a>&quot;.
+</nowiki>
+</code>
