Creating Indonesian-Javanese Parallel Corpora Using Wikipedia Articles

Creating Indonesian-Javanese Parallel Corpora Using Wikipedia Articles is a scientific work related to Wikipedia quality, published in 2014 and written by Bayu Distiawan Trisedya and Dyah Inastra.

Overview

Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages. Parallel corpora of ethnic languages have usually been collected manually. Wikipedia, on the other hand, is a free online encyclopedia that supports more languages each year, including ethnic languages of Indonesia. It has become one of the largest multilingual sites on the World Wide Web and provides freely distributed articles. In this paper, the authors explore several sentence alignment methods that have previously been used in other domains. They examine whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. The authors apply two approaches to sentence alignment, treating Wikipedia as both a parallel corpus and a comparable corpus. In the parallel corpus case, they use sentence-length-based and word-correspondence methods. In the comparable corpus case, they use the characteristics of Wikipedia's hypertext links. The experiments indicate that Wikipedia is useful enough for this purpose, because both approaches gave positive results.
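
To illustrate the general idea behind sentence-length-based alignment, the following is a minimal sketch and not the authors' implementation. It assumes that translated sentences tend to have similar character lengths and uses dynamic programming over one-to-one matches and skips. The function names and the SKIP_COST value are hypothetical choices made for this example. The cost function is a simple relative length difference, not the probabilistic model used in published length-based aligners.

```python
SKIP_COST = 0.6  # assumed penalty for leaving a sentence unaligned


def length_cost(a, b):
    """Relative character-length difference in [0, 1]; 0 means equal length."""
    la, lb = len(a), len(b)
    return abs(la - lb) / max(la, lb, 1)


def align_by_length(src, tgt):
    """Return 1-1 sentence pairs chosen by minimum total length cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            # Candidate moves: match src[i-1] with tgt[j-1], or skip one side.
            moves = []
            if i > 0 and j > 0:
                step = length_cost(src[i - 1], tgt[j - 1])
                moves.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:
                moves.append((cost[i - 1][j] + SKIP_COST, (i - 1, j)))
            if j > 0:
                moves.append((cost[i][j - 1] + SKIP_COST, (i, j - 1)))
            for c, prev in moves:
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, prev

    # Walk back from the end, keeping only the diagonal (matched) steps.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i - 1], tgt[j - 1]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Calling align_by_length with the sentence list of an Indonesian article and the sentence list of its Javanese counterpart returns candidate sentence pairs in document order. Sentences whose lengths fit no counterpart are skipped. In practice a length-only criterion is a rough filter, which is consistent with the paper combining it with a word-correspondence method.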
