Language Independent Identification of Parallel Sentences Using Wikipedia


Language Independent Identification of Parallel Sentences Using Wikipedia is a scientific work related to Wikipedia quality, published in 2011 and written by Rohit G. Bharadwaj and Vasudeva Varma.

Overview

This paper details a novel classification-based approach to identifying parallel sentences between two languages in a language-independent way. The authors replace the language-specific resources that such a task normally requires with Wikipedia, a richly structured multilingual resource. The approach is particularly useful for extracting parallel sentences for under-resourced languages, such as most Indian and African languages, for which resources of sufficient accuracy are not readily available. The authors extract various statistics based on the cross-lingual links present in Wikipedia and use them to generate a feature vector for each sentence pair. Each pair of sentences is then classified as parallel or non-parallel by a binary classifier operating on these feature vectors. The authors achieved a precision of up to 78%, which is encouraging when compared with other state-of-the-art approaches. These results support the hypothesis that Wikipedia can be used to evaluate the parallel coefficient between sentences, which in turn can be used to build bilingual dictionaries.
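
The pipeline described above (link-based statistics, a feature vector per sentence pair, then a binary decision) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon `LINK_LEXICON`, the three features, the hand-picked weights and the threshold are all assumptions made for illustration. In practice the lexicon would be derived from Wikipedia's cross-lingual links, and the fixed linear rule would be replaced by a classifier trained on labelled sentence pairs.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical toy lexicon (an assumption, not data from the paper):
# source-language word -> target-language words, of the kind that could be
# derived from cross-lingual links between Wikipedia article titles.
LINK_LEXICON: Dict[str, Set[str]] = {
    "india": {"bharat"},
    "capital": {"rajdhani"},
    "river": {"nadi"},
}


def pair_features(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> List[float]:
    """Build an illustrative feature vector for one sentence pair.

    Features (all assumptions chosen for this sketch):
      0. fraction of source tokens with a linked translation in the target
      1. fraction of target tokens that are a linked translation of a source token
      2. length ratio (shorter sentence / longer sentence, in tokens)
    """
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    if not src_tokens or not tgt_tokens:
        return [0.0, 0.0, 0.0]

    tgt_set = set(tgt_tokens)
    # Source tokens whose linked translations intersect the target sentence.
    src_hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)

    # All target words reachable from any source token via the lexicon.
    linked: Set[str] = set()
    for w in src_tokens:
        linked |= lexicon.get(w, set())
    tgt_hits = sum(1 for w in tgt_tokens if w in linked)

    length_ratio = min(len(src_tokens), len(tgt_tokens)) / max(
        len(src_tokens), len(tgt_tokens)
    )
    return [src_hits / len(src_tokens), tgt_hits / len(tgt_tokens), length_ratio]


def is_parallel(
    features: Sequence[float],
    weights: Sequence[float] = (0.4, 0.4, 0.2),  # placeholder weights, not learned
    threshold: float = 0.5,                      # placeholder decision threshold
) -> bool:
    """Stand-in for a trained binary classifier: a fixed linear decision rule."""
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold


if __name__ == "__main__":
    good = pair_features("India capital river", "bharat rajdhani nadi", LINK_LEXICON)
    bad = pair_features("India capital river", "mausam accha hai", LINK_LEXICON)
    print(good, is_parallel(good))
    print(bad, is_parallel(bad))
```

The sketch uses only surface tokens and a link-derived lexicon, which mirrors the language-independent intent of the approach: no parser, tagger or hand-built bilingual dictionary is needed for either language.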