Revision as of 08:54, 4 September 2019
Authors | Marcos Antonio Mouriño García, Roberto Pérez Rodríguez, Luis Anido Rifón |
---|---|
Publication date | 2017 |
DOI | 10.1016/j.ins.2017.04.024 |
Links | Original: https://dl.acm.org/citation.cfm?id=3096590 |
Wikipedia-Based Cross-Language Text Classification is a scientific work related to Wikipedia quality, published in 2017 and written by Marcos Antonio Mouriño García, Roberto Pérez Rodríguez and Luis Anido Rifón.
Overview
This paper applies a Wikipedia-based bag-of-concepts (WikiBoC) document representation to cross-language text classification (CLTC). Its main objective is to alleviate the major drawbacks of state-of-the-art CLTC approaches, which typically rely on machine translation (MT) of documents that are represented as bags of words (BoW). The authors propose a technique called cross-language concept matching (CLCM), which converts concept-based representations of documents from one language to another using Wikipedia correspondences between concepts in different languages, and therefore does not rely on automated full-text translation.

The authors describe two proposals. The first uses the WikiBoC representation together with the CLCM technique (WikiBoC-CLCM) to classify documents written in a language L1 with an SVM algorithm that was trained on documents written in another language L2. The second is a hybrid model for representing documents that combines WikiBoC-CLCM with the classic BoW-MT approach.

To evaluate the two proposals, the authors conducted several experiments with three cross-lingual corpora: the JRC-Acquis corpus and two purpose-built corpora composed of Wikipedia articles. The first proposal outperforms state-of-the-art approaches when training sequences are short, achieving performance increases of up to 233.33%. The second proposal outperforms state-of-the-art approaches over the whole range of training sequences, achieving performance increases of up to 23.78%. The results show the benefits of the WikiBoC-CLCM approach: concepts extracted from documents add useful information to the classifier and thus improve its performance.
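The idea behind CLCM and the hybrid representation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the link table `ES_TO_EN`, the function names `clcm` and `hybrid`, the feature prefixes, and the toy weights are all assumptions made for illustration. Concept extraction, concept weighting, machine translation and SVM training are outside the sketch, which only shows how a bag of concepts might be carried from one language to another through Wikipedia cross-language correspondences and then merged with a translated bag of words.

```python
from collections import Counter

# Hypothetical interlanguage link table (Spanish title -> English title).
# In practice these correspondences would come from Wikipedia itself;
# the three entries below are toy examples.
ES_TO_EN = {
    "Aprendizaje automático": "Machine learning",
    "Traducción automática": "Machine translation",
    "Corpus lingüístico": "Text corpus",
}


def clcm(concepts, links):
    """Map a bag of concepts into the target language.

    Concepts that have no counterpart in `links` are dropped
    (an assumption of this sketch).
    """
    mapped = Counter()
    for concept, weight in concepts.items():
        target = links.get(concept)
        if target is not None:
            mapped[target] += weight
    return mapped


def hybrid(boc, bow):
    """Merge a bag of concepts and a bag of words into one feature dict.

    Prefixes keep concept features and word features in separate namespaces.
    """
    features = {f"concept::{c}": w for c, w in boc.items()}
    features.update({f"word::{t}": w for t, w in bow.items()})
    return features


# Bag of concepts for a Spanish document (toy weights).
doc_es = Counter({
    "Aprendizaje automático": 2,
    "Traducción automática": 1,
    "Concepto sin enlace": 1,  # no cross-language link, so it is dropped
})

# WikiBoC-CLCM: the same document expressed with English concepts.
doc_en = clcm(doc_es, ES_TO_EN)

# Stand-in for the BoW-MT side: tokens of a machine-translated text.
bow_mt = Counter("machine translation of the document".split())

# Hybrid representation that a classifier such as an SVM could consume.
features = hybrid(doc_en, bow_mt)
```

Because the mapped concepts live in the same space as the concepts of the training language, a classifier trained on L2 documents could be applied to L1 documents without translating their full text, which is the property the overview attributes to WikiBoC-CLCM.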