Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids

From Wikipedia Quality
Revision as of 05:50, 14 June 2019 by Stella (talk | contribs) (+ links)

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids is a scientific work related to Wikipedia quality, published in 2014 and written by Marek Lipczak, Arash Koushkestani and Evangelos E. Milios.

Overview

This article presents Tulip, an entity recognition and disambiguation (ERD) system submitted to the ERD 2014: Entity Recognition and Disambiguation Challenge. The objective of the system is to spot mentions of entities in a document and link those mentions to the corresponding Freebase articles. To achieve this, Tulip prunes the set of entity candidates, focusing on a core subset of related entities that captures the context of the document. The strength of the relationship is measured as the similarity of an entity to a topic centroid generated from entity features. Each entity is represented by an accurate and compact feature vector extracted from a category graph built from 120 language versions of Wikipedia. Given the core set of accepted entities, Tulip uses the Wikipedia-based feature vectors to extract more related entities from the document text. Tulip received the first prize in the long document track with an F1 score of 0.74, which confirms the effectiveness of the system. At the same time, the system was faster than all other submissions, with latency under 0.29 seconds.
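The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the fixed threshold, the function names, and the toy feature vectors are all assumptions made for illustration, since the summary only states that relatedness is measured as similarity to a topic centroid built from entity features.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average sparse feature vectors (dict: feature -> weight) into one topic centroid."""
    total = defaultdict(float)
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] += weight
    n = len(vectors)
    return {f: w / n for f, w in total.items()} if n else {}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (assumed measure, not confirmed by the source)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_candidates(candidates, threshold=0.5):
    """Keep only the candidates whose vector is close enough to the document's topic centroid."""
    topic = centroid(list(candidates.values()))
    return {name for name, vec in candidates.items() if cosine(vec, topic) >= threshold}


# Toy, hand-made feature vectors (hypothetical; real vectors would come from
# the Wikipedia category graph).
candidates = {
    "Tulip (flower)": {"plants": 1.0, "gardening": 0.5},
    "Rose": {"plants": 1.0, "gardening": 0.8},
    "Tulip (tower)": {"architecture": 1.0},
}

core_entities = prune_candidates(candidates, threshold=0.5)
print(sorted(core_entities))
```

In this toy document the two plant-related candidates dominate the centroid, so the unrelated sense "Tulip (tower)" falls below the assumed threshold and is pruned. A real system would also need the later step that uses the accepted core set to recover further related entities from the text.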