Dawt: Densely Annotated Wikipedia Texts Across Multiple Languages


Authors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu
Publication date: 2017
DOI: 10.1145/3041021.3053367

Dawt: Densely Annotated Wikipedia Texts Across Multiple Languages is a scientific work related to Wikipedia quality, published in 2017 and written by Nemanja Spasojevic, Preeti Bhargava and Guoning Hu.

Overview

In this work, the authors release the DAWT dataset: Densely Annotated Wikipedia Texts across multiple languages. The annotations consist of labeled text mentions mapped to entities (represented by their Freebase machine ids), together with the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than are present in the original Wikipedia markup. It spans several languages, including English, Spanish, Italian, German, French and Arabic. The authors also present the methodology used to generate the dataset, which enriches the Wikipedia markup in order to increase the number of links. In addition to the main dataset, the authors release several derived datasets, including mention entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. They also discuss two applications of these datasets and hope that releasing them will prove useful to the Natural Language Processing and Information Retrieval communities and facilitate multilingual research.
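
To make the idea of mention entity co-occurrence counts concrete, the following is a minimal Python sketch of how such counts could be tallied from annotated mentions and turned into a simple estimate of how often a mention refers to each entity. It is an illustration only, not the authors' pipeline: the record layout, the function names and the entity ids are all hypothetical placeholders, and the real DAWT file format may differ.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (surface mention, entity id).
# The ids are made-up placeholders, not real Freebase machine ids or DAWT data.
annotations = [
    ("apple", "/m/placeholder_company"),
    ("Apple", "/m/placeholder_company"),
    ("apple", "/m/placeholder_fruit"),
    ("paris", "/m/placeholder_city"),
]


def cooccurrence_counts(records):
    """Count how often each (lowercased) mention is linked to each entity id."""
    counts = defaultdict(Counter)
    for mention, entity_id in records:
        counts[mention.lower()][entity_id] += 1
    return counts


def entity_given_mention(counts, mention):
    """Relative frequency of each entity for a mention; empty dict if unseen."""
    per_entity = counts.get(mention.lower(), Counter())
    total = sum(per_entity.values())
    if total == 0:
        return {}
    return {entity_id: n / total for entity_id, n in per_entity.items()}


counts = cooccurrence_counts(annotations)
probs = entity_given_mention(counts, "apple")
```

Lowercasing the mention before counting is a design choice made for this sketch so that "apple" and "Apple" share one row; whether the published counts normalize case is not stated in the overview above.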

Embed

Wikipedia Quality

Spasojevic, Nemanja; Bhargava, Preeti; Hu, Guoning. (2017). "[[Dawt: Densely Annotated Wikipedia Texts Across Multiple Languages]]". International World Wide Web Conferences Steering Committee. DOI: 10.1145/3041021.3053367.

English Wikipedia

{{cite journal |last1=Spasojevic |first1=Nemanja |last2=Bhargava |first2=Preeti |last3=Hu |first3=Guoning |title=Dawt: Densely Annotated Wikipedia Texts Across Multiple Languages |date=2017 |doi=10.1145/3041021.3053367 |url=https://wikipediaquality.com/wiki/Dawt:_Densely_Annotated_Wikipedia_Texts_Across_Multiple_Languages |journal=International World Wide Web Conferences Steering Committee}}

HTML

Spasojevic, Nemanja; Bhargava, Preeti; Hu, Guoning. (2017). &quot;<a href="https://wikipediaquality.com/wiki/Dawt:_Densely_Annotated_Wikipedia_Texts_Across_Multiple_Languages">Dawt: Densely Annotated Wikipedia Texts Across Multiple Languages</a>&quot;. International World Wide Web Conferences Steering Committee. DOI: 10.1145/3041021.3053367.