Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia

'''Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia''' - a scientific work related to [[Wikipedia quality]], published in 2011 and written by [[Hannes Dohrn]] and [[Dirk Riehle]].
  
 
== Overview ==
The heart of each wiki, including [[Wikipedia]], is its content. Most machine processing starts and ends with this content. At present, such processing is limited, because most wiki engines today cannot provide a complete and precise representation of the wiki's content; they can only generate HTML. The main reason is the lack of well-defined parsers that can handle the complexity of modern wiki markup. This applies to MediaWiki, the software running Wikipedia, and to most other wiki engines. The paper shows why it has been so difficult to develop comprehensive parsers for wiki markup, and it presents the design and implementation of a parser for Wikitext, the wiki markup language of [[MediaWiki]]. The authors use parsing expression grammars, whereas most existing parsers use either no grammar at all or a grammar poorly suited to the task. Using this parser it is possible to directly and precisely query the structured data within wikis, including Wikipedia. The parser is available as [[open source]] from http://sweble.org.
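The design point singled out above is the grammar-based approach: a parsing expression grammar recognizes each construct with ordered choice and backtracking and builds a tree, instead of rewriting the page with successive text-replacement passes. The sketch below is a deliberately tiny, hypothetical Java illustration of that style, limited to internal links of the form <nowiki>[[Target]]</nowiki> and <nowiki>[[Target|Label]]</nowiki>. All class and method names here are invented for the example; they are not the Sweble parser's API, which covers the full Wikitext language and produces a complete abstract syntax tree.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;

// Toy PEG-style recursive-descent parser for internal wiki links.
// Illustrative only; not the Sweble implementation.
public class TinyWikiLinkParser {

    // Minimal AST node types produced by the parser.
    public interface Node {}
    public record Text(String value) implements Node {}
    public record InternalLink(String target, String label) implements Node {}

    private final String input;
    private int pos;

    public TinyWikiLinkParser(String input) {
        this.input = input;
    }

    public List<Node> parse() {
        List<Node> nodes = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (pos < input.length()) {
            int mark = pos;                 // PEG-style backtracking point
            InternalLink link = tryLink();  // ordered choice: try the link rule first...
            if (link != null) {
                if (text.length() > 0) {
                    nodes.add(new Text(text.toString()));
                    text.setLength(0);
                }
                nodes.add(link);
            } else {
                pos = mark;                 // ...otherwise backtrack and consume plain text
                text.append(input.charAt(pos++));
            }
        }
        if (text.length() > 0) {
            nodes.add(new Text(text.toString()));
        }
        return nodes;
    }

    // link <- "[[" body ("|" label)? "]]"
    private InternalLink tryLink() {
        if (!lookingAt("[[")) return null;
        pos += 2;
        int start = pos;
        while (pos < input.length() && !lookingAt("]]") && input.charAt(pos) != '\n') {
            pos++;
        }
        if (!lookingAt("]]")) return null;  // unterminated link: reject, caller backtracks
        String body = input.substring(start, pos);
        pos += 2;
        int pipe = body.indexOf('|');
        String target = pipe >= 0 ? body.substring(0, pipe) : body;
        String label  = pipe >= 0 ? body.substring(pipe + 1) : body;
        return new InternalLink(target, label);
    }

    private boolean lookingAt(String s) {
        return input.startsWith(s, pos);
    }

    public static void main(String[] args) {
        String wikitext = "See [[MediaWiki|the MediaWiki software]] and [[Wikipedia]].";
        for (Node n : new TinyWikiLinkParser(wikitext).parse()) {
            System.out.println(n);
        }
    }
}
</syntaxhighlight>

Run on the sample string, the sketch yields alternating Text and InternalLink nodes rather than an HTML string. That is the essence of the paper's argument: once recognition follows a well-defined grammar and yields a typed tree, the structured data inside a wiki page can be queried directly instead of being scraped back out of rendered HTML.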
