Python and Java libraries for parsing Wikipedia dump datasets


Here are some Python and Java parsers for working with Wikipedia dump data.

The English Wikipedia dump dataset can be downloaded here:

| Name and link | Principal author(s) | Language | Input | Output | Comments / other info | License |
| --- | --- | --- | --- | --- | --- | --- |
| Java API (Bliki engine) | axelclk | Java | Markup fragment | HTML, PDF | Java Wikipedia API (supports ParserFunctions, Lua/Scribunto…) | |
| Mylyn WikiText | David Green | Java | Local files | HTML, DocBook, Eclipse Help, DITA, extensible | Integration with Ant and Eclipse runtime | |
| Sweble Wikitext Parser | Hannes Dohrn | Java | Markup | Abstract syntax tree, XML, HTML | Claims to be very thorough. | Apache License 2.0 |
| Wikiforia | Marcus Klang | Java | XML dumps, markup | Text | Uses the AST output from Sweble Wikitext Parser internally to produce raw text. Can decompress and parse compressed multistream XML dumps in parallel. | GPLv2 |
| JAMWiki | Ryan | Java | JAMWiki front-end | HTML | Java wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export that will be compatible with MediaWiki. | |
| JWPL API | Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann | Java | XML dump | API to access pages, outlinks, inlinks and more | “JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia.” “JWPL is for you: If you need structured access to Wikipedia in Java.” The older parser is no longer maintained; JWPL uses Sweble now. | |
| XWiki | XWiki dev team | Java | Various wiki markups | Well-formed sequence of events, HTML/XHTML, other wiki markups | XWiki can be used as a full-fledged wiki supporting several wiki markups (including MediaWiki’s markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing and rendering wiki markups. It cannot output to MediaWiki format as of 2016/03, though. | |
| YaCy | YaCy dev team | Java | XML dump | XML with Dublin Core metadata | YaCy is a search engine, and a MediaWiki parser is included as one of the import modules. MediaWiki XML dumps are first converted to Dublin Core XML as an intermediate format and then inserted into the search index using the built-in Dublin Core importer. | GPL |
| wikitextparser | 5j9 | Python | Markup | AST | Provides several accessor methods in an object tree to navigate to structural elements such as sections, tables and links. Supports extracting table data as a list of lists. Available via pip, supports Python 3. | GPL |
| Wikipedia Dump Reader | Benjamin Thyreau | Python | XML dumps | On screen | Cross-platform viewer | GPLv2/~BSD license |
| mw2html | Connelly Barnes | Python | Wiki URL | HTML | Minimal setup; gets the basic job of creating a static copy of the wiki done | |
| WikiExtractor | Giuseppe Attardi, Antonio Fuschetto | Python | XML dumps | Text | Simple and fast tool for extracting plain text from Wikipedia dumps. It performs template expansion and handles parser functions (core and extended). | GPL |
| wik2dict | Guaka | Python | SQL dump | DICT | | |
| | Marcus Brinkmann | Python | Markup | AST, HTML | Stateful PEG parser based on Grako, with a very clean separation of parsing stages, grammars and semantic transformations | BSD |
| mediawiki-parser | Peter Potrowl, Erik Rose | Python | Markup | XHTML, raw text, AST | GSoC 2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet. | |
| mwparserfromhell | The Earwig | Python | Markup | AST | A Python library that converts wiki markup into a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, with no dependencies. | MIT License |
| WikiPDF | Felipe Sanches | Python (and PHP) | One selected article | LaTeX based on templates, PDF | MediaWiki extension that uses Stephan Walter’s wiki2pdf as backend. | |
| wiki2pdf | Stephan Walter | Python (and PHP) | Markup fragment or set of online articles | LaTeX, PDF | Project is incomplete and dormant | |
| mwlib | | Python with C library | Markup and other | Parse tree, HTML, PDF, XML, OpenDocument | Part of a cooperation between the Wikimedia Foundation and PediaPress | BSD |
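
Several of the tools listed above take XML dumps as input and hand the wiki markup of each page to a parser. To give a feel for that first step, here is a minimal sketch that uses only the Python standard library to stream the page elements out of a MediaWiki XML export. It is not how any of the listed tools is implemented. The helper names (local_name, iter_pages, open_dump) and the tiny in-memory sample are assumptions made for illustration, and the sketch assumes the usual export layout of page elements that each contain a title and a revision with a text element.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in an export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children to limit memory use


def open_dump(path):
    """Open a dump file, decompressing .bz2 on the fly (hypothetical helper)."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


# Tiny in-memory sample standing in for a real dump file (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[page]].</text></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><text>{{stub}}</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a real dump, something like iter_pages(open_dump(path)) would replace the in-memory sample, with path pointing at a downloaded file. The text yielded for each page is still raw wiki markup, so turning it into plain text, HTML or an AST is the job of one of the markup parsers in the table.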