Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.
JWPL (Java Wikipedia Library) is a open-source, Java-based application programming interface that allows to access all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper
JWPL contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts using MediaWiki markup.
Further, JWPL contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org
In addition to that, JWPL now contains the Wikipedia Revision Toolkit, which consists of two tools, the TimeMachine and the RevisionMachine. The TimeMachine can be used to reconstruct a snapshot of Wikipedia from a specific date, or to create multiple snapshots from a time span. The RevisionMachine offers efficient access to the edit history of Wikipedia articles while storing the revisions in a dedicated storage format which decreases the demand of storage space by 98%. The toolkit is described in our ACL system demonstration paper.
Downloads and further instructions how to add JWPL or the Wikipedia Revision Toolkit as Maven dependencies to your project can be found on the GitHub Project Site
The source code is provided under the LGPL v3.
- Iryna Gurevych (Principal Investigator)
- Torsten Zesch (JWPL)
- Oliver Ferschke (Wikipedia Revision Toolkit)