DKPro Core 1.3.0 released

2012/03/18

We are pleased to announce the release of DKPro Core 1.3.0 – a collection of software components for natural language processing (NLP) based on the Apache UIMA framework.

DKPro Core ASL

  • Fixed several issues with DocumentMetaData.
  • Renamed features of some DKPro types so that they start with a lower-case letter. This breaks XMI file compatibility with older DKPro versions.
  • Added new base class JCasFileWriter_ImplBase.
  • Added reader for British National Corpus.
  • Added reader and writer for IMS Open Corpus Workbench.
  • ImsCwbWriter can use a local CWB installation to directly write the index format.
  • Can now read NeGra Export Format version 3 files (TIGER Corpus).
  • Added a TEI reader, for now mainly to read text from the TEI version of the Brown Corpus.
  • New Web1TFormatWriter which uses an external sort mechanism to support larger n-gram models.
  • Upgraded to JWPL 0.9.0.
  • Added WikipediaPageReader, which reads articles and discussion pages.
  • Upgraded to TT4J 1.0.16 to support the Chinese model.
  • TreeTagger module works with any model now, even if no mapping is provided.

DKPro Core GPL

  • Updated to CoreNLP 1.3.0.
  • Parser models need to be regenerated using the build.xml script. The old models cause a NullPointerException with the new version of the parser included in CoreNLP 1.3.0.

In addition, there have been a number of bug fixes and enhancements. For a more complete overview, see the DKPro Core ASL issue tracker and the DKPro Core GPL issue tracker.

DKPro Core consists of a number of pre-processing components for NLP tasks, often wrapping existing libraries or tools for easy use in a UIMA pipeline:

  • tokenization/segmentation
  • compound splitting (Banana Split, JWordSplitter)
  • stemming (Snowball)
  • part-of-speech tagging (TreeTagger)
  • parsing (Stanford Parser)
  • language identification (TextCat)
  • spelling correction (Jazzy)
  • IO support for various data types (text, XML, PDF, WSDL, Wikipedia, …)

A basic UIMA type system is provided with which all of the components work out-of-the-box. Some components can be configured for use with other type systems.

DKPro Core builds heavily on uimaFIT, making use of features such as injection of configuration parameters and automatic type detection. Because using DKPro in Java code with uimaFIT is so easy, we do not provide traditional UIMA XML descriptors for our analysis engines, readers and consumers – only for the type systems.
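To give an idea of what injection of configuration parameters means, here is a minimal sketch in plain Java. It does not use uimaFIT or DKPro Core: the `Param` annotation, the `DemoTagger` component and the `inject` helper are hypothetical names invented for this illustration, and the real uimaFIT mechanism is considerably richer. The sketch only shows the general idea of binding name/value pairs to annotated fields via reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Illustration only: none of these types belong to uimaFIT or DKPro Core.
class ParamInjectionDemo {

    /** Hypothetical annotation naming the parameter a field is bound to. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Param {
        String name();
    }

    /** Hypothetical component with two configurable fields and defaults. */
    static class DemoTagger {
        static final String PARAM_LANGUAGE = "language";
        static final String PARAM_LOWERCASE = "lowercase";

        @Param(name = PARAM_LANGUAGE)
        private String language = "en";

        @Param(name = PARAM_LOWERCASE)
        private boolean lowercase = false;

        String describe() {
            return "language=" + language + ", lowercase=" + lowercase;
        }
    }

    /** Copies name/value pairs into the matching annotated fields of the target. */
    static <T> T inject(T target, Object... nameValuePairs) {
        if (nameValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("Expected name/value pairs");
        }
        Map<String, Object> values = new HashMap<String, Object>();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            values.put((String) nameValuePairs[i], nameValuePairs[i + 1]);
        }
        for (Field field : target.getClass().getDeclaredFields()) {
            Param param = field.getAnnotation(Param.class);
            if (param == null || !values.containsKey(param.name())) {
                continue; // not configurable, or no value given: keep the default
            }
            try {
                field.setAccessible(true);
                field.set(target, values.get(param.name()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set " + field.getName(), e);
            }
        }
        return target;
    }

    public static void main(String[] args) {
        DemoTagger tagger = inject(new DemoTagger(),
                DemoTagger.PARAM_LANGUAGE, "de",
                DemoTagger.PARAM_LOWERCASE, true);
        System.out.println(tagger.describe()); // prints "language=de, lowercase=true"
    }
}
```

Parameters that are not passed keep their field defaults, so a component configured this way needs no separate descriptor file to be usable from Java code.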

We offer two sets of components with DKPro Core:

  • DKPro Core ASL provides components under the Apache Software License 2.0
  • DKPro Core GPL provides components under the GNU General Public License 3.0

DKPro Core is meant to be used with Apache Maven. We host a public Maven repository containing DKPro Core ASL, DKPro Core GPL and all their dependencies. You can also obtain JARs for individual components from that repository. For non-Maven users, we offer ZIP archives downloadable from Google Code.

This project was initiated by the Ubiquitous Knowledge Processing Lab (UKP) at the Technische Universität Darmstadt, Germany, under the auspices of Prof. Dr. Iryna Gurevych. All former and current members of the UKP Lab have contributed in code, as testers or in spirit to this project. It constitutes an essential cornerstone for our research environment at the UKP Lab.

DKPro Core requires Java 1.6, UIMA 2.4.0 and uimaFIT 1.3.0 (amongst other component-specific dependencies).

An introduction to DKPro Core can be found in the project wiki.

Please direct your questions and suggestions to dkpro-core-user@googlegroups.com.

If you wish to be notified about new releases, please subscribe to the announcements mailing list.