Guiding Theme B3: Data-driven paraphrasing and harmonization of language style

When a summary is compiled from heterogeneous sources, it must be homogenized with respect to style, genre, and text quality. For example, an executive summary requires different vocabulary and style than a topical survey for internal use in a research group. In this guiding theme, we use paraphrasing techniques and language modelling to measure and unify stylistic characteristics. To this end, we aim to combine rich statistical models with deep models.

Research results of the first Ph.D. cohort

The main focus of the first phase was on paraphrasing systems. In particular, we adapted supervised approaches for English lexical substitution to German (Hintz and Biemann, 2015; Hintz, 2016). However, to scale to the requirements of multi-domain content, our main interest was to move towards unsupervised methods. In Hintz and Biemann (2016), we proposed a framework based on transfer learning that uses training and evaluation data from three different languages. This allowed us to learn a model that is independent of the underlying language. We also created a combined, focused web crawling system that automatically collects relevant documents and minimizes the amount of irrelevant web content (Remus and Biemann, 2016). The collected web data was then semantically processed in order to acquire rich in-domain knowledge. To extract relations in an unsupervised fashion, we used distributional similarities between pairs of nominals, with dependency paths as context (Levy et al., 2015).
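The idea of comparing pairs of nominals by the dependency paths that connect them can be illustrated with a minimal sketch. It is not the implementation used in the cited work: the cosine measure, the path strings, and the toy nominal pairs below are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def path_vector(observed_paths):
    """Count how often each dependency path connects a pair of nominals."""
    return Counter(observed_paths)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: each nominal pair maps to the dependency paths
# observed between its two members in a corpus.
contexts = {
    ("aspirin", "headache"): path_vector(
        ["X treats Y", "X relieves Y", "X treats Y"]
    ),
    ("ibuprofen", "fever"): path_vector(["X treats Y", "X reduces Y"]),
    ("Berlin", "Germany"): path_vector(["X capital_of Y", "X located_in Y"]),
}

# Pairs that share paths are likely to stand in a similar relation.
sim_related = cosine(
    contexts[("aspirin", "headache")], contexts[("ibuprofen", "fever")]
)
sim_unrelated = cosine(
    contexts[("aspirin", "headache")], contexts[("Berlin", "Germany")]
)
```

In this toy setting, the two medication pairs share the path "X treats Y" and therefore score higher than the medication pair compared with the city and country pair, which shares no path.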

Ongoing project of the second Ph.D. cohort

In the second phase, we continue and extend this work. With larger and larger amounts of content published every day, it is easy to miss the big picture. Populating weighted structured databases such as Markov logic networks from unstructured, noisy input sources such as text documents offers a powerful way to connect the dots. Such databases induce large, global statistical models that naturally go beyond single sentences, paragraphs, or documents by combining disjoint pieces of data.

However, these global models should also agree locally with, say, deep networks. In turn, the deep network has to fulfill global constraints imposed by the weighted structured database. For instance, when applied to a set S of labeled examples that is a member of relation r, the deep network should classify at least one mention in S as a positive instance of r. Moreover, since weighted structured databases induce very complex models, we want to make tractable inference a first-class citizen of them. To this end, we will extend sum-product networks (SPNs) and other deep probabilistic models to the relational case. This poses new challenges for scaling inference and learning in SPNs, but it also holds the potential to realize the next generation of deep learning that naturally quantifies its uncertainties. Overall, we aim to develop a unifying framework for statistical relational learning and deep learning.
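Two of the ideas above can be made concrete with a small sketch: the "at least one positive mention" constraint and tractable inference in a sum-product network. This is an illustration only, not the project's model. The independence assumption behind `at_least_one_positive`, the class names, and all weights and probabilities are hypothetical.

```python
from math import prod


def at_least_one_positive(mention_probs):
    """Probability that at least one mention in S is a positive instance,
    assuming (for illustration) that mentions are independent."""
    return 1.0 - prod(1.0 - p for p in mention_probs)


class Leaf:
    """Bernoulli leaf over one binary variable."""

    def __init__(self, var, p_true):
        self.var, self.p_true = var, p_true

    def value(self, evidence):
        x = evidence.get(self.var)
        if x is None:  # variable not observed: marginalise it out
            return 1.0
        return self.p_true if x else 1.0 - self.p_true


class Product:
    """Product node: children must have disjoint scopes."""

    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        return prod(c.value(evidence) for c in self.children)


class Sum:
    """Sum node: weighted mixture of children over the same scope."""

    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (weight, child)

    def value(self, evidence):
        return sum(w * c.value(evidence) for w, c in self.weighted_children)


# Hypothetical SPN over two binary variables A and B.
spn = Sum([
    (0.6, Product([Leaf("A", 0.9), Leaf("B", 0.2)])),
    (0.4, Product([Leaf("A", 0.1), Leaf("B", 0.7)])),
])

p_joint = spn.value({"A": True, "B": False})  # P(A=1, B=0)
p_marginal = spn.value({"A": True})           # P(A=1), B summed out
p_total = spn.value({})                       # normalisation check
p_constraint = at_least_one_positive([0.2, 0.5, 0.1])
```

Each query is answered in a single bottom-up pass over the network, so joint and marginal probabilities cost the same; this is the sense in which inference in such a model is tractable.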

The results of this guiding theme can be used to combine the trained models developed in area A with additional knowledge (e.g. domain knowledge) using probabilistic programs. The programs transform relational features into a weighted database with entities, relationships, and their provenance. Similarly, the results of this guiding theme can support B1, B2, D1, and D2 by combining the different building blocks in a single relational model.


  • PI (Second Cohort): Prof. Dr. Kristian Kersting
  • PI (First Cohort): Prof. Dr. Chris Biemann
  • First Cohort PhD student: Gerold Hintz
  • Second Cohort PhD student: Fabrizio Ventola


  • Alejandro Molina, Antonio Vergari, Nicola Di Mauro, Sriraam Natarajan, Floriana Esposito, Kristian Kersting. Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains. AAAI 2018.
  • Luc De Raedt, Kristian Kersting, Sriraam Natarajan, David Poole. Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers 2016.
  • Steffen Remus and Chris Biemann. Domain-Specific Corpus Expansion with Focused Webcrawling. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), pages 3607–3611, Portorož, Slovenia, May 2016.
  • Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 970–976, Denver, CO, USA, May–June 2015.
  • György Szarvas, Chris Biemann, and Iryna Gurevych. Supervised All-Words Lexical Substitution using Delexicalized Features. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1131–1141, Atlanta, GA, USA, 2013.


  • Gerold Hintz and Chris Biemann. Language Transfer Learning for Supervised Lexical Substitution. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
  • Gerold Hintz. Data-driven Paraphrasing and Stylistic Harmonization. In Proceedings of the 2016 Conference of the NAACL Student Research Workshop, 2016.
  • Gerold Hintz and Chris Biemann. Delexicalized Supervised German Lexical Substitution. In Proceedings of GermEval 2015: LexSub, p. 11, 2015.
