Text as a Product


Using machine learning for linguistic analysis


Structural linguistics and recent machine learning methods share the notion that text is generated from mental representations. Linguistics views the genesis of a text as the encoding of a system of meaning into text (cf. e.g. Halliday and Hasan, 1976, p. 4), while topic models are a family of probabilistic generative machine learning approaches that model the generation of documents from a distribution over topics. This project aims to combine both worlds: we examine whether, beyond this structural analogy, there are also analogies between parts of the system of meaning and machine-learned topics.


For our analysis, we annotate a text corpus with lexical cohesion relations and automatically acquire topics. We then use the topics to predict lexical cohesion: the topic membership of lexical items and significance scores between lexical items inform an automatic system for lexical chain annotation. Besides aiming at a state-of-the-art system for lexical chain identification, we analyse the semiotic interpretability of stochastic methods.
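To make the idea of a significance score between lexical items concrete, the following is a minimal sketch that assumes pointwise mutual information (PMI) over sentence-level co-occurrence. The project description does not name the measure it uses, so PMI, the function name `pmi_scores`, and the toy corpus are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Return PMI for every word pair that co-occurs in at least one sentence."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs in alphabetical order
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities over sentences
        scores[(a, b)] = math.log2(joint * n / (word_counts[a] * word_counts[b]))
    return scores


# Toy corpus, invented for illustration only (not project data)
toy_sentences = [
    ["court", "judge", "verdict"],
    ["judge", "verdict", "appeal"],
    ["garden", "flower", "soil"],
    ["flower", "soil", "court"],
]
scores = pmi_scores(toy_sentences)
```

In this toy corpus, pairs that always appear together, such as "judge" and "verdict", receive a positive score, while pairs that co-occur only as often as chance predicts score zero. A score of this kind could serve as one feature when deciding whether two lexical items belong to the same chain.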


This project examines the correspondence between linguistic concepts and automatically extracted topic models. Specifically, we use Latent Dirichlet Allocation (LDA) topics (Blei and Lafferty, 2009) to model lexical cohesion. To this end, we annotate text with cohesion relations and use topics and lexical co-occurrence statistics as features to assess the cohesion of a text and to compute lexical chains.
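As a rough illustration of how topic information might feed lexical chain computation, the sketch below links words whose topic distributions are similar. It is not the project's system: the per-word topic vectors in `TOPIC_GIVEN_WORD` are hand-written stand-ins for values that would come from a trained LDA model, and the cosine measure, the greedy strategy, and the threshold of 0.9 are assumptions chosen for brevity.

```python
import math

# Hypothetical P(topic | word) vectors over three topics; in practice these
# would be derived from a trained LDA model, not written by hand.
TOPIC_GIVEN_WORD = {
    "judge":   [0.90, 0.05, 0.05],
    "verdict": [0.85, 0.10, 0.05],
    "court":   [0.70, 0.10, 0.20],
    "flower":  [0.05, 0.90, 0.05],
    "soil":    [0.10, 0.80, 0.10],
}


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def build_chains(tokens, threshold=0.9):
    """Greedily attach each known token to the first chain whose most
    recent member has a sufficiently similar topic vector."""
    chains = []
    for token in tokens:
        if token not in TOPIC_GIVEN_WORD:
            continue  # skip words without topic information
        for chain in chains:
            last = chain[-1]
            if cosine(TOPIC_GIVEN_WORD[token], TOPIC_GIVEN_WORD[last]) >= threshold:
                chain.append(token)
                break
        else:
            chains.append([token])  # no similar chain found: start a new one
    return chains


chains = build_chains(["judge", "flower", "verdict", "the", "soil", "court"])
print(chains)
```

With these toy vectors the legal terms and the gardening terms end up in two separate chains. A fuller system would presumably combine such topic features with co-occurrence statistics and evaluate the result against the manual cohesion annotation.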

Unlike syntactically inspired projects, we focus on semantic aspects of texts. Further, we use topic representations to quantify the experiential function of documents.

Figure: sample lexical chains


Cohesion in English

M.A.K. Halliday and R. Hasan

In: English Language Series, Longman, London, 1976

Topic Models

D. Blei and J. Lafferty

In: A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis, 2009.




The LOEWE Research Center “Digital Humanities” is funded by the Hessian excellence program “Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz” (LOEWE).