VisADoc

Feature-based Visualization and Analysis of Natural Language Documents (VisADoc)

Motivation

The amount of digital text data, e.g. text created by Web users, has grown rapidly in recent years, leading to heavy information overload. Search engines help users find relevant documents, but they do not provide advanced tools for analyzing and understanding the dimensions of text relevant to the users' needs.

The major challenge is the gap between automatically computable text features and the above-mentioned needs, which has to be bridged to facilitate the user’s interaction with documents, e.g. understanding why two documents are similar, how the documents within an automatically computed cluster are related, or determining the relevant aspects of text quality and age suitability. The VisADoc project aims to develop new visual analytics techniques for closing this gap.

Goals

  • Investigation of novel textual features for modeling content-related text properties
  • Development of an interactive feature engineering approach for complex user-defined semantic properties
  • Development of visual analysis tools that support the exploration of large document collections with respect to a certain text property

Methods

We analyze text along different aspects, which are captured by automatically computed features and by an interactive, visually supported feature engineering approach that allows users to explore and evaluate user-defined text properties in large document collections. These features are then used for advanced text analysis, making the analysis more effective and more accurate.

To this end, we investigate novel textual features for modeling content-related text properties. A tight integration of automatic text analysis with multidimensional text and feature visualization is crucial to the proposed interactive process. The research is embedded in an end-to-end framework that supports defining text measures according to users' interests.
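As a rough illustration of what "automatically computed features" can look like, the following minimal Python sketch derives a few simple surface features per paragraph. The function names and the chosen features (average sentence length, average word length, type-token ratio) are assumptions made for this example only; they are not the features or the code developed in the project.

```python
import re

# Hypothetical helper names; the feature set is illustrative only.

def paragraph_features(paragraph: str) -> dict:
    """Compute a few simple surface features for one paragraph."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    # Lowercased word tokens, allowing one internal apostrophe (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", paragraph.lower())
    if not words:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        # Words per sentence: a crude proxy for syntactic complexity.
        "avg_sentence_length": len(words) / len(sentences),
        # Characters per word: a crude proxy for lexical difficulty.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by all words: a crude proxy for lexical variety.
        "type_token_ratio": len(set(words)) / len(words),
    }


def document_features(text: str) -> list:
    """Split a text into paragraphs on blank lines and featurize each one."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [paragraph_features(p) for p in paragraphs]


if __name__ == "__main__":
    sample = "The cat sat. The cat ran!\n\nA dog barked."
    for i, feats in enumerate(document_features(sample), start=1):
        print(i, feats)
```

Per-paragraph feature vectors of this kind could then be plotted along the course of a document, which is the general idea behind a per-paragraph view of a text property.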

Below are several examples of our visual semantic exploration of children's books:

Flow of Harry's emotions in the chapters of the book Harry Potter and the Sorcerer's Stone:

Distribution of Harry's activities (as verb semantic classes) in the book:

Analysis of readability difficulty per paragraph in each chapter of Harry Potter and the Sorcerer's Stone:

Position of activities (verb classes) represented as dense vectors (embeddings) in the semantic space:

Resources

Below, we make openly available some of the resources produced through this project.

Datasets:

Source code:

Other NLP resources:

Various:

Personality profiling in books (fictional character personality assessment game, adapted from psychology questionnaires): http://books.ukp.informatik.tu-darmstadt.de/

Partners

This project is carried out in cooperation with the University of Konstanz.

People

  • Prof. Dr. Iryna Gurevych, Principal Investigator
  • Prof. Dr. Daniel Keim, Principal Investigator, Computer Science Institute, University of Konstanz
  • Dr. Daniela Oelke, Senior Researcher
  • Lucie Flekova, Doctoral Researcher

Funding

This project is funded by the Deutsche Forschungsgemeinschaft (German Research Foundation).