Feature-based Visualization and Analysis of Natural Language Documents (VisADoc)
The amount of digital text data, for example text created by Web users, has grown rapidly in recent years, leading to heavy information overload. Search engines help users find relevant documents, but they do not provide advanced tools for analyzing and understanding the dimensions of text relevant to the users' needs.
The major challenge is the gap between automatically computable text features and these needs. This gap has to be bridged to facilitate the user’s interaction with documents, e.g. understanding why two documents are similar, how the documents within an automatically computed cluster are related, or determining the relevant aspects of text quality and age suitability. The VisADoc project aims at developing new visual analytics techniques for closing this gap.
- Investigation of novel textual features for modeling content-related text properties
- Development of an interactive feature engineering approach for complex user-defined semantic properties
- Development of visual analysis tools that support the exploration of large document collections with respect to a certain text property
We analyze text according to different aspects, determined through automatically computed features and through an interactive, visually supported feature engineering approach that allows exploration and evaluation of user-defined text properties in large document collections. These features are then used for advanced text analysis, improving its effectiveness and accuracy.
To this end, we investigate novel textual features for modeling content-related text properties. A tight integration of automatic text analysis with multidimensional text and feature visualization is crucial to the proposed interactive process. The research is embedded in an end-to-end framework that supports defining text measures according to users' interests.
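To make the notion of automatically computed features more concrete, the following is a minimal sketch that computes a few simple surface features per paragraph. The feature set (average sentence length, average word length, type-token ratio) and the function names `paragraph_features` and `document_features` are illustrative assumptions, not the features or code developed in the project.

```python
import re


def paragraph_features(paragraph: str) -> dict:
    """Compute a few illustrative surface features for one paragraph.

    These features are assumptions chosen for the example; they are not
    the project's feature set.
    """
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    # Naive tokenization: lowercase alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", paragraph.lower())
    if not words or not sentences:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }


def document_features(text: str) -> list:
    """Return one feature dictionary per paragraph (blank-line separated)."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [paragraph_features(p) for p in paragraphs]


if __name__ == "__main__":
    sample = "The cat sat. The cat ran.\n\nDogs bark!"
    for index, features in enumerate(document_features(sample), start=1):
        print(index, features)
```

Per-paragraph feature vectors of this kind could then be passed to a visualization or a classifier; which features are informative for a given text property is exactly what an interactive feature engineering process would have to determine.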
Below are several examples of our visual semantic exploration of children's books:
Flow of Harry's emotions in the chapters of the book Harry Potter and the Sorcerer's Stone:
Distribution of Harry's activities (as verb semantic classes) in the book:
Analysis of readability difficulty per paragraph in each chapter of Harry Potter and the Sorcerer's Stone:
Position of activities (verb classes) represented as dense vectors (embeddings) in the semantic space:
Below, we make openly available some of the resources produced through this project.
- German school lesson transcripts, described in this paper: https://www.ukp.tu-darmstadt.de/data/quality-assessment/school-lesson-quality/
- Personality of characters in books, described in this paper: https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/sentiment-analysis/Personality-GOLD_characters.tsv
- Wikipedia Article Feedback, described in this paper: https://www.ukp.tu-darmstadt.de/data/quality-assessment/wikipedia-article-feedback/
- Neural networks integrating semantic feature vectors, from this paper: https://github.com/UKPLab/acl2016-supersense-embeddings/tree/master/classification
- Supersense tagger described in this paper: https://github.com/UKPLab/acl2016-supersense-embeddings/tree/master/tagger
- Classifying the quality of German school lessons, framework from this paper: https://github.com/UKPLab/jlcl2015-pythagoras
Other NLP resources:
- Pretrained Wikipedia word and supersense embeddings (from this paper) in W2V format: public.ukp.informatik.tu-darmstadt.de/wikipedia/supersense-embeddings.txt.zip (skip-gram, 300 dimensions, window size = 2, min. frequency = 200).
- Polarity switching sentiment bigrams described in this paper: https://www.ukp.tu-darmstadt.de/data/sentiment-analysis/inverted-polarity-bigrams/
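As a rough illustration of how embeddings in the W2V text format can be read, the sketch below parses such data with the Python standard library only. It assumes the common word2vec text layout (a header line with vocabulary size and dimensionality, then one token per line followed by its space-separated vector components); whether the released file follows exactly this layout has not been verified here. The function names and the tiny in-memory sample are hypothetical.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse word2vec text format from an open text handle.

    Assumed layout: first line "<vocab_size> <dim>", then one line per
    token: "<token> <v1> ... <v_dim>".
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Tiny made-up sample with 2 dimensions; the real file uses 300.
    sample = io.StringIO("3 2\nwizard 1.0 0.0\nwitch 1.0 0.0\nbroom 0.0 1.0\n")
    vectors, dim = load_word2vec_text(sample)
    print(dim, cosine(vectors["wizard"], vectors["witch"]))
```

For the full-size file, replacing the in-memory sample with `open(path, encoding="utf-8")` on the unzipped text file should work under the same layout assumption. For large vocabularies, a library loader such as gensim's `KeyedVectors.load_word2vec_format` is likely to be more memory-efficient than plain Python lists.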
Personality profiling in books (fictional character personality assessment game, adapted from psychology questionnaires): http://books.ukp.informatik.tu-darmstadt.de/
This project is established in cooperation with the University of Konstanz.
- Prof. Dr. Iryna Gurevych, Principal Investigator
- Prof. Dr. Daniel Keim, Principal Investigator, Computer Science Institute, University of Konstanz
- Dr. Daniela Oelke, Senior Researcher
- Lucie Flekova, Doctoral Researcher
This project is funded by the Deutsche Forschungsgemeinschaft (German Research Foundation).