Text Analytics: Unsupervised Methods in NLP

Description

With the growing availability of text resources and computational power, it has become possible to identify structure in language data without manual annotation. Whether in search engines or machine translation, models learned from data alone can significantly improve performance on natural language tasks by folding in background knowledge of language that has been acquired solely by looking at large amounts of text.

To this end, we review concepts of language statistics, language modeling, and clustering, as they form the backbone of much of the work in this area.

Selected topics include unsupervised part-of-speech tagging, unsupervised morphology, word sense induction, acquisition of semantic relations, topic models, unsupervised parsing, and latent semantic analysis.

The seminar provides detailed coverage of current techniques, their strengths and limitations, and current research directions, drawing on recent research papers. In the course of the seminar, students will acquire key skills such as the fundamentals of academic research and scientific writing, and they will be encouraged to improve their presentation skills.

This seminar is held in the format of a Mini-Workshop: after an introductory lecture, individual topics are assigned, and introductory literature is provided for each topic. Students write a paper consisting of a literature overview and a description of their own experiment. Papers are mutually peer-reviewed. In a final workshop, each student presents their work in a 20-minute slot.

Literature

Literature is distributed by topic.

Expectations

Each student is expected to

  • submit a term paper
  • review other students' papers
  • give a 15-minute talk in class, followed by 5 minutes of Q&A
  • hand in a revised final version of their paper

Materials and Forum

The course management system is used as the primary communication platform for the seminar and also contains any related material. The access key will be provided in the first seminar session.

For general advice on presenting your topic, please have a look at these guidelines.

Timetable

The seminar takes place on Tuesdays, 15:20 – 17:00, in S2|02 C120.

  • Introductory session 18.10.2011
  • Topic presentation and assignment 25.10.2011
  • Term paper deadline 20.12.2011
  • Reviews due 10.1.2012
  • Mini-Workshop TBA, 7.2.2012
  • Final paper deadline 14.2.2012

Lecturer

  • Prof. Dr. Chris Biemann

Proceedings of the Workshop

The papers and presentation slides of all participants who gave permission to publish their materials are listed below.

  • Chris Biemann: Introduction to the workshop (slides)
  • Jens Haase: Punctuation Correction with N-grams
  • Irina Alles: Punctuation Correction using Web Counts (paper) (slides)
  • Richard Steuer: Distributional Similarity for Text Similarity (paper) (slides)
  • Johannes Schwandke: Bestimmung von semantisch ähnlichen Worten mit Hilfe von Kookkurrenzen und Wortstämmen [Determining Semantically Similar Words Using Co-occurrences and Word Stems] (paper) (slides)
  • Tobias Krönke: Co-occurrence relations for Text Segmentation (paper) (slides)
  • Leander Baumann: TextTiling with Cooccurrence Data
  • Konstantin Tennhard: Clustering of Movie Subtitles
  • David Kaufmann: Weitläufig überwachte Extraktion von Relationen mittels Freebase Relationen [Distantly Supervised Relation Extraction Using Freebase Relations] (paper) (slides)
  • Marcel Ackermann: Distant Supervised Relation Extraction with Wikipedia and Freebase (paper) (slides)