Natural Language Processing and the Web


  • Prof. Dr. Iryna Gurevych
  • Dr. Delphine Bernhard
  • Dr. Mark-Christoph Müller

Practice Classes

  • Niklas Jakob
  • Christof Müller
  • Cigdem Toprak
  • Torsten Zesch


  • Lecture, Thursday, 13:30 – 15:10, Room S103/23, starting on October 16, 2008
  • Practice class, Thursday, 15:15 – 16:45, Room S202/C003, starting on October 23, 2008
  • Consultation hour, Tuesday, 16:00 – 16:30, Room S202/A213


  • Midterm exam: December 4, 2008, 13:30 – 15:10, Room S103/23
  • Final exam: February 18, 2009, 16:00 – 18:00, Room S202/C205

Final exam inspection (“Einsicht”): Thursday, April 16, 2009, 10:15 – 11:00, Room S202/A213

Course content

The web contains more than 10 billion indexable web pages, which can be retrieved via search queries. The lecture presents Natural Language Processing methods for (1) automatically processing large amounts of unstructured text from the web and (2) using web data as a resource for other NLP tasks.

  • Processing of unstructured web content
    • Introduction
    • Levels of linguistic analysis: tokenisation, part-of-speech tagging, stemming, lemmatisation, chunking
    • UIMA part 1: principles
    • UIMA part 2: applications
    • Web as corpus part 1: innovative use of the web as a very large, distributed, linked, growing and multilingual corpus
    • Web as corpus part 2
    • Web contents and their characteristics; diverse genres of web contents, e.g. personal web sites, news sites, blogs, forums, wikis
  • NLP applications for the web
    • Opinion Mining and Sentiment Analysis
    • Web information retrieval parts 1 & 2 – natural language interfaces for web information retrieval
    • Question answering
    • Summarization
    • Mining web sites, for instance Wikipedia and Wiktionary
    • Assessment of the quality of web content

For more information, please visit the course's homepage on CLIX, the learning management platform of TU Darmstadt (available only to students enrolled at TU Darmstadt).

Literature


  • Kai-Uwe Carstensen, Christian Ebert, Cornelia Endriss, Susanne Jekat, Ralf Klabunde, Computerlinguistik und Sprachtechnologie. Eine Einführung. Heidelberg: Spektrum-Verlag, March 2004 (2nd edition). ISBN 3827414075
  • T. Götz & O. Suhre, Design and implementation of the UIMA Common Analysis System, IBM Systems Journal, 2004, 43, 476-489.
  • Adam Kilgarriff & Gregory Grefenstette, Introduction to the special issue on the web as corpus, Computational Linguistics, MIT Press, 2003, 29, 333-347.
  • Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.