Natural Language Processing and the Web

Course content will be available at the NLPWeb Moodle site.

Please note that the content of this website is subject to change.


  • Prof. Dr. Iryna Gurevych
  • Dr. György Szarvas

Please note that the lecture will be held in English.

Practice Classes

  • Michael Matuschek

Office hours: Monday, 14:00-15:00, Room S202/C003

Please email your questions by Monday, 12:00.


To pass, each student has to take the written exam at the end of the semester.

There will also be graded assignments in the practice classes, which will contribute to your overall grade.


If you plan to participate in this course, please register here.


  • Lecture: Thursday 9:50-11:30, Room S202/C205
  • Practice class: Thursday 11:40-13:10, Room S202/C003
  • If you cannot attend this practice class, please work on the tasks on your own and use the office hours (see above) to ask questions.


  • The final exam will be on March 1, 2011, 9:50-12:00, in Room S202/C110.

Course content

The Web contains more than 10 billion indexable pages, which can be retrieved via search queries. The lecture will present Natural Language Processing (NLP) methods to (1) automatically process large amounts of unstructured text from the Web and (2) analyse the use of Web data as a resource for other NLP tasks.

  • Processing of unstructured web content
    • Introduction
    • NLP basics – tokenisation, part-of-speech tagging, chunking, stemming, lemmatisation
    • UIMA 1 – principles
    • UIMA 2 – applications
    • Web contents and their characteristics – diverse genres of web contents, e.g. personal web sites, news sites, blogs, forums, wikis
    • Web contents and their characteristics – continued
    • Web as corpus – innovative use of the web as a very big, distributed, linked, growing and multilingual corpus
    • Web as corpus – continued
  • NLP applications for the web
    • Subjectivity Analysis
    • Information retrieval – introduction to the basics of information retrieval
    • Web information retrieval – natural language interfaces for web information retrieval
    • Question answering
    • Summarization
    • Mining Web 2.0 Sites, such as Wikipedia and Wiktionary
    • Quality Assessment of Web Contents


  • Kai-Uwe Carstensen, Christian Ebert, Cornelia Endriss, Susanne Jekat, Ralf Klabunde, Computerlinguistik und Sprachtechnologie. Eine Einführung, 2nd edition, Heidelberg: Spektrum-Verlag, March 2004. ISBN 3827414075
  • T. Götz & O. Suhre, Design and implementation of the UIMA Common Analysis System, IBM Systems Journal, 2004, 43, 476-489.
  • Adam Kilgarriff & Gregory Grefenstette, Introduction to the special issue on the web as corpus, Computational Linguistics, MIT Press, 2003, 29, 333-347.
  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.