Talk in the Computer Science Colloquium Series

02.04.2019 14:00-15:30

Projection-Based Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Evaluation, and Some Misconceptions

02.04.2019, 14:00 – 15:30 | TU Darmstadt, Building S2/02, Room B 002, Hochschulstr. 10, 64289 Darmstadt

Host: Department of Computer Science

Speaker: Dr. Goran Glavaš (University of Mannheim)


Cross-lingual word embeddings (CLEs) hold the promise of multilingual modeling of meaning and cross-lingual transfer of NLP models. Early models for inducing cross-lingual word vector spaces, which required sentence- or document-level bilingual signal (i.e., parallel or comparable corpora), have recently been replaced by resource-leaner projection-based CLE models, which require cheap word-level bilingual supervision or even no supervision at all. Despite the ubiquitous usage of CLEs in downstream tasks, they are almost exclusively evaluated intrinsically, on the task of bilingual lexicon induction (BLI) only. Even BLI evaluations vary greatly, preventing us from correctly interpreting the performance and behavior of different CLE models. In this talk, I will present initial steps towards a comprehensive evaluation of cross-lingual word embeddings. I will present results of a systematic comparative evaluation of both supervised and unsupervised projection-based CLE models on a large number of language pairs, both on BLI and on three diverse downstream tasks, and provide new insights into the ability of cutting-edge CLE models to support cross-lingual NLP. Our study shows that the performance of CLE models largely depends on the downstream task and that overfitting CLE models to BLI can severely hurt downstream performance. Finally, I will indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess simple baselines, which display competitive performance in many settings.

Vita: Goran Glavaš is an Assistant Professor for Statistical Natural Language Processing at the Data and Web Science Group, School of Business Informatics and Mathematics, University of Mannheim. He obtained his Ph.D. at the Text Analysis and Knowledge Engineering Lab (TakeLab), Faculty of Electrical Engineering and Computing, University of Zagreb. His research interests are in the areas of statistical natural language processing (NLP) and information retrieval (IR), with a focus on lexical and computational semantics, multilingual and cross-lingual NLP and IR, information extraction, and NLP applications for the social sciences. Goran has (co-)authored over 50 publications in the areas of NLP and IR, publishing at top-tier NLP and IR venues (ACL, EMNLP, NAACL, EACL, SIGIR). He is a co-organizer of the TextGraphs workshop series on Graph-Based NLP and has served as a program committee member and reviewer for renowned journals (Computational Linguistics, Artificial Intelligence, Natural Language Engineering, Information Retrieval Journal) and conferences (ACL, EMNLP, AAAI, IJCAI, SIGIR) in the field.

Organization: Prof. Dr. Iryna Gurevych (gurevych@ukp.informatik.tu-darmstadt.de, Phone: 06151 / 16-5411)
