Computer-assisted Interactive Extraction of Dictionary Examples from Large Corpora
Dictionaries are an essential resource in many domains of research, education, and natural language processing (NLP). Nowadays, dictionaries are digital resources, often accessible through websites and web services, which makes them part of the portfolio of modern e-Research technologies and infrastructures.
Example sentences, which illustrate how a specific lemma is used, are a crucial part of dictionaries. Research on dictionary use has shown that users even tend to refer to (good) examples instead of consulting the rather complex descriptions of word grammar in dictionaries. However, dictionaries of contemporary languages have to meet two important requirements: they must be comprehensive and up to date, which means that newly emerging words need to be listed together with their senses and usages. To do so, lexicographers have to select usage examples from a large list of candidate sentences.
Recent studies show that existing systems for the automatic evaluation of dictionary examples are good at identifying bad examples, but cannot provide a fine-grained scale for potentially good ones. In this project, we develop a novel system that eases the work of lexicographers by interactively assessing the goodness and diversity of dictionary examples.
The key features of our dictionary example selection system include:
- Providing good estimates for the goodness of an example sentence
- Considering the diversity of yet-to-be-selected examples with respect to the already selected set of examples
- Interactive adaptation of our system based on a lexicographer's feedback
We unite all features into an interactive lexicographer interface which is adaptable to any language and use case (e.g., the creation of second language learning dictionaries).
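To illustrate how goodness and diversity could be weighed against each other, the following is a minimal sketch of a greedy selection loop. It is not the project's actual system: the goodness scores are assumed to be given, token overlap (`jaccard`) stands in for a learned sentence similarity, and the names `select_examples` and `trade_off` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a learned similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_examples(candidates, k, trade_off=0.7):
    """Greedily pick up to k sentences, trading goodness against redundancy.

    candidates: list of (sentence, goodness) pairs; goodness is assumed given.
    trade_off:  weight on goodness; (1 - trade_off) weights the penalty for
                similarity to sentences that were already selected.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(item):
            sentence, goodness = item
            # Redundancy is the highest similarity to any already selected sentence.
            redundancy = max(
                (jaccard(sentence, chosen) for chosen, _ in selected), default=0.0
            )
            return trade_off * goodness - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [sentence for sentence, _ in selected]


candidates = [
    ("the bank approved the loan", 0.90),
    ("the bank approved the loan today", 0.85),
    ("we sat on the river bank", 0.60),
]
print(select_examples(candidates, k=2, trade_off=0.5))
```

With an equal weighting, the near-duplicate second sentence is skipped in favour of the sentence that shows a different sense of the lemma, even though its goodness score is lower. Setting `trade_off=1.0` reduces the procedure to ranking by goodness alone.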
The project follows an iterative methodology that encompasses the following:
- Corpus – Together with lexicographers from the DWDS, we compile a corpus of pairwise annotations of dictionary example sentences and extend it successively.
- Interactive preference learning – Utilizing recent insights from preference learning and sentence classification with contextualized language models, we develop approaches which interactively learn from a lexicographer's feedback to automatically suggest better and more diverse dictionary examples.
- Crowd-sourced feedback – We also incorporate feedback from lay users of the dictionary as an additional signal for our trained models to further improve the quality of automatically extracted dictionary examples.
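As a rough illustration of how pairwise annotations can be turned into a fine-grained goodness scale, the sketch below fits a Bradley-Terry model with plain gradient ascent. This is a standard preference model chosen here for brevity, not necessarily the approach used in the project; the function name `fit_bradley_terry`, the hyperparameters, and the toy annotations are assumptions.

```python
import math


def fit_bradley_terry(pairs, n_items, epochs=200, lr=0.1, l2=0.01):
    """Fit one latent score per sentence from (preferred, rejected) index pairs.

    Maximizes the log-likelihood of
        P(preferred beats rejected) = sigmoid(score[preferred] - score[rejected])
    with a small L2 penalty that keeps scores finite for sentences that
    are never rejected.
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        # Start from the gradient of the L2 penalty, then add the data terms.
        grads = [-l2 * s for s in scores]
        for preferred, rejected in pairs:
            diff = scores[preferred] - scores[rejected]
            p_preferred = 1.0 / (1.0 + math.exp(-diff))
            grads[preferred] += 1.0 - p_preferred
            grads[rejected] -= 1.0 - p_preferred
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores


# Toy annotations (hypothetical): each pair is (preferred sentence, rejected sentence).
annotations = [(0, 1), (0, 2), (1, 2), (0, 1)]
scores = fit_bradley_terry(annotations, n_items=3)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In an interactive setting, each new judgement from a lexicographer or lay user would be appended to the list of pairs and the scores refitted, so that the ranking of candidate sentences reflects the feedback collected so far.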
- Prof. Dr. Iryna Gurevych, Principal Investigator
This project is carried out in cooperation with the Berlin-Brandenburgische Akademie der Wissenschaften, located in Berlin.
This project is funded by the Deutsche Forschungsgemeinschaft (German Research Foundation).
Lee, Ji-Ung ; Meyer, Christian M. ; Gurevych, Iryna (2020):
Empowering Active Learning to Jointly Optimize System and User Demands.
In: The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), virtual conference, 5-10 July 2020, pp. 4233-4247, [Online edition: https://www.aclweb.org/anthology/2020.acl-main.390/]
Simpson, Edwin ; Gurevych, Iryna (2020):
Scalable Bayesian Preference Learning for Crowds.
In: Machine Learning, 109. Springer, pp. 689-718, [Online edition: https://link.springer.com/article/10.1007/s10994-019-05867-2]