Spelling Difficulty Prediction

If you use the extracted errors, please cite:

Lisa Beinborn, Torsten Zesch, Iryna Gurevych. Predicting the Spelling Difficulty of Words for Language Learners. In: Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications held in conjunction with NAACL 2016, pp. 73-83, 2016.

We extracted spelling errors from the following corpora:

  • The EFCAMDAT-corpus (abbreviated EFC in the paper): 167,713 spelling errors from learner essays in English.
    If you refer to the EFCAMDAT corpus, please cite:
    J. Geertzen, T. Alexopoulou, and A. Korhonen. Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat). In: Ryan T. Miller, editor, Selected Proceedings of the 2012 Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project. 2012.
  • The MERLIN-Corpus: 4,971 spelling errors from learner essays in Italian and German (2,525 German errors, 2,446 Italian errors).
    If you refer to the MERLIN corpus, please cite:
    Abel, Andrea; Wisniewski, Katrin; Nicolas, Lionel; Boyd, Adriane; Hana, Jirka; Meurers, Detmar. A Trilingual Learner Corpus illustrating European Reference Levels. In: Ricognizioni – Rivista di Lingue, Letterature e Culture Moderne 2 (1), 111-126. 2014.
    Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data. In: Proceedings of the Conference ICT for Language Learning 2013, Florence, Italy, November 14-15, 2013.
  • The FCE-Corpus: 4,047 spelling errors from learner essays in English.
    We are currently awaiting permission to publish the spelling errors from the FCE corpus.
    If you refer to the FCE-corpus, please cite:
    Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. A New Dataset and Method for Automatically Grading ESOL Texts. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.

For any questions, please contact Lisa Beinborn.

The software used for these experiments is also freely available: