RWSE Wikipedia Revision Dataset
Datasets of real-word spelling errors (misspellings that produce another valid word, such as "peace" for "piece") mined from the Wikipedia revision history.
Each instance consists of the original sentence with an error and the sentence where the error has been corrected.
Each instance also contains the ID of the Wikipedia article and the ID of the revision, so it can be traced back to the original Wikipedia article.
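The instance structure described above can be sketched in Python as follows. This is only an illustration: the class name `RwseInstance`, its field names, the helper functions, and the example values are assumptions and do not reflect the actual file format of the dataset. The sketch also assumes that the stored revision ID can be used directly as MediaWiki's `oldid` parameter, and that the two sentences are whitespace-tokenized with the same number of tokens.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class RwseInstance:
    """Hypothetical in-memory form of one dataset instance."""
    article_id: int          # ID of the Wikipedia article
    revision_id: int         # ID of the revision the sentence pair came from
    error_sentence: str      # sentence containing the error
    corrected_sentence: str  # sentence with the error corrected


def revision_url(instance: RwseInstance, lang: str = "en") -> str:
    """Build a permalink to the revision using the `oldid` query parameter."""
    query = urlencode({"oldid": instance.revision_id})
    return f"https://{lang}.wikipedia.org/w/index.php?{query}"


def changed_tokens(instance: RwseInstance) -> list[tuple[str, str]]:
    """Return (error, correction) pairs for tokens that differ.

    Assumes both sentences split into the same number of whitespace tokens.
    """
    err = instance.error_sentence.split()
    cor = instance.corrected_sentence.split()
    if len(err) != len(cor):
        raise ValueError("sentences are not token-aligned")
    return [(e, c) for e, c in zip(err, cor) if e != c]


# Made-up example values, not taken from the dataset.
example = RwseInstance(
    article_id=12345,
    revision_id=67890,
    error_sentence="I would like a peace of cake .",
    corrected_sentence="I would like a piece of cake .",
)
```

With a representation like this, `revision_url(example)` gives a link for tracing the instance back to its source revision, and `changed_tokens(example)` isolates the word that was corrected.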
- English errors from Wikipedia (training)
- English errors from Wikipedia (test)
- German errors from Wikipedia (training)
- German errors from Wikipedia (test)
- English artificial errors
- English artificial errors (nouns only)
- German artificial errors
- German artificial errors (nouns only)
If you use the dataset, please cite:
Torsten Zesch. Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), April 2012.
There is also a poster on the topic.
The software used for these experiments is also freely available: