DIP 2016 Corpus: Focused Retrieval over the Web
This corpus was introduced in the SIGIR 2016 article “New Collection Announcement: Focused Retrieval Over the Web”:
Ivan Habernal, Maria Sukhareva, Fiana Raiber, Anna Shtok, Oren Kurland, Hadar Ronen, Judit Bar-Ilan, and Iryna Gurevych. New Collection Announcement: Focused Retrieval Over the Web. In: SIGIR '16, Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 701-704, ACM, July 2016.
The corpus is available here:
There are two folders available:
- Contains intermediate data: the original plain text, votes from Amazon Mechanical Turk workers, additional instructions for labeling relevant/irrelevant sentences, etc.
- Contains the final, clean exported corpus.
- The annotations are licensed under CC-BY 4.0.
- The original content from ClueWeb12 keeps its original license.
- Please cite the SIGIR 2016 article if you use the data in any of your work.
- The software package used for preparing this data can be found at the following GitHub repository