Investigation into disentanglement of email threads
Thread disentanglement is the task of separating out conversations whose thread structure is implicit, distorted, or lost. The thread structure normally recorded in email headers may be distorted or removed when participants wish to prevent third parties, such as law enforcement, from understanding the conversation.
For this project, we produced the Enron Threads Corpus (ETC), a newly extracted corpus of 70,178 multi-email threads drawn from the Enron Email Corpus.
In our experiments, we performed email thread disentanglement through pairwise classification, using text similarity measures on the non-quoted text of emails. We found that (i) content text similarity metrics outperform style and structure text similarity metrics in both class-balanced and class-imbalanced settings, and (ii) although feature performance depends on the semantic similarity of the corpus, content features remain effective even when controlling for semantic similarity.
Figure: An unsorted bag of emails undergoes pairwise classification to identify email threads.
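As a rough illustration of the pairwise step described above, the following Python sketch scores every pair of emails with a single content feature and marks a pair as "same thread" when the score clears a cutoff. It is not the method from our experiments: the bag-of-words cosine similarity, the fixed threshold of 0.3 (standing in for a trained classifier), the assumption that quoted lines start with ">", and the names strip_quoted, bag_of_words, cosine, and same_thread_pairs are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def strip_quoted(body: str) -> str:
    """Drop lines starting with '>' (an assumed quoting convention)."""
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a deliberately simple content representation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def same_thread_pairs(emails: dict[str, str],
                      threshold: float = 0.3) -> set[tuple[str, str]]:
    """Return the email-id pairs whose non-quoted text is similar enough.

    The threshold is an illustrative stand-in for a trained pairwise classifier.
    """
    vectors = {eid: bag_of_words(strip_quoted(body))
               for eid, body in emails.items()}
    return {
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    }


# Toy "bag of emails": e2 replies to e1, e3 is unrelated.
emails = {
    "e1": "Can we move the pipeline meeting to Thursday afternoon?",
    "e2": ("> Can we move the pipeline meeting to Thursday afternoon?\n"
           "Thursday afternoon works for the pipeline meeting."),
    "e3": "Lunch menu for the cafeteria next week is attached.",
}
pairs = same_thread_pairs(emails)
print(pairs)
```

On this toy input only the pair (e1, e2) clears the threshold. The quoted line in e2 is removed before comparison, so the match rests on the reply's own wording rather than on copied text.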
Enron Threads Corpus
The Enron Threads Corpus is available here.
A paper describing the creation of the ETC and our experiments on pairwise classification for email thread disentanglement can be found here:
Emily K. Jamison and Iryna Gurevych.
Proceedings of the 9th Conference on Recent Advances in Natural Language Processing. September 2013. Hissar, Bulgaria.