Document Context-Aware Interpretable Sentence Similarity

Bachelor Thesis, Master Thesis

Measuring sentence similarity [1] is a classic topic in natural language processing (NLP). Semantic Textual Similarity (STS) [2] is a well-studied task that measures how equivalent two sentences are in meaning by predicting a similarity score. Interpretable STS (iSTS) [3] goes further: it explains why and how two sentences are similar or different by supplementing the STS score with an explanation. Previous work on STS and iSTS analyzes sentence pairs in isolation, without access to the document-level context. The proposed thesis topic is based on the core idea that the meaning of a sentence is shaped by its context, and that sentence similarity can be better determined and explained by taking that context into account. This thesis will construct a document revision dataset containing alignments between sentence pairs, each annotated with an alignment type and a similarity score. An iSTS system based on advanced sentence Transformer models such as [4] will be trained on this dataset. Given a pair of sentences and their corresponding contexts, the system explains what is similar and what is different, in the form of graded and typed sentence alignments. By systematically comparing systems with and without knowledge of the document context, this thesis will answer the question of whether it is beneficial to measure sentence similarity in context.
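
To make the planned dataset more concrete, one record of a graded and typed sentence alignment could be sketched as follows. This is only an illustration under stated assumptions: the class name `SentenceAlignment`, its field names, the label set in `ALIGNMENT_TYPES`, and the 0 to 5 score range are all hypothetical, since the topic description does not fix any of them.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set; the actual alignment types are to be defined in the thesis.
ALIGNMENT_TYPES = {
    "equivalent", "similar", "more_specific",
    "less_specific", "opposite", "unaligned",
}


@dataclass(frozen=True)
class SentenceAlignment:
    """One aligned sentence pair from two revisions of a document."""
    source_sentence: str
    target_sentence: str
    source_context: List[str]   # surrounding sentences in the old revision
    target_context: List[str]   # surrounding sentences in the new revision
    alignment_type: str         # one of ALIGNMENT_TYPES (assumed labels)
    similarity: float           # assumed to be graded on a 0-5 scale

    def __post_init__(self) -> None:
        # Reject records that fall outside the assumed label set or score range.
        if self.alignment_type not in ALIGNMENT_TYPES:
            raise ValueError(f"unknown alignment type: {self.alignment_type!r}")
        if not 0.0 <= self.similarity <= 5.0:
            raise ValueError(f"similarity out of range: {self.similarity}")


example = SentenceAlignment(
    source_sentence="The model is trained on news articles.",
    target_sentence="The model is trained on news and blog articles.",
    source_context=["We describe our setup below."],
    target_context=["We describe our revised setup below."],
    alignment_type="more_specific",
    similarity=4.0,
)
```

A context-unaware baseline would read only the two sentence fields, while a context-aware system would also consume the two context fields; keeping both in one record lets the same data serve both sides of the comparison.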