Case2vec: Distributed Representations of Event Log Traces for Process Clustering
This work focusses on word embeddings for process data representation. Process data records the activities executed within an enterprise, e.g., a procurement process within a business. This data is usually stored as a collection of traces, i.e., sequences of activities, often augmented with attributes such as a timestamp and an executor. Word embeddings can be used to cluster these traces, which ultimately supports better understanding and analysis in process discovery. The idea builds on a previous framework which only focusses on the activity name. The extended framework developed in this thesis incorporates an additional semantic level by using not only the activity name but also the attributes an activity provides.
The advantage of word embeddings over traditional clustering approaches is that activities are compared on both a syntactic and a semantic level. Traditional approaches find similarities between traces based on common traits, e.g., the activity name. Word embeddings create a vector space representation of these traits. Based on co-occurrences within an event log, similar traces are projected to similar regions in the vector space. This means that activities are considered similar when they occur together frequently, even if their actual names are entirely different. This semantic similarity is the core advantage over traditional methods, which rely on purely syntactic similarity.
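The co-occurrence idea can be illustrated with a minimal sketch. It is not the framework evaluated in this thesis: it swaps the learned word embedding for raw neighbour counts, and the toy event log, the window size, and all function names are assumptions made purely for illustration. It shows that two activities with different names ("approve order" and "authorize order") obtain near-identical vectors when they appear in the same surroundings.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy event log: each trace is a sequence of activity names.
traces = [
    ["create order", "approve order", "send invoice", "receive payment"],
    ["create order", "authorize order", "send invoice", "receive payment"],
    ["create order", "approve order", "send invoice", "send reminder", "receive payment"],
    ["create order", "authorize order", "send invoice", "send reminder", "receive payment"],
]

def cooccurrence_vectors(log, window=1):
    """For every activity, count the activities seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for trace in log:
        for i, activity in enumerate(trace):
            lo, hi = max(0, i - window), min(len(trace), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[activity][trace[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(traces)

# Different names, same neighbours: the count vectors point in the same direction.
sim_same_role = cosine(vectors["approve order"], vectors["authorize order"])
# Different names, different neighbours: noticeably lower similarity.
sim_diff_role = cosine(vectors["approve order"], vectors["receive payment"])

print(f"approve vs. authorize:       {sim_same_role:.2f}")
print(f"approve vs. receive payment: {sim_diff_role:.2f}")
```

A purely name-based comparison would treat "approve order" and "authorize order" as unrelated; the count vectors place them together because both sit between "create order" and "send invoice". A learned embedding compresses the same kind of evidence into dense, low-dimensional vectors.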
A comprehensive evaluation of the original framework is provided. Furthermore, it is shown that the extended version leads to a decisive improvement in clustering quality. The original framework reached a normalized mutual information (NMI) score of 0.08 on a small real-life dataset, which can be increased to 0.12 with proper hyperparameter optimization. The proposed framework incorporating attributes reaches an NMI of 0.40 on this dataset. On a larger and more task-suited real-life dataset, the original framework reaches an NMI of up to 0.45 with proper hyperparameter optimization, while the proposed approach with incorporated attributes reaches up to 0.93. These results are confirmed by dimensionality-reducing visualization techniques and artificially sampled datasets which successfully recreate the coverage results.
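For readers unfamiliar with the metric, the following sketch computes NMI between a ground-truth labelling and a cluster assignment. It assumes normalization by the arithmetic mean of the two entropies; the abstract does not state which NMI variant is used, so this choice, like the function names, is an assumption. A score of 1 means the clustering reproduces the labels exactly (up to renaming of clusters), and a score of 0 means the two are statistically independent.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(truth, predicted):
    """NMI, normalized by the arithmetic mean of both entropies (an assumed variant)."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    if h_truth == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings consist of a single group
    return mutual_information(truth, predicted) / ((h_truth + h_pred) / 2)

# Hypothetical labels: same grouping with swapped cluster ids.
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# Hypothetical labels: clusters carry no information about the true groups.
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI is invariant to how clusters are numbered, it suits trace clustering, where cluster ids carry no meaning of their own.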
Apart from trace clustering, the proposed method can be applied to determine the coverage of certain activities among traces in a process. The framework can also be used to group similar activities and thereby remove contaminated activity names, which supports data cleansing. Additionally, it is shown that word embeddings allow for interpretability when employing vector space arithmetic. Overall, the proposed method of word embeddings for process data shows great potential for trace clustering, detection of anomalous activities, and process discovery in general.