Predicting the outcome of a (medical) treatment can draw on various kinds of information. Depending on the treatment, this information comes in different data modalities, e.g., audio or video recordings, textual records, or biomarkers. When training an automatic prediction model, it can be beneficial to combine data from different modalities. However, it is important to study the effects of different modality combinations. In particular, this investigation focuses on the role of textual and audio data as additional diagnostic indicators in therapy.