Bernadett Csala-Ferencz – Cluster analysis of online depression forum posts – Applying the scatter/gather method on textual data

2021, Survey Statistics and Data Analytics MSc, Supervisor: Renáta Németh, PhD


Cases of depression are increasingly common, and internet forums offer a valuable opportunity to better understand the nature of mental illness and to identify severe cases of depression. For the latter, examining divergent pronoun use (such as increased use of the first person singular) is an effective means of identification. For my research I performed a cluster analysis on 66,295 posts from English-language forums devoted to depression, to examine the groups into which these posts can be organized. Getting to know and understand these forums was not the only goal of this research. Methodologically, I wanted to find the optimal level of text preprocessing and to examine whether the scatter/gather algorithm can be used effectively to find interpretable clusters. In the course of my work, 15 clusters were identified, and the applied scatter/gather clustering method proved to be a mostly useful tool for isolating well-interpretable clusters. The use of first person singular pronouns helped me discover a cluster at increased risk, but it would be useful to examine whether posts reflecting severe depression can also be identified through other linguistic markers.
