Theses – ELTE Research Center for Computational Social Science

Anna Krisztina Kovács – Automated text analysis with no-code tools: a demonstration of Meaning Extraction Helper and AntConc by examining the 2022 Hungarian online emigration discourse

2024 Sociology BA Supervisor Renáta Németh, PhD

Anna Krisztina Kovács (LinkedIn)

The number of recent studies on the topic confirms the growing role of automated text analysis in empirical social research. This thesis presents two automated text analysis tools that require no programming knowledge yet can answer relevant sociological research questions. The potential and limitations of the tools are illustrated through a review of previous studies and through example research conducted for this thesis. In the analysis, I review emigration discourses in lay public opinion following the 2022 parliamentary elections. The Meaning Extraction Method is used to present the main themes of the discourses, and AntConc is used to present the context of the most frequently used words.

Anna Sára Piros – Possibilities and limitations of using BERTopic

2024 Survey Statistics and Data Analytics MSc Supervisor Zsófia Rakovics

Anna Sára Piros (LinkedIn, GitHub)

I present the application and performance of a new topic modelling technique, BERTopic, in comparison to the commonly used LDA model. For a practical comparison, I tested one LDA and two BERTopic models on a corpus of English-language speeches of Prime Minister Viktor Orbán. I compared the optimized LDA model with one BERTopic model using fixed settings and another using optimized settings. To evaluate the models, I examined topic coherence and topic diversity indicators, as well as the interpretability of topic representations. The optimized LDA model produced redundant and incoherent topics, while both BERTopic models produced diverse, coherent and specific topics. BERTopic achieves better results, is simpler to use and offers a wide range of possibilities thanks to its modular, flexible architecture.
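As a rough illustration of the topic diversity indicator mentioned above, the sketch below uses one common definition: the share of unique words among the top words of all topics. The thesis may define the indicator differently; the function name `topic_diversity` and the example topics are hypothetical and invented for illustration.

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; lower values mean
    more redundancy. This is an assumed definition, not necessarily
    the one used in the thesis.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical topic representations, invented for illustration only.
redundant_topics = [
    ["europe", "hungary", "people", "country"],
    ["hungary", "people", "country", "government"],
]
distinct_topics = [
    ["migration", "border", "fence", "asylum"],
    ["economy", "tax", "jobs", "growth"],
]

print(topic_diversity(redundant_topics, top_k=4))  # 5 unique words out of 8
print(topic_diversity(distinct_topics, top_k=4))   # 8 unique words out of 8
```

A model whose topics repeat the same words scores lower, which is one way to quantify the redundancy described in the abstract.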

Boglárka Érsek – The usability of Word Embedding models in social research

2024 Sociology BA Supervisor Renáta Németh, PhD

Boglárka Érsek

In my thesis I explore the usability of word embedding models in social science. My aim is to show what kind of studies researchers have used this method for and how they have used it. In my approach, I focus on the “no code” technique, i.e. I investigate the potential for a researcher who does not know how to program to use the method. In my paper, I will first situate the topic within social science research methods, and then describe the essence of the method and its possible uses. By describing previous research, I will show that the method can be used for both technical and content-related applications, as well as for the critical analysis of algorithms based on linguistic models. In addition, I will demonstrate the applicability to Hungarian texts. In my pilot research, I will demonstrate how the method can be used without programming knowledge by using the online word embedding model WebVectors.

Péter Gelányi – Measuring media bias through word embeddings

2024 Survey Statistics and Data Analytics MSc Supervisor Zsófia Rakovics

Péter Gelányi (E-mail)

Word embeddings offer a quantitative representation of words’ semantic relationships. In my thesis, I explore their potential use in studying media bias and slant. The theoretical background of my work draws on the literature on both media bias and word embeddings. I detail my analysis of a newly collected Hungarian online media corpus. I fit multiple word embedding models, compare their performance, and use the best one to explore the semantic relationships of specific keywords across media outlets and with elements of a sentiment dictionary. My results highlight both the advantages and drawbacks of word embeddings.

Enikő Csaba – Solving the alignment problem of word embedding vector spaces with Procrustes transformations

2023 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Enikő Csaba

The thesis compares two corpora of articles from online news portals with different social perspectives by aligning their word embedding vector spaces, in order to identify the differences that result from the different contexts. A further aim of the thesis is to determine the suitability of the Procrustes transformation as a tool for aligning vector representations in a common space. Different word embeddings are first created and the most suitable model for the task is selected; the Procrustes transformations are then implemented and evaluated. After selecting the transformation with the lowest approximation error, the fitted vector space is analysed. The results confirm that the Procrustes transformation is suitable for solving the alignment problem that arises because separately trained embeddings do not share a coordinate system, and they identify topic-specific words that appear in different contexts in the two media.
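The idea of aligning two embedding spaces and measuring the remaining error can be sketched in a deliberately simplified setting. The example below handles only rotations of two-dimensional vectors, where the best angle has a closed form; the general orthogonal Procrustes problem in higher dimensions is usually solved with a singular value decomposition and also permits reflections. All function names and data are hypothetical, and this is not the implementation used in the thesis.

```python
import math


def fit_rotation_2d(source, target):
    """Angle (radians) of the rotation that maps source onto target
    with the smallest sum of squared distances (2D, rotation only)."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, angle):
    """Rotate every 2D point counter-clockwise by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def alignment_error(source, target, angle):
    """Sum of squared distances left after rotating source by angle."""
    rotated = rotate(source, angle)
    return sum((ax - tx) ** 2 + (ay - ty) ** 2
               for (ax, ay), (tx, ty) in zip(rotated, target))


# Toy 2D "embeddings" of the same three words in two hypothetical corpora.
# space_b is space_a rotated by 0.5 radians, so a perfect fit exists.
space_a = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
space_b = rotate(space_a, 0.5)

angle = fit_rotation_2d(space_a, space_b)
print(angle, alignment_error(space_a, space_b, angle))
```

With real corpora the fit is never perfect, and words whose vectors remain far apart after alignment are candidates for words used in different contexts.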

Réka Berbekár – Examining Trianon’s Memory Politics Using Machine Learning and Text Analytics

2022 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Réka Berbekár (LinkedIn, E-mail)

More than 100 years after signing the Treaty of Trianon, the presence of Trianon in public discourse is still very active. Monuments are unveiled, commemorations are held, and the situation of Hungarians beyond the borders is a constant topic of discussion among journalists and politicians.

In my thesis, I examine whether the style and subject matter of articles on Trianon differ between news portals with different political orientations. I created topics from the articles using LDA topic modeling and analysed the style using the NarrCat tool, with the help of Tibor Pólya (Eötvös Loránd Research Network, Research Centre for Natural Sciences). I measured the differences in the communication of the news portals using the success rate of classification algorithms. Topic affiliation probabilities and NarrCat scores were my explanatory variables, and the political affiliation of the websites publishing the articles was my grouping variable. The best algorithm classified the articles into one of the four political groups with 61.2% accuracy; the most important variables in this classification were the topic affiliation probabilities.

Zsolt Varga – Distance metric learning using Siamese networks for human pose similarity estimation

2022 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Zsolt Varga (LinkedIn)

This thesis proposes the use of deep similarity learning, specifically distance metric learning with a Siamese neural network architecture, to embed human poses into a lower dimensional space for similarity comparison. The goal is to create a map between the original input and the embedding such that the Euclidean distance is small for similar data points and large for dissimilar data points in the embedding space. The approach is shown to be effective in creating a semantic similarity-based human pose embedding that outperforms traditional approaches. The results demonstrate that using these embeddings leads to better classification performance and faster convergence during training. This approach has implications for creating systems that require non-trivial similarity measures, such as invariance to sidedness and the position of body parts, and can serve as input to further models. Overall, this thesis contributes to the development of more advanced techniques for human pose understanding and has potential applications in healthcare, education, fitness, and other fields.
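The training objective described above (small Euclidean distance for similar pairs, large distance for dissimilar pairs) is often expressed as a contrastive loss. The sketch below shows that loss for a single pair of already computed embeddings. It is an assumption that a loss of this form applies here; the thesis may use a different objective, such as a triplet loss. The names `contrastive_loss` and `margin` are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    Similar pairs are penalised by their squared distance; dissimilar
    pairs are penalised only while they are closer than `margin`.
    """
    d = euclidean_distance(emb_a, emb_b)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2D embeddings of poses, invented for illustration.
print(contrastive_loss((0.0, 0.0), (0.0, 0.0), is_similar=True))   # identical, similar pair
print(contrastive_loss((0.0, 0.0), (0.5, 0.0), is_similar=False))  # dissimilar but too close
print(contrastive_loss((0.0, 0.0), (3.0, 4.0), is_similar=False))  # dissimilar and far apart
```

In a Siamese setup both embeddings come from the same network with shared weights, and the loss is averaged over many labelled pairs.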

Bendegúz Zaboretzky – Depression and COVID-19 – topic modeling of online forums

2021 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Bendegúz Zaboretzky (LinkedIn, GitHub, E-mail)

The key role in a thorough understanding of depression lies with the person who is struggling with it. People in this situation can be effectively approached and examined through online forums on depression and related issues. A recent study has done this excellently, and the current work builds closely on it. The novelty of this research lies in examining the impact of COVID-19 and the resulting global pandemic on the discourse of depression. The aim of this paper is to build on the previous research, supplement its findings, and continue the line of investigation, taking this new effect into account. Accordingly, this study is also based on topic modeling and uses NLP (Natural Language Processing) methods – mainly LDA (Latent Dirichlet Allocation) and STM (Structural Topic Models) – to present the results.

The research was carried out in connection with the ELTE RC2S2 research group project, as a continuation of this paper.

Bernadett Csala-Ferencz – Cluster analysis of online depression forum posts – Applying the scatter / gather method on textual data

2021 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Bernadett Csala-Ferencz

Cases of depression are increasingly common in our times, and internet forums provide a great opportunity to better understand the nature of mental illnesses and to identify severe cases of depression. For the latter, examining divergent uses of pronouns (such as increased use of the first person singular) is an effective means of identification. For my research I performed cluster analysis on 66,295 posts from English-language forums concerned with depression, to examine the different groups into which these posts can be organized. Getting to know and understand these forums was not the only goal of this research. Methodologically, I wanted to find the optimal level of text preprocessing and to examine whether the scatter/gather algorithm can be used effectively to find interpretable clusters. In the course of my work, 15 clusters were identified, and the applied scatter/gather clustering method proved a largely useful tool for isolating well-interpretable clusters. The use of first person singular pronouns helped me discover a cluster at increased risk, but it could be useful to examine the identification of posts indicating severe depression through other linguistic markers too.

Lilla Békési – Holocaust denial and Holocaust-related distortions on the far-right portal Kuruc.info

2021 Sociology BA Supervisor Ildikó Barna, PhD

Lilla Békési

In my thesis, I examined the phenomenon of Holocaust denial and Holocaust-related distortions in articles and comments published on the far-right portal Kuruc.info. For my thesis, I conducted a qualitative secondary analysis of the texts collected by Ildikó Barna and Árpád Knap, who used topic modelling to research antisemitism on said portal. Using the category system developed by Manfred Gerstenfeld, I sought to answer questions such as which types of Holocaust-related distortions appear on the portal and which are the most frequent. I also investigated whether antisemitic views related to Holocaust distortion are detectable and to what extent users of the portal try to obscure their views. I have also tried to give some insight into the extent to which articles and comments differ in content or wording.

Anna Farkas – Social biases in machine learning: A case study of Google Translate

2020 Sociology BA Supervisor Renáta Németh, PhD

Anna Farkas

In recent years, several studies have been published about the phenomenon that machine learning algorithms are prone to reinforce or amplify human biases. This paper is a case study that investigates gender bias in Google Translate and its translations of occupations from Hungarian (a gender-neutral language) to English (a gender-based language). Using quantitative methods, the study aims to measure the extent of gender bias in machine translations. It examines the use of pronouns in the English translation of sentences such as “ő egy orvos” (“he/she is a doctor”).

To measure the bias in the algorithm, the study compares Google Translate’s translations to the proportion of men and women in each occupation, and to society’s perception of those occupations. To assess whether people find those occupations feminine or masculine, we used an omnibus survey created with the help of Inspira Group research company. The study found that Google Translate mirrors people’s perception of occupations to a greater extent than the proportion of men and women in those occupations.

The paper also examines how using attributives such as “good”, “very good”, “bad” and “very bad” in the sentences modifies the translation of the pronouns.

Dániel Tóbiás – Analyzing gender disparity on Twitch.tv channels with text mining techniques

2020 Sociology MA

Dániel Tóbiás (LinkedIn, E-mail)

Digitalization has opened a new era, and sociology has gained a new set of tools to analyze and survey society. Here I use one of these tools, text mining, to uncover gender disparity and gendered conversation on an online video game live-streaming platform and to reveal the potential of text mining. The results show some minor differences between female and male channels; however, there is no sign of gender disparity or objectification in the data.

Jakab Buda – Text classification with a recurrent neural network based language model

2020 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Jakab Buda

In my thesis I study text classification with recurrent neural networks, more precisely profiling authors by age and gender with language models. The requirements in this field are continuously changing due to technological developments and the ever-changing forms of online content; therefore, in the last couple of years many different solutions have been developed for this task. After a review of the most relevant natural language processing literature dealing with word embeddings, text classification, and language models, I discuss the theoretical background of recurrent neural networks and the most important methodological questions of machine learning. Lastly, I test different models with varying architecture and size on the PAN 2013 author profiling database. The question of the thesis is whether a classifier that consists of separate models fitted to each class, and that labels an item according to the class of the model that fits it best, can be a viable alternative to the standard classifier architectures. Although among the models fitted in the thesis these classifiers do not perform better overall than those with a standard classifier architecture, they appear capable of more balanced performance across the different classes.

Krisztián Boros – Meta-analysis of missing data handling methods with text-mining

2020 Survey Statistics and Data Analytics MSc

Krisztián Boros (LinkedIn, GitHub)

The ubiquity of missing data in quantitative research is undeniable. We may encounter missing data due to, for example, non-response, incorrect sampling, or data processing errors. During the past 50 years, researchers have developed a wide variety of missing data handling methods; the spectrum of available techniques extends from basic deletion methods (e.g. listwise and pairwise deletion) to more involved techniques (e.g. Multiple Imputation, EM algorithm).

The aim of my thesis is twofold. On the one hand, I introduce a text-mining approach to collect and analyze papers, pointing out the advantages and disadvantages of this particular approach using the Total Survey Error Framework. On the other hand, I examine the possible trends in missing data handling methods across years and scientific fields.

The results show that the popularity of advanced techniques (e.g. Multiple Imputation, EM algorithm) has been growing over the past 20 years, but the less advanced techniques (e.g. deletion methods, mean imputation) are still in widespread use. As for the methodology, several limitations of the text-mining approach were identified, such as the questionable generalizability and reliability of the results.

Norbert Kerekes – Multi-label classification of online forum posts

2020 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Norbert Kerekes (LinkedIn)

Multi-label classification is a machine learning task that is seldom mentioned, considering how prevalent the problem is in everyday life. The thesis addresses this problem, aiming to review and compare algorithms suited to solving multi-label problems. The most important representatives of the two main algorithm families (problem transformation and algorithm adaptation methods) are presented on a text classification problem. The database contains depression-related online forum entries categorized according to the biopsychosocial model.
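To make the problem transformation family concrete, the sketch below shows binary relevance, one of its simplest representatives: a multi-label dataset is split into one independent yes/no dataset per label, each of which can then be handed to any ordinary binary classifier. Whether the thesis uses this particular method is an assumption; the label names and posts are hypothetical and invented for illustration.

```python
def binary_relevance(samples, labels):
    """Split a multi-label dataset into one binary dataset per label.

    `samples` is a list of (text, set_of_labels) pairs. The result maps
    each label to a list of (text, has_label) pairs.
    """
    return {
        label: [(text, label in tags) for text, tags in samples]
        for label in labels
    }


# Hypothetical forum posts tagged with biopsychosocial categories.
LABELS = ("bio", "psycho", "social")
posts = [
    ("my medication makes me tired", {"bio"}),
    ("I feel worthless and my friends avoid me", {"psycho", "social"}),
]

per_label = binary_relevance(posts, LABELS)
for label, dataset in per_label.items():
    print(label, dataset)
```

Binary relevance ignores correlations between labels, which is one reason other transformation methods and algorithm adaptation methods exist.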

András Hering – Applications of Random Forest methods

2019 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

András Hering (E-mail)

Machine learning algorithms have emerged as an alternative to mainstream statistics, optimising prediction accuracy to its limits, but they are often incomprehensible. Leo Breiman argued for what he called algorithmic culture: the most accurate model is preferred to a worse but more interpretable one. In my thesis, I use Leo Breiman’s and Adele Cutler’s Random Forest classifier to re-evaluate a study of learning types that used logistic regression. My goal is to see what new and already known information can be obtained from the Random Forest model, which is expected to provide better accuracy, in the social sciences, where interpretation is key. After introducing the complex ensemble of decision trees that is Random Forests, I demonstrate the three main sources for evaluating the model: out-of-bag estimates, variable importance, and multidimensional scaling. In my analysis, I produce a marginally better RF classifier and find similarities and differences compared to the original research: one particular similarity is the connection between partial dependence based on the class vote proportions of trees and the logistic regression coefficients.

Beáta Gallina – Sentiment analysis on articles from online news sites

2019 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Beáta Gallina (LinkedIn, GitHub)

In my thesis I focus on sentiment analysis (SA) of Hungarian online news articles. In this case study, I present the methodological steps of text mining and sentiment analysis, with special emphasis on preprocessing, and the most important SA models, and then carry out a comparative analysis. In addition, I contrast two traditional models (lexicon-based and machine-learning-based) with their combination, and use the best-performing model to answer the following social science research questions: To what extent do emotional attitudes towards political actors appear in the Hungarian online press? Did journalists’ perception of political actors change as a result of the elections? Is there a parallel between the results of traditional popularity polls and those of SA; more specifically, is there a relationship between voters’ preferences and the valence of a political actor’s media presence?

After the model evaluation, I worked with a Naive Bayes classifier. Based on the outcomes, the largest sentiment category is neutral, but the dominant class is greatly influenced by which political actor is represented in the given text. The work revealed that election day had an impact on how politicians were portrayed in the media: most opposition politicians appeared in a more negative light in the opposition media after the vote than before. For some parties, polls and SA show a similar tendency.

The accuracy of the models could be further enhanced by inclusion of other features – namely topics, n-grams, article authors – a larger training set and a more comprehensive sentiment dictionary.

Keywords: elections, text mining, sentiment analysis, polls, machine learning, Naive Bayes classifier

Balázs Mayer – The effect of homophily on opinion dynamics processes in social networks – agent based social simulation

2018 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Balázs Mayer (E-mail)

I have studied the effect of homophily on opinion dynamics processes in social networks by agent based social simulation. My main hypothesis (based on the findings of Gargiulo and Gandica, 2017) was that greater opinion homophily leads to an increased chance of consensus formation.

Contrary to the original paper, where the opinion variable had a random uniform distribution and similarity between agents was measured only along this one dimension, my own growing network model considers the well-known phenomenon of preferential attachment, the homophily of agents by their demographic attributes (derived from ego-network data about Hungarian society in the 2000s using a case-control framework), and five different (simulated and real-world) opinion distributions, according to which homophily could be tuned.

The resulting graphs captured the phenomenon of more similar agents being connected with greater probability, in terms of both their opinions and their demographic attributes, and networks with increased opinion homophily displayed greater modularity than simple preferential attachment ones. However, the networks created considering only the similarity of demographic attributes did not show increased modularity.

Upon analysing the opinion dynamics processes in the networks, the initial hypothesis was confirmed: it seems that the consensus-stimulating effect of opinion homophily does not depend greatly on the distribution of the opinion variable, nor does introducing demographic homophily change this association.

References:

Gargiulo, F., & Gandica, Y. (2017). The role of homophily in the emergence of opinion controversies. Journal of Artificial Societies and Social Simulation, 20(3), 8.

URL: http://jasss.soc.surrey.ac.uk/20/3/8.html