Theses – ELTE Research Center for Computational Social Science

András Richárd Wernigg – Application of Monte Carlo Simulation in the Analysis of Healthcare Quality and Risks: A Complex Decision-Support Model Using Hernia Surgery as an Example

2026 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

András Richárd Wernigg (LinkedIn)

This research examines the quality and accessibility of hernia care in Hungary using stochastic Monte Carlo simulation. The model, based on 2022 baseline data, demonstrates the dangers of deterministic capacity planning and the “error of averages.” The 10-year forecast highlights the burden caused by an aging population, regional disparities, and the volume-limiting effect of the HBCS system. Stochastic modeling provides transparent decision support for optimizing scarce capacities.
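
The “error of averages” mentioned above can be illustrated with a small sketch. It is not the thesis model: the demand distribution, the weekly capacity and all numbers are illustrative assumptions, and `simulate_unmet_demand` is a hypothetical helper. It only shows why a plan that matches capacity to average demand looks sufficient on paper while a stochastic simulation reports unmet demand.

```python
import random

def simulate_unmet_demand(mean_demand, capacity, n_weeks, n_runs, seed=0):
    """Average number of patients per week who cannot be treated, over many simulated runs."""
    rng = random.Random(seed)
    total_unmet = 0.0
    for _ in range(n_runs):
        for _ in range(n_weeks):
            # Assumed demand model: normal noise around the mean, floored at zero.
            demand = max(0.0, rng.gauss(mean_demand, 0.2 * mean_demand))
            # Unused capacity in a quiet week cannot be saved for a busy week.
            total_unmet += max(0.0, demand - capacity)
    return total_unmet / (n_runs * n_weeks)

mean_demand = 100.0  # illustrative number, not taken from the thesis
capacity = 100.0     # capacity planned to match average demand exactly

# Deterministic plan: demand equals its average every week, so nothing is unmet.
deterministic_unmet = max(0.0, mean_demand - capacity)

# Stochastic view: weeks above average overflow, weeks below leave capacity idle.
stochastic_unmet = simulate_unmet_demand(mean_demand, capacity, n_weeks=52, n_runs=200)

print(deterministic_unmet, round(stochastic_unmet, 2))
```

The deterministic figure is zero by construction, while the simulated figure is positive because shortfalls and surpluses do not cancel out.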

Balázs Tobak – Development and Comparison of Various Machine Learning Models for Decision Support in Predicting Formula 1 Overtaking Attempts

2026 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Balázs Tobak

This thesis examines the prediction of the outcomes of Formula 1 overtaking attempts based on data from 2018 to 2025. I developed and compared three models (logistic regression, XGBoost, and an Entity Embedding neural network) in terms of predictive accuracy and real-time decision support effectiveness. The results show that XGBoost is the most accurate and stable. The research highlights the importance of interpretability and the dominance of dynamic factors.

Dávid Angyalffy – An Analysis of the Language Used on index.hu Following the 2020 Change in Ownership Using Machine Learning Methods

2026 Survey Statistics and Data Analytics MSc Supervisor Jakab Buda

Dávid Angyalffy (LinkedIn, GitHub, E-mail)

The polarization of the Hungarian online media landscape and the quantitative assessment of changes in editorial policy are current issues in digital journalism research. This study analyzes whether a change in the language of the Index.hu portal can be detected following the 2020 change in ownership and editorial leadership. The analysis employs a proxy-based approach. In this process, machine learning models are trained on articles from two news portals representing different editorial policies—HVG and Origo—and these models are then applied to Index articles. The analysis compares the performance of three model architectures—logistic regression, XGBoost, and BiLSTM with Hungarian fastText embeddings—supplemented by SHAP- and LIME-based interpretability analyses. Statistical validation was performed using the Mann-Whitney U test and Cohen’s d effect size. The results suggest that the average P(Origo) value of Index articles from 2021 is consistently higher than that of articles from 2019 for all three models, indicating a systematic shift toward the Origo stylistic pole.
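
As a rough illustration of the two validation statistics named above, the sketch below computes Cohen’s d and the Mann-Whitney U statistic with the standard library only. The scores are made-up stand-ins for P(Origo) values, not data from the thesis, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def mann_whitney_u(a, b):
    """U statistic for sample a: number of (a, b) pairs where a is larger, ties count half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Made-up classifier scores standing in for P(Origo) of articles from two years.
scores_2019 = [0.30, 0.35, 0.40, 0.45, 0.50]
scores_2021 = [0.45, 0.50, 0.55, 0.60, 0.65]

d = cohens_d(scores_2021, scores_2019)
u = mann_whitney_u(scores_2021, scores_2019)
print(round(d, 3), u)
```

A U value close to the number of pairs (here 25) means that almost every 2021 score exceeds almost every 2019 score; d expresses the same shift in units of the pooled standard deviation.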

János Sebestyén Pap – An Analysis of Political Polarization in Hungarian Online Media Through the Lens of Mentions of Individuals

2026 Survey Statistics and Data Analytics MSc Supervisor Zsófia Rakovics

János Sebestyén Pap (LinkedIn, E-mail)

This thesis examines political polarization in Hungarian online news media through patterns of mentions of individuals’ names. The research is based on the hypothesis that polarization manifests not only in opinions but also in which public figures the media highlight over others, and what structural patterns and groupings can be observed among them. The empirical analysis is based on a text corpus comprising articles published between 1998 and 2022. The study combines natural language processing and network analysis methods, using named entity recognition to identify individuals and then constructing news site–person type-pair networks. Relationships are filtered using the comparative advantage index and evaluated using a null-model-based approach with the help of a pairwise configuration model. The study examines the differentiation of news sites based on co-occurrence within and between groups, as well as a polarization index calculated from these data. The results indicate significant polarization following 2017, which is consistent with previous research on the political fragmentation of the Hungarian media system. This study contributes to the network-based measurement of media polarization; however, its limitations include the accuracy of entity recognition and the representativeness of the data.
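
The edge-filtering step can be sketched as follows, assuming that the comparative advantage index is a Balassa-style ratio and that pairs with a value above 1 are kept. Both are assumptions, since the abstract does not give the formula or the threshold. The site names, person names and counts are placeholders.

```python
def comparative_advantage(counts):
    """Balassa-style index for each (site, person) pair.

    counts maps (site, person) -> number of mentions. A value above 1 means
    the site mentions the person more often than expected from the site's
    size and the person's overall prominence.
    """
    total = sum(counts.values())
    site_totals, person_totals = {}, {}
    for (site, person), n in counts.items():
        site_totals[site] = site_totals.get(site, 0) + n
        person_totals[person] = person_totals.get(person, 0) + n
    return {
        (site, person): (n / site_totals[site]) / (person_totals[person] / total)
        for (site, person), n in counts.items()
    }

# Made-up mention counts; site and person names are placeholders.
counts = {
    ("site_a", "person_x"): 80, ("site_a", "person_y"): 20,
    ("site_b", "person_x"): 20, ("site_b", "person_y"): 80,
}
rca = comparative_advantage(counts)
kept_edges = sorted(pair for pair, value in rca.items() if value > 1.0)
print(kept_edges)
```

In this toy example each site keeps only the person it over-represents, which is the kind of bipartite structure a null-model comparison would then evaluate.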

Levente Kander – The Application of Contrastive Learning to Tabular Data on Patients with Aortic Valve Stenosis

2026 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Levente Kander

This thesis attempts to more accurately map the patterns underlying tabular data from patients with aortic valve stenosis by applying a self-supervised contrastive learning technique, thereby facilitating the creation of a more robust patient segmentation. First, we present the theoretical background of contrastive learning, followed by the structure and operation of the TabContrast method. The clusterability of the embeddings generated by the encoder was evaluated both visually and based on the silhouette metric. Based on the results, it was found that, compared to the original raw data, TabContrast provides better clusterability, alongside a structure that reflects general clinical and cardiovascular risk differences. However, there is no significant difference in accuracy between the random forest models trained on the embedded data and those trained on the original vector space when classifying calcium scores measured on the aortic valve.

Mátyás Jakab Keindl – Estimating Turnout Data for Local Elections Using Spatial Models

2026 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Mátyás Jakab Keindl

In my thesis, I compare two methods with different approaches, using a common set of variables, with the goal of estimating voter turnout. The question is whether the more modern machine learning method (XGBoost) performs better for this task, or whether the well-established regression method (OLS) remains dominant. It is also possible that there is no significant difference between the two. Although parliamentary elections, which determine the composition of the National Assembly, generally receive greater attention, including in the media (Bódi and Bódi, 2011), municipal elections are a more suitable case for this analysis because they are more recent (the most recent municipal election was in 2024, while the parliamentary election was in 2022). I therefore model voter turnout in mayoral elections. My units of observation are data aggregated at the municipal level, rather than at the county or electoral district level, since this provides enough observations to meet the conditions the models require. Furthermore, this is the most detailed level of data available to me for the variables I intend to use.

Supervisors: Károly Bozsonyi and Renáta Németh

Péter Sipos – The Predictive Utility of Inter-Match Correlations in the Premier League: Opportunities and Limitations Using Publicly Available Data

2026 Survey Statistics and Data Analytics MSc Supervisor Jakab Buda

Péter Sipos (LinkedIn)

This research examines whether a valid fatigue metric can be created based on publicly available event and injury data, and whether the predictive accuracy for Premier League matches can be improved by incorporating fatigue, form, and historical match data. The analysis, covering the 2015/16–2024/25 seasons, employs a RAPM model, ELO ratings, xG-based form, and per-minute workload metrics within an ordinal logistic regression and XGBoost framework. The results indicate that no reliable correlation can be established between workload metrics and injuries, and that the inclusion of contextualized variables does not provide any significant predictive advantage over the baseline model in either the player performance or match outcome models.
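
For readers unfamiliar with ELO ratings, the following is a minimal sketch of the standard update rule. The K-factor of 20 and the absence of a home-advantage or draw adjustment are assumptions made for illustration, and the thesis may use a different variant.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=20.0):
    """Return both ratings after one match; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated teams: the winner gains exactly half of K.
home, away = elo_update(1500.0, 1500.0, score_a=1.0)
print(home, away)
```

Because the update is zero-sum, the rating difference between two teams before a match can serve directly as a model feature.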

Alexandra Fodor – The potential applications of large language models in text analysis annotation

2025 Survey Statistics and Data Analytics MSc Supervisor Eszter Katona, PhD

Alexandra Fodor

The thesis examines the applicability of generative large language models in text analysis annotation tasks using a corpus of texts related to depression. The research compares the performance of the closed-source GPT-4o mini and the open-source Llama 3.3 70B, comparing the results of zero-shot and few-shot techniques for both models. In terms of accuracy, the few-shot approach led to a slight improvement over the zero-shot technique. Overall, the Llama model performed slightly better than GPT. The two models performed moderately in terms of accuracy, but their consistency and reliability can be considered high.

Balázs Gályász – Interpretations of Trianon in the political sphere: Narrative clusters and thematic differences

2025 Sociology BA Supervisor Ildikó Barna, PhD

Balázs Gályász

One of the reasons for the increasing dominance of the ruling party is the memory politics it has developed. A key element of this is the Treaty of Trianon, a historical event that continues to live on in the present. In my thesis, I examined the political narratives surrounding the collective memory of the Trianon Peace Treaty. I analyzed a database of online newspaper articles using cluster analysis and the NarrCat toolkit. In my analysis, I compared pro-government, left-liberal, and far-right discourses, revealing their narrative differences across various topics.

János Zsolt Makláry – Classification of research abstracts generated by artificial intelligence and written by humans

2025 Survey Statistics and Data Analytics MSc Supervisor Jakab Buda

János Zsolt Makláry (LinkedIn, E-mail)

The paper examines the accuracy with which classical machine learning algorithms can recognize content generated by artificial intelligence compared to a modern transformer-based detector, and also discusses the impact of AI on the academic environment.

Máté Könye – Comparison of the performance of LSTM and GRU neural networks in classifying fake news using different pre-processing strategies

2025 Survey Statistics and Data Analytics MSc Supervisor Jakab Buda

Máté Könye (LinkedIn, GitHub)

The spread of fake news poses significant public health, social and political risks. The aim of this study is to compare the performance of two advanced recurrent neural network architectures, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), in a binary fake news classification task. The models were evaluated using various text preprocessing strategies (e.g., lemmatization, stopword handling, numerical data conversion) using GloVe word embeddings. The analysis used several independent and thematically diverse English-language news corpora as training and test sets. The results suggest that certain preprocessing steps, such as converting numbers to text form and retaining stopwords, can significantly improve predictive performance. GRU models performed better on test sets containing articles from 2016, while the LSTM architecture proved to be more reliable and accurate on the most recent news articles from 2025. The results highlight the importance of the interaction between neural architectures and preprocessing methods and may point the way to the development of more effective automated fake news filtering systems.

Péter Ódor – Modeling the lowest-cost routes in spatial analysis for archaeological purposes: Possibilities for reconstructing the late medieval road network of Tolna county

2025 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Péter Ódor (Academia.edu)

The thesis is a complex spatial analysis that attempts to partially reconstruct the road network of Tolna County in the late Middle Ages using a widely used modeling tool, the lowest cost path (LCP) calculation. The reference points for the modeling are the thoroughly researched medieval settlement history of the area under study, landscape archaeological observations from recent decades, and medieval roads identified based on historical sources. In the preparatory phase, the most important factors were the well-founded modeling of the main road cost factors, the selection of cost functions used for the calculation, the characteristics of the geographical environment, and the setting of the road search. The LCP roads authentically represented the identifiable road network elements between neighboring medieval settlements. Based on the modeled roads, an overview reconstruction was realized, but there are several opportunities for further development of the method and deepening of the analysis.

The thesis’ appendix can be accessed here.
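
The lowest cost path idea described in the abstract can be sketched on a toy grid. This is a generic Dijkstra search, not the GIS workflow of the thesis: the cost surface, the four-neighbour movement rule and the `least_cost_path` helper are all illustrative assumptions.

```python
import heapq

def least_cost_path(cost, start, goal):
    """Dijkstra on a grid; moving into a cell costs that cell's value (4-neighbour moves)."""
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        dist, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry, a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_dist = dist + cost[nr][nc]
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    previous[(nr, nc)] = cell
                    heapq.heappush(queue, (new_dist, (nr, nc)))
    # Walk back from the goal to recover the route.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return best[goal], path

# Toy cost surface: the 9s stand for a steep ridge, the 1s for easy terrain.
cost = [
    [1, 9, 1],
    [1, 9, 1],
    [1, 1, 1],
]
total, path = least_cost_path(cost, start=(0, 0), goal=(0, 2))
print(total, path)
```

The route goes around the ridge rather than over it, which is the behaviour that makes the choice of cost factors and cost functions so important in the preparatory phase.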

Tamás Varga – Word embedding, knowledge graphs, and GAT neural network application: predicting the severity of acute pancreatitis

2025 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Tamás Varga (LinkedIn)

In the thesis, the author attempted to predict the severity of acute pancreatitis using a graph-based neural network that employs an attention mechanism, in order to present a framework and deep learning model that enables the predictive analysis of tabular data using a graph-based approach. The thesis describes in detail the methodology and application of the Graph Attention Network model, with a particular focus on presenting the optimal parameter settings and decision points for the research problem. In order to put the results of the neural network-based analysis into context, the author also analyzed the research problem using machine learning methods and compared the results with those of the deep learning algorithm. As a result, the thesis contributes to the methodological and practical understanding of the attention-based deep learning model discussed.

Anna Krisztina Kovács – Automated text analysis with no-code tools: a demonstration of Meaning Extraction Helper and AntConc by examining the 2022 Hungarian online emigration discourse

2024 Sociology BA Supervisor Renáta Németh, PhD

Anna Krisztina Kovács (LinkedIn)

The number of studies on this topic in recent years confirms the growing role of automated text analysis within empirical social research. In this thesis, two automated text analysis tools are presented that do not require programming knowledge but can answer relevant sociological research questions. The potentials and limitations of the tools are illustrated through a review of previous studies and through example research conducted in this thesis. In the analysis, I review emigration discourses in lay public opinion following the 2022 parliamentary elections. Using the Meaning Extraction Method, the main themes of the discourses are presented, and using AntConc, the context of the most frequently used words is presented.

Anna Sára Piros – Possibilities and limitations of using BERTopic

2024 Survey Statistics and Data Analytics MSc Supervisor Zsófia Rakovics

Anna Sára Piros (LinkedIn, GitHub)

I present the application and performance of a new topic modelling technique, BERTopic, in comparison to the commonly used LDA model. For a practical comparison, I tested one LDA and two BERTopic models on a corpus of English-language speeches of Prime Minister Viktor Orbán. For the optimized LDA model, I applied fixed settings to one BERTopic model and optimized settings to the other. To evaluate the models, I examined topic coherence and topic diversity indicators, as well as the interpretability of topic representations. The optimised LDA model produced redundant and incoherent topics, while both BERTopic models produced diverse, coherent and specific topics. BERTopic achieves better results, is simpler to use and has a wide range of possibilities thanks to its modular, flexible architecture.

Boglárka Érsek – The usability of Word Embedding models in social research

2024 Sociology BA Supervisor Renáta Németh, PhD

Boglárka Érsek

In my thesis I explore the usability of word embedding models in social science. My aim is to show what kind of studies researchers have used this method for and how they have used it. In my approach, I focus on the “no code” technique, i.e. I investigate the potential for a researcher who does not know how to program to use the method. In my paper, I will first situate the topic within social science research methods, and then describe the essence of the method and its possible uses. By describing previous research, I will show that the method can be used for both technical and content-related applications, as well as for the critical analysis of algorithms based on linguistic models. In addition, I will demonstrate the applicability to Hungarian texts. In my pilot research, I will demonstrate how the method can be used without programming knowledge by using the online word embedding model WebVectors.

Péter Gelányi – Measuring media bias through word embeddings

2024 Survey Statistics and Data Analytics MSc Supervisor Zsófia Rakovics

Péter Gelányi (E-mail)

Word embeddings offer a quantitative representation of words’ semantic relationships. In my thesis, I explore their potential use in studying media bias and slant. The theoretical background of my work is embedded in both the literature on media bias and word embeddings. I detail my analysis of a newly collected Hungarian online media corpus. I fit multiple word embedding models, compare their performance, and use the best one to explore the semantic relationships of specific keywords across mediums and with elements of a sentiment dictionary. My results highlight both the advantages and drawbacks of word embeddings.

Enikő Csaba – Solving the alignment problem of word embedding vector spaces with Procrustes transformations

2023 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Enikő Csaba

The thesis attempts to compare two corpora of articles from online news portals with different social perspectives by matching word embedding vector spaces in order to define the differences resulting from the different contexts. In addition, a further aim of this thesis is to determine the suitability of the Procrustes transform as a tool for matching vector representations in a common space. By creating different word embeddings, the most suitable model for the task is first selected, and then the Procrustes transformations are implemented and evaluated. After selecting the transformation with the lowest approximation error, the fitted vector space is analysed. The results confirm on the one hand that the Procrustes transform is suitable for dealing with the matching problem due to the mismatch of embeddings, and on the other hand, it identifies topic-specific words that appear in different contexts in the two media.
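
The alignment idea can be illustrated in two dimensions, where the rotation-only Procrustes problem has a closed-form solution. This is a toy sketch under strong assumptions: real word embeddings have hundreds of dimensions and are usually aligned with an SVD-based orthogonal Procrustes solution, and the points below are invented.

```python
from math import atan2, cos, sin

def best_rotation_angle(source, target):
    """Angle of the 2D rotation that minimises the squared distance between
    rotated source points and target points (rotation-only Procrustes)."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return atan2(cross, dot)

def rotate(points, angle):
    """Rotate each 2D point counter-clockwise by the given angle (radians)."""
    c, s = cos(angle), sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Toy "embeddings": the same three word vectors in two spaces that differ by a rotation.
space_a = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
space_b = rotate(space_a, 0.7)

angle = best_rotation_angle(space_a, space_b)
aligned = rotate(space_a, angle)
# Approximation error after alignment: sum of squared distances between matched points.
error = sum((ax - bx) ** 2 + (ay - by) ** 2
            for (ax, ay), (bx, by) in zip(aligned, space_b))
print(round(angle, 6), error < 1e-12)
```

With real corpora the two spaces do not differ by a pure rotation, so the residual error is non-zero, and words with large residuals are candidates for context-dependent usage.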

Réka Berbekár – Examining Trianon’s Memory Politics Using Machine Learning and Text Analytics

2022 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Réka Berbekár (LinkedIn, E-mail)

More than 100 years after signing the Treaty of Trianon, the presence of Trianon in public discourse is still very active. Monuments are unveiled, commemorations are held, and the situation of Hungarians beyond the borders is a constant topic of discussion among journalists and politicians.

In my thesis, I examine whether the style and subject matter of articles on Trianon published on politically different news portals differ. I created topics from the articles using LDA topic modeling and analysed the style using the NarrCat tool, with the help of Tibor Pólya (Eötvös Loránd Research Network, Research Centre for Natural Sciences). I measured the differences in the communication of news portals using the success rate of classification algorithms. Topic affiliation probabilities and NarrCat scores were my explanatory variables, and the political affiliation of the websites publishing the articles was my grouping variable. The best algorithm classified the articles into one of the four political groups with 61.2% accuracy, the most important variables in this classification being the topic affiliation scores.

Zsolt Varga – Distance metric learning using Siamese networks for human pose similarity estimation

2022 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Zsolt Varga (LinkedIn)

This thesis proposes the use of deep similarity learning, specifically distance metric learning with a Siamese neural network architecture, to embed human poses into a lower dimensional space for similarity comparison. The goal is to create a map between the original input and the embedding such that the Euclidean distance is small for similar data points and large for dissimilar data points in the embedding space. The approach is shown to be effective in creating a semantic similarity-based human pose embedding that outperforms traditional approaches. The results demonstrate that using these embeddings leads to better classification performance and faster convergence during training. This approach has implications for creating systems that require non-trivial similarity measures, such as invariance to sidedness and the position of body parts, and can serve as input to further models. Overall, this thesis contributes to the development of more advanced techniques for human pose understanding and has potential applications in healthcare, education, fitness, and other fields.
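
The stated goal, a small Euclidean distance for similar pairs and a large one for dissimilar pairs, is often expressed as a pairwise contrastive loss. The sketch below shows that loss on hand-written embedding vectors. The abstract does not name the loss actually used, so the formula, the margin of 1.0 and the function name are assumptions.

```python
from math import dist

def contrastive_loss(embedding_a, embedding_b, similar, margin=1.0):
    """Pairwise contrastive loss on Euclidean distance.

    Similar pairs are penalised by their squared distance; dissimilar pairs
    are penalised only while they are closer than the margin.
    """
    d = dist(embedding_a, embedding_b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Invented 2D embeddings of pose pairs.
loss_similar_far = contrastive_loss((0.0, 0.0), (3.0, 4.0), similar=True)
loss_dissimilar_far = contrastive_loss((0.0, 0.0), (3.0, 4.0), similar=False)
loss_dissimilar_near = contrastive_loss((0.0, 0.0), (0.5, 0.0), similar=False)
print(loss_similar_far, loss_dissimilar_far, loss_dissimilar_near)
```

In a Siamese setup both embeddings come from the same network with shared weights, and minimising this loss over many labelled pairs shapes the embedding space.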

Bendegúz Zaboretzky – Depression and COVID-19 – topic modeling of online forums

2021 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Bendegúz Zaboretzky (LinkedIn, GitHub, E-mail)

The key role in a thorough understanding of depression lies with the person who is struggling with it. People in this situation can be effectively approached and examined through online forums regarding depression and related issues. Another recent study has done this excellently, upon which this current work is closely built. The novelty of this research lies in the examination of the impact of COVID-19 and the resulting global pandemic on the discourse of depression. The aim of this paper is to build on previous research, supplement the findings, and continue the line of investigation, taking into account this new effect. As a result, this study is also based on topic modeling and uses NLP (Natural Language Processing) methods – mainly LDA (Latent Dirichlet Allocation) and STM (Structural Topic Models) – to present the results.

The research was carried out in connection with the ELTE RC2S2 research group project, as a continuation of this paper.

Bernadett Csala-Ferencz – Cluster analysis of online depression forum posts – Applying the scatter / gather method on textual data

2021 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Bernadett Csala-Ferencz

Cases of depression are increasingly common in our times, and internet forums provide a great opportunity to better understand the nature of mental illnesses and to identify severe cases of depression. For the latter, examining divergent uses of pronouns (such as increased usage of the first person singular) is an effective means of identification. For my research, I performed a cluster analysis on 66,295 posts from English-language forums concerned with depression, to examine the groups into which these posts can be organized. Getting to know and understand these forums was not the only goal of this research. Methodologically, I wanted to find the optimal preprocessing level for the texts and examine whether the scatter/gather algorithm can be used effectively to find interpretable clusters. In the course of my work, 15 clusters were identified, and the applied scatter/gather clustering method proved to be a mostly useful tool for isolating well-interpretable clusters. The usage of first person singular pronouns helped me discover a cluster at increased risk, but it could be useful to examine the identification of posts with severe cases of depression through other linguistic markers too.

Lilla Békési – Holocaust denial and Holocaust-related distortions on the far-right portal Kuruc.info

2021 Sociology BA Supervisor Ildikó Barna, PhD

Lilla Békési

In my thesis, I examined the phenomenon of Holocaust denial and Holocaust-related distortions in articles and comments published on the far-right portal Kuruc.info. For my thesis, I conducted a qualitative secondary analysis of the texts collected by Ildikó Barna and Árpád Knap, who used topic modelling to research antisemitism on said portal. Using the category system developed by Manfred Gerstenfeld, I sought to answer questions such as which types of Holocaust-related distortions appear on the portal and which are the most frequent. I also investigated whether antisemitic views related to Holocaust distortion are detectable and to what extent users of the portal try to obscure their views. I have also tried to give some insight into the extent to which articles and comments differ in content or wording.

Anna Farkas – Social biases in machine learning: A case study of Google Translate

2020 Sociology BA Supervisor Renáta Németh, PhD

Anna Farkas

In recent years, several studies have been published about the phenomenon that machine learning algorithms are prone to reinforce or amplify human biases. This paper is a case study that investigates gender bias in Google Translate and its translations of occupations from Hungarian (a gender-neutral language) to English (a gender-based language). Using quantitative methods, the study aims to measure the extent of gender bias in machine translations. It examines the use of pronouns in the English translation of sentences such as “ő egy orvos” (“he/she is a doctor”).

To measure the bias in the algorithm, the study compares Google Translate’s translations to the proportion of men and women in each occupation, and to society’s perception of those occupations. To assess whether people find those occupations feminine or masculine, we used an omnibus survey created with the help of Inspira Group research company. The study found that Google Translate mirrors people’s perception of occupations to a greater extent than the proportion of men and women in those occupations.

The paper also includes research on how using attributives such as “good”, “very good”, “bad”, “very bad” in the sentences modifies the translations of the pronouns.

Dániel Tóbiás – Analyzing gender disparity on Twitch.tv channels with text mining techniques

2020 Sociology MA

Dániel Tóbiás (LinkedIn, E-mail)

Digitalization has opened a new era, and sociology has gained a new set of tools to analyze and survey society. Here I use one of these tools, text mining, to uncover gender disparity and gendered conversation on an online video game live-streaming platform and to reveal the potential of text mining. The results show some minor differences between female and male channels; however, there is no sign of gender disparity or objectification in the data.

Jakab Buda – Text classification with a recurrent neural network based language model

2020 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

Jakab Buda

In my thesis, I study text classification with recurrent neural networks, more precisely profiling authors by age and gender with language models. The requirements in this field are continuously changing due to technological developments and the ever-changing forms of online content; therefore, many different solutions have been developed for this task in the last couple of years. After a review of the most relevant natural language processing literature dealing with word embeddings, text classification, and language models, I discuss the theoretical background of recurrent neural networks and the most important methodological questions of machine learning. Lastly, I test different models with varying architecture and size on the PAN 2013 author profiling database. The question of the thesis is whether a classifier that consists of different models fitted to each class, and that labels an item according to the class of the model that fits it best, can be a viable alternative to the standard classifier architectures. Although, among the models fitted in the thesis, these classifiers do not have a better overall performance than those with a standard classifier architecture, they seem capable of more balanced performance across the different classes.

Krisztián Boros – Meta-analysis of missing data handling methods with text-mining

2020 Survey Statistics and Data Analytics MSc

Krisztián Boros (LinkedIn, GitHub)

The ubiquity of missing data in quantitative research is undeniable. We may encounter missing data due to, for example, non-response, incorrect sampling, or data processing errors. During the past 50 years, researchers have developed a wide variety of missing data handling methods; the spectrum of available techniques extends from basic deletion methods (e.g. listwise and pairwise deletion) to more involved techniques (e.g. Multiple Imputation, the EM algorithm).

The aim of my thesis is twofold. On the one hand, I introduce a text-mining approach to collecting and analyzing papers, pointing out the advantages and disadvantages of this approach using the Total Survey Error Framework. On the other hand, I examine possible trends in missing data handling methods across years and scientific fields.

The results show that the popularity of advanced techniques (e.g. Multiple Imputation, the EM algorithm) has been growing over the past 20 years, but less advanced techniques (e.g. deletion methods, mean imputation) are still in widespread use. Regarding the methodology, several limitations of the text-mining approach were identified, such as the questionable generalizability and reliability of the results.

Norbert Kerekes – Multi-label classification of online forum posts

2020 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Norbert Kerekes (LinkedIn)

Multi-label classification is a machine learning task that is seldom mentioned, considering how prevalent the problem is in everyday life. The thesis addresses this problem, aiming to review and compare algorithms suited to solving multi-label problems. The most important representatives of the two main algorithm families (problem transformation and algorithm adaptation methods) are presented on a text classification problem. The database contains depression-related online forum entries categorized by the biopsychosocial model.

András Hering – Applications of Random Forest methods

2019 Survey Statistics and Data Analytics MSc Supervisor Márton Rakovics

András Hering (E-mail)

Machine learning algorithms have emerged as an alternative to mainstream statistics, pushing prediction accuracy to its limits, but they are often incomprehensible. Leo Breiman argued for what he called algorithmic culture: the most accurate model is preferred to a worse, but more interpretable one. In my thesis, I use Leo Breiman’s and Adele Cutler’s Random Forest classifier to re-evaluate a study on learning types that used logistic regression. My goal is to search for new and already known information provided by the Random Forest model, which is expected to provide better accuracy, in the field of social sciences, where interpretation is key. After introducing the complex ensemble of decision trees that is Random Forests, I demonstrate the three main sources for evaluating the model: out-of-bag estimates, variable importance, and multidimensional scaling. During my analysis, I produce a marginally better RF classifier, and I find similarities and differences compared to the original research: one particular similarity is the connection between partial dependence based on the class vote proportions of trees and the logistic regression coefficients.

Beáta Gallina – Sentiment analysis on articles from online news sites

2019 Survey Statistics and Data Analytics MSc Supervisor Renáta Németh, PhD

Beáta Gallina (LinkedIn, GitHub)

In my thesis, I focus on sentiment analysis (SA) of Hungarian online news articles. In this case study, I present the methodological steps of text mining and sentiment analysis, with special emphasis on preprocessing, and the most important SA models, and then carry out a comparative analysis. In addition, I contrast two traditional models (lexicon-based and machine learning-based) with their combination, and use the best-performing model to answer the following social science research questions: To what extent do emotional attitudes related to political actors appear in the Hungarian online press? Did journalists’ perception of political actors change as a result of the elections? Is there a parallel between the results of traditional popularity polls and the results of SA, or more specifically, is there a relationship between voters’ preferences and the valence of political actors’ media presence?

After the model evaluation, I worked with a Naive Bayes classifier. Based on the outcomes, it can be concluded that the largest sentiment category is neutral, but the dominant class is greatly influenced by which political actor is represented in the given text. The work revealed that election day had an impact on politicians’ connotation in the media: most opposition politicians appeared in a more negative light in the opposition media after the vote than before. For some parties, polls and SA show a similar tendency.

The accuracy of the models could be further enhanced by including other features (namely topics, n-grams, and article authors), a larger training set, and a more comprehensive sentiment dictionary.

Keywords: elections, text mining, sentiment analysis, polls, machine learning, Naive Bayes classifier