Data Science in Social Research

One challenge of applying data analytics in sociology is that data science has been institutionalized outside of sociology, whereas sociology's expertise has traditionally rested on its own research methods. Another challenge is epistemological: it relates to the noisiness and validity of digital data and to the question of explanation and causation, which is highly important for sociology. These challenges form the background of the tension between Big Data based findings about society and the sociological skepticism that questions the potential of this kind of knowledge production. The challenges can be addressed by redefining the methodological basis of sociological research through the organic incorporation of data science know-how into its own methods. The solution also requires combining qualitative and quantitative analytical approaches and using knowledge-driven science instead of the data-driven approach.

Previous results

Our research stream is motivated by a continuously growing social science interest in data science. Take automated text analytics as an example: the following figure shows that its popularity has been growing continuously in recent years, both in general and in each discipline investigated (to access publication data we used Dimensions). Each trend line keeps rising even after normalizing for the total number of publications in the discipline. The topic’s percentage share increased faster in sociology than in the sciences in general. In summary, automated text analytics is becoming an increasingly recognized approach in sociology.
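The normalization mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used with the Dimensions data: the function names and all counts are hypothetical, chosen only to show how a topic's yearly share of all publications in a discipline could be computed and checked for growth.

```python
def topic_share_by_year(topic_counts, total_counts):
    """Return the topic's percentage share of all publications, per year."""
    shares = {}
    for year in sorted(topic_counts):
        total = total_counts.get(year, 0)
        if total <= 0:
            # No usable denominator for this year; skip it rather than guess.
            continue
        shares[year] = 100.0 * topic_counts[year] / total
    return shares


def is_persistently_growing(shares):
    """True if the share increases from every year to the next."""
    values = [shares[year] for year in sorted(shares)]
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# Made-up counts for illustration only (NOT Dimensions data).
text_analytics_papers = {2015: 20, 2016: 33, 2017: 52}
all_sociology_papers = {2015: 10000, 2016: 11000, 2017: 13000}

shares = topic_share_by_year(text_analytics_papers, all_sociology_papers)
growing = is_persistently_growing(shares)
```

Dividing by the yearly total is what separates a genuine rise in interest from the general growth of the publication volume: a raw count can increase simply because more papers are published overall.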

Discursive framing of depression in online health communities

Depression is a disease of modernity, where societies impose increased responsibility on the individual, while the individual does not have the opportunity to change his or her circumstances (Sik 2018). In this sense, the problem of depression is embedded into the more general problem of the distortion of social integration.
A current question in sociology is how mental disorders are framed by health professionals and by the patients themselves. A related question is how psychotherapists transform social suffering into suffering related to the self (see e.g. Flick, 2016).

Previous research in this field has been primarily qualitative. Investigators have used qualitative content analysis of offline texts (personal diaries, letters, interviews) to investigate the framing of depression (e.g. Riskind et al., 1989). We believe that automated text analysis methods can usefully be applied to investigate the framing of depression in online, patient-generated, non-clinical texts.

We investigate the potential of NLP techniques for understanding the individual framing of depression in online health communities. The framing of depression is a social construction: it defines the meaning of depression, gives a causal explanation of it, and can even determine treatment preferences. The current clinical explanations of depression point to biological, psychological and social discourses (e.g. Comer, 2015).

We classify forum posts into three framing types by applying different supervised learning algorithms, and then examine the distribution and mixture patterns of the framing types, the contextual, linguistic and topical factors that influence them, and the dynamics of these features. We address the following questions: How are the three main types of framing distributed? In what pattern are they mixed with each other? What contextual factors (type of forum, communicative behavior of the author, etc.) influence which framing type is used?
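To make the classification step concrete, here is a minimal sketch of one supervised approach. The text does not name the algorithms used, so multinomial Naive Bayes is picked here purely as an illustration. The sketch also assumes that the three framing types correspond to the biological, psychological and social discourses mentioned earlier. The class name, the tokenizer and the example posts are all invented for this illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do far more."""
    return text.lower().split()


class NaiveBayesFramingClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            # Start from the log prior of the framing type.
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy posts; the labels are assumed framing types, not project data.
training_posts = [
    ("my serotonin levels are low and the medication helps", "biological"),
    ("the doctor said it is a chemical imbalance in the brain", "biological"),
    ("negative thoughts and low self esteem keep coming back", "psychological"),
    ("therapy taught me to notice my thinking patterns", "psychological"),
    ("i lost my job and feel isolated from everyone", "social"),
    ("money problems and loneliness after the divorce", "social"),
]
texts, labels = zip(*training_posts)
classifier = NaiveBayesFramingClassifier().fit(texts, labels)
predicted = classifier.predict("i feel isolated after losing my job")
```

A classifier of this kind assigns exactly one framing type per post. Studying how framings mix within a post would instead require per-class probabilities or a multi-label setup, which this sketch does not cover.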

Corruption in Online Editorial Media

In recent years, members of our research group have published several studies on corruption in leading Hungarian and international journals. These studies were based on survey data. Turning to non-survey-based methods, our research team conducted two case studies using NLP methods in corruption research in 2018-2019. The first case study uses the author-topic model. Using the corpus collected by K-Monitor, we identified 25 corruption topics and analysed how corruption was thematized on different websites at different times.

In the second case study, we focused on temporal changes in the topics of corruption, again on Hungarian online news sites. We used a dynamic topic model to analyse the K-Monitor corpus. Based on 26,000 articles, we analyzed the changes in the popularity and content of typical corruption topics for the period 2007-2018. The model yielded seven well-separated topics. Our study is currently under review in a leading Hungarian sociology journal.
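The idea of tracking topic popularity over time can be illustrated with a small aggregation step. This sketch does not implement a dynamic topic model. It only assumes that some topic model has already assigned each article a vector of topic weights, and it averages those weights per year. The function name, the two-topic setup and the numbers are hypothetical.

```python
from collections import defaultdict


def mean_topic_share_by_year(documents, num_topics):
    """Average each topic's weight over the documents of every year.

    `documents` is an iterable of (year, topic_weights) pairs, where
    topic_weights holds one non-negative weight per topic.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for year, weights in documents:
        if len(weights) != num_topics:
            raise ValueError("each document needs one weight per topic")
        for topic, weight in enumerate(weights):
            sums[year][topic] += weight
        counts[year] += 1
    return {
        year: [total / counts[year] for total in sums[year]]
        for year in sorted(sums)
    }


# Invented topic weights for three articles (NOT results from the corpus).
articles = [
    (2007, [0.8, 0.2]),
    (2007, [0.6, 0.4]),
    (2018, [0.1, 0.9]),
]
popularity = mean_topic_share_by_year(articles, num_topics=2)
```

In this toy input the first topic dominates in 2007 and the second in 2018. Changes in the content of a topic, as opposed to its popularity, cannot be read off such averages and require the word distributions of the model itself.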

Our previous studies are mainly descriptive; they can serve as a basis for further research. In addition to the empirical analysis, we systematically address the question of what NLP methods can contribute to corruption research.

We examine the framework for defining corruption, the possibilities of automatically processing large amounts of text in corruption research, and the data analysis and data processing technologies based on them.

As part of the educational activity related to the project, K-Monitor also provided a corpus for a data-based hackathon for students, organized with K-Monitor and Precognox, so that students could analyze the data used by our research team.

Online Antisemitism

The level of antisemitism in Hungary has always been among the highest in Europe. Representative surveys show that approximately 33 to 40 per cent of the Hungarian population is antisemitic. Although there has been some fluctuation, the level of antisemitism has remained quite stable. Moreover, we found, based on representative surveys among Hungarian Jews, that although the proportion of those who had experienced or witnessed antisemitic acts in the year prior to the survey decreased massively, from 79 to 58 per cent between 1999 and 2017, the perception of antisemitism severely deteriorated. While in 1999, 37 per cent of Jews thought that antisemitism was strong or very strong in Hungary, in 2017, 65 per cent said the same. This large discrepancy between experience and perception is due to several factors, one of them being the spread of online hatred. This makes the analysis of online sources necessary.

The vast amount of unstructured online textual data demands new tools for its examination, one of them being Natural Language Processing (NLP). NLP is an interdisciplinary field of research at the intersection of computer science, artificial intelligence, and linguistics. In our research, we apply NLP to a massive corpus of recent Hungarian news articles, social media content, and online forum comments. NLP makes possible not only the examination of the structure, main topics, and actors of overt antisemitism but also the identification of the underlying subjects and specificities of latent antisemitism.

The layers of political public sphere in Hungary (2001–2020)

A sociological analysis of the official, media-based and lay online public sphere using automated text analytics and critical discourse analysis

A research project supported by NKFIH (National Research, Development and Innovation Office) (K-134428)

Period of support: December 2020-December 2023

Principal Investigator: Renáta Németh

Participants: Ildikó Barna, Péter Csigó, Domonkos Sik (senior researchers), Jakab Buda, Eszter Katona, Árpád Knap, Márton Rakovics, Zsófia Rakovics, Emese Tóth (junior researchers)


The public sphere is the cornerstone of modern representative democracies: it is responsible not only for providing voters with the information necessary for a deliberate vote but also for keeping the administrative system in check, from a moral as well as a legal standpoint. In this sense, the prospect of averting the potential distortions and crises that may emerge in democratic systems depends on the quality of the public sphere (Habermas 1975, 1998). The emergence of the online public sphere overlaps with several waves of significant political transformations and reconfigurations of the political field in Hungary. Therefore, Hungary is a particularly rich context for this research.

The overarching aim of this research is to map the Hungarian online political public sphere since the early 2000s. The transformations of the political and public sphere outline the substantive framework of our research. We plan to analyze the different layers of political discourse, including the official channels of communication (e.g., parliamentary speech), the various types of political press (e.g., online press, news portals, tabloids), and user-generated content (online comments, forums, blogs and public Facebook posts). We intend to analyze not only the inner discursive content and dynamics of these layers but also the interactions between them. Moreover, we plan to triangulate these discursive processes with existing polling data to gain a deeper understanding of the interactions between political discourse and public opinion.

We plan to consider the content of the discourses (topics discussed) as well as the language usage and framing. We will identify discursive locations where language polarization appears, in order to describe its linguistic features and explain its mechanisms. We will scrutinize the connection between the manifest and the latent opinion climate, the former represented by the political discourse and the latter by polling data. We will focus not only on their similarities but also on their discrepancies.

Digital data produced in online public spheres are primarily textual. Such data require analytical tools that became accessible only recently, with the emergence of Natural Language Processing (NLP), a field capable of processing large-scale textual data in a systematic, automated way. These innovative tools provide results of suitable depth for sociology (Németh and Koltai, 2020). Sociology will exploit the potential of these changes if it can renew its research culture while preserving its critical reflections. Hence our mission was to plan research that shows how NLP can be integrated organically into the toolbox of traditional sociological methods. To reach this aim, we plan to combine automated text analytics not only with qualitative discursive analysis but also with traditional quantitative statistical methods.


Habermas, Jürgen. 1975. Legitimation Crisis. Boston: Beacon Press.

Habermas, Jürgen. 1998. Between Facts and Norms: Contributions to a Discourse Theory of Law and Democracy. Cambridge: Polity Press.

Németh, Renáta, and Júlia Anna Koltai. 2020. “Sociological knowledge discovery through text analytics”. In Pathways between Social Science and Computational Social Science – Theories, Methods and Interpretations, edited by Rudas Tamás, Péli Gábor. Springer.

Related doctoral research

Sociological study of language change and polarization

Doctoral student: Zsófia Rakovics

Advisors: Renáta Németh, PhD and Domonkos Sik, PhD

Analysing the discourse of sustainability in the triad of political publicity, online media platforms and the lay public

Doctoral student: Emese Tóth

Advisor: Balázs János Kocsis, PhD

Explainable Neural Language Models and their Application in Social Sciences

Doctoral student: Jakab Buda

Advisor: Renáta Németh, PhD

Digital Lens

Our research group, Revisiting Early Testimonies of Hungarian Jewish Holocaust Survivors through a Digital Lens (Digital Lens for short), was established in 2021. The main objective of our research is the quantitative “automated” and qualitative analysis of protocols made in 1945 by the National Committee for the Care of Deportees (DEGOB), which are the testimony transcripts of previously deported Hungarian Holocaust survivors. In addition to a more precise historical understanding of the DEGOB committee itself, our textual analyses aim to reveal the most important features of the language used by Jewish survivors, the topography of persecution and survival, and the typically gendered experiences.

The Digital Lens research group is engaged in interdisciplinary historical and social history research. The research group works with innovative methods of digital history and computational social science, complementing rather than excluding traditional methods. We have been reading and preparing the protocols made in 1945 by the National Committee for the Care of Deportees (DEGOB), which contain the deportation itineraries of Holocaust survivors and their interviewed recollections. The aim of our project is to analyse the protocols using a new and innovative methodology. In addition to traditional qualitative and quantitative methods, automated text analysis, artificial intelligence and visualisations play an important role in our research.

Our research questions relate to the language of the Holocaust, the topography of persecution, and the different experiences of men and women. The main questions of language and the Holocaust include what the main features of the language used by Jewish Holocaust survivors in the interviews are and how survivors talk about what happened immediately after liberation. Are there differences between different survivor groups? How does the language used by survivors compare with the public discourse of the time, such as the language used in the press of the time?

The other strand of our research interest is the topography of persecution and survival. Where were the survivors deported from? Where were they located during the Holocaust? What characterized the post-liberation period? By what route and how did they return to Hungary?

Our research team focuses on gendered experiences as well. We are interested to know what the different and similar experiences of women and men were. Do men and women tell different stories about their suffering? What differences can be inferred from the different topographical experiences of women and men?

Our research team is also exploring new historical material. The collection of the protocols is not complete, so our aim is to find additional records and documents, either in archives or in family collections. We believe it is important to personalise history and to this end we will seek out survivors and their families who shaped the life and work of DEGOB.

Our results

(#1) Lecture 11 March 2021: History in the home office, National Rabbinical School – Jewish University, Budapest (Ildikó Barna and Alexandra M. Szabó)

(#2) Conference 17 November 2021: Vienna Wiesenthal Institute for Holocaust Studies: Precarious Archives, Precarious Voices Expanding Jewish Narratives from the Margins. Ildikó Barna; Alexandra M. Szabó: Excavating Voices in a Cross-Archival Approach: DEGOB Testimonies Aligning to ITS Documentation

(#3) Presentation: 14 December 2021: Modern Jewish History Seminar. Ildikó Barna: The DEGOB Collection Through a Digital Lens

(#4) Publication: Alexandra M. Szabó: Discovery of an Unknown Holocaust Testimony. Eastern European Holocaust Studies Interdisciplinary Journal of the BYHMC. Under publication

EuMePo Jean Monnet Network on memory politics

The EuMePo Jean Monnet Network is a research project between 2019 and 2023 funded by the European Union and the Konrad Adenauer Foundation, involving researchers from the University of Victoria, Canada, Adam Mickiewicz University, Poland, the Institute for Political Studies (IEP) at the University of Strasbourg, France, and the Research Center for Computational Social Science at the ELTE Faculty of Social Sciences. As an international collaboration with researchers from Canada, France, and Poland, our aim is to study and analyze the traumas of the 20th century and contemporary memory politics. The EuMePo Jean Monnet Network aims to develop a long-term, transatlantic collaboration based on the study of populist narratives and memory politics practices. Its research on collective memory in Europe aims to understand the roots of today’s memory politics practices and to describe the mechanisms of contemporary populist-nationalist political parties. With the help of the research, we can learn in depth not only about the specificities of Hungarian memory politics, but also about the memory politics practices of Polish, French, German and Canadian societies, and the historical elements that still live in the collective consciousness of the communities.

In addition, the EuMePo Jean Monnet Network team aims to make academic work accessible to a wider audience and to develop communication between the academic community and society. To this end, our joint work will focus not only on producing peer-reviewed publications but also, among other things, on creating a tutorial booklet for secondary school students and making accessible science education videos on various topics.

Each research team participating in the Research Group works according to its own methodological principles, but the final product is a combination of these methods and theoretical approaches. In our analyses, we, the team of the RC2S2, rely primarily on NLP (Natural Language Processing) methods, and we aim to provide insights into the practices of memory politics in Hungary by analyzing and processing large text corpora. Our work has dealt with the two world wars, fascism and the Holocaust, as well as the “legacy” of the communist period and the narrative around these themes.

Detailed information about the results, recordings of the webinars organized by the research group, and academic materials are available on the official website of EUCAnet.


Barna Ildikó, Knap Árpád: Analysis of the Thematic Structure and Discursive Framing in Articles about Trianon and the Holocaust in the Online Hungarian Press Using LDA Topic Modelling. Nationalities Papers pp. 1-19. 19 p. (2022)

Knap Árpád, Bartha Diána, Barna Ildikó: Trianon és holokauszt emlékezetpolitikai jellegzetességeinek elemzése természetesnyelv feldolgozás használatával. Szociológiai Szemle 31:4 pp. 28-62. 35 p. (2021)