Data Science in Social Research

One challenge of applying data analytics in sociology is that data science has become institutionalized outside the discipline, whereas the expertise of sociology has traditionally rested on its own research methods. A second challenge is epistemological: it concerns the noisiness and validity of digital data and the question of explanation and causation, which is highly important for sociology. These challenges form the background of the tension between Big Data based findings about society and the sociological skepticism that questions the potential of this kind of knowledge production. They can be addressed by redefining the methodological basis of sociology and organically incorporating data science know-how into its own methods. The solution also requires combining qualitative and quantitative analytical approaches, and practicing knowledge-driven rather than data-driven science.

Previous results

Our research stream is motivated by the continuously growing interest of the social sciences in data science. Take automated text analytics as an example: the following figure shows that its popularity has grown continuously in recent years, both in general and in each discipline investigated (we used Dimensions to access publication data). Each trend line keeps rising even after normalizing for the total number of publications in the discipline. The topic's percentage share in sociology increased faster than in the sciences in general. In summary, automated text analytics is becoming an increasingly recognized approach in sociology.

Discursive framing of depression in online health communities

Depression is a disease of modernity, in which societies impose increased responsibility on the individual while the individual lacks the opportunity to change his or her circumstances (Sik 2018). In this sense, the problem of depression is embedded in the more general problem of distorted social integration.
A current question in sociology is how mental disorders are framed by health professionals and by the patients themselves. A related question is how psychotherapists transform social suffering into suffering related to the self (see e.g. Flick, 2016).

Previous research in this field has been primarily qualitative. Investigators have used qualitative content analysis of offline texts (personal diaries, letters, interviews) to investigate the framing of depression (e.g. Riskind et al., 1989). We believe that automated text analysis methods can be of significant use in investigating the framing of depression in online, patient-generated, non-clinical texts.

We investigate the potential of NLP techniques for understanding the individual framing of depression in online health communities. The framing of depression is a social construction: it defines the meaning of depression, offers a causal explanation for it, and can even determine treatment preferences. Current clinical explanations of depression draw on biological, psychological and social discourses (e.g. Comer, 2015).

We classify forum posts into three framing types using several supervised learning algorithms, then examine the distribution and mixture patterns of the framing types, the contextual, linguistic and topical factors that influence them, and the dynamics of these features. We address the following questions: How are the three main types of framing distributed? In what patterns are they mixed with each other? Which contextual factors (type of forum, communicative behavior of the author, etc.) influence which framing type is used?
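
As a rough illustration of the supervised classification step described above, the following sketch trains a tiny multinomial Naive Bayes classifier in plain Python. Naive Bayes is only one possible choice, picked here for brevity; the text does not specify which algorithms were used. The label names (assumed to mirror the biological, psychological and social discourses mentioned earlier), the example posts and all function names are invented for illustration and are not taken from the project's data or code.

```python
import math
from collections import Counter, defaultdict

# Invented toy posts with assumed framing labels (not project data).
TRAINING_POSTS = [
    ("my serotonin levels are low and the medication helps", "biological"),
    ("the doctor says it is a chemical imbalance in the brain", "biological"),
    ("negative thoughts and low self esteem keep me down", "psychological"),
    ("therapy taught me to cope with my negative thinking", "psychological"),
    ("i lost my job and feel isolated from everyone", "social"),
    ("money problems and loneliness make everything worse", "social"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(labeled_posts):
    """Count documents and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_posts:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAINING_POSTS)
print(classify("the medication changed my brain chemistry", model))
```

In a real study the training set would consist of manually annotated forum posts, and the classifier would be evaluated on held-out data before its predictions were used to study distribution and mixture patterns.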

Corruption in Online Editorial Media

In recent years, members of our research group have published several studies on corruption in leading Hungarian and international journals. These studies were based on survey data. Turning to non-survey-based methods, our research team conducted two case studies using NLP methods in corruption research in 2018-2019. The first case study uses the author-topic model. Using the corpus collected by K-Monitor, we identified 25 corruption topics and analyzed how corruption was thematized on different websites at different times.

In the second case study, we focused on temporal changes in corruption topics, again on Hungarian online news sites. We applied a dynamic topic model to the K-Monitor corpus. Based on 26,000 articles, we analyzed changes in the popularity and content of typical corruption topics over the period 2007-2018. The model yielded seven well-separated topics. Our study is currently under review in a leading Hungarian sociology journal.
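
The popularity side of such a temporal analysis can be sketched in a few lines. The example below does not fit a dynamic topic model; it assumes that each article has already been assigned a dominant topic by some fitted model, and only shows how yearly topic shares could then be tabulated. The years, topic ids and the function name are hypothetical and do not come from the K-Monitor corpus.

```python
from collections import Counter, defaultdict

# Hypothetical (publication year, dominant topic id) pairs, one per article.
# Assumed to be the output of an already fitted topic model.
ARTICLE_TOPICS = [
    (2007, 0), (2007, 1), (2007, 0),
    (2012, 1), (2012, 1), (2012, 2), (2012, 0),
    (2018, 2), (2018, 2), (2018, 1),
]


def topic_shares_by_year(assignments):
    """Return {year: {topic: share of that year's articles}}."""
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1
    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {
            topic: n / total for topic, n in sorted(counts[year].items())
        }
    return shares


yearly_shares = topic_shares_by_year(ARTICLE_TOPICS)
for year, shares in yearly_shares.items():
    print(year, shares)
```

Changes in topic content, the other aspect a dynamic topic model captures, would additionally require comparing each topic's word distribution across time slices, which this sketch leaves out.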

Our previous studies are mainly descriptive and can serve as a basis for further research. In addition to the empirical analysis, we systematically address the question of what NLP methods can contribute to corruption research.

We examine the framework for defining corruption, the possibilities of automated processing of large volumes of text in corruption research, and the data analysis and data processing technologies built on them.

As part of the educational activities related to the project, K-Monitor also contributed a corpus to a data-based hackathon for students, organized jointly with K-Monitor and Precognox, so that students could analyze the same data our research team used.

Online Antisemitism

The level of antisemitism in Hungary has always been among the highest in Europe. Representative surveys show that approximately 33 to 40 per cent of the Hungarian population is antisemitic. Although there has been some fluctuation, the level of antisemitism has remained quite stable. Moreover, based on representative surveys among Hungarian Jews, we found that although the proportion of those who had experienced or witnessed antisemitic acts in the year prior to the survey decreased massively, from 79 to 58 per cent between 1999 and 2017, the perception of antisemitism severely deteriorated. While in 1999, 37 per cent of Jews thought that antisemitism was strong or very strong in Hungary, in 2017, 65 per cent said the same. This large discrepancy between experience and perception is due to several factors, one of them being the spread of online hatred. This makes the analysis of online sources necessary.

Examining the vast amount of unstructured online textual data demands new tools, one of which is Natural Language Processing (NLP). NLP is an interdisciplinary field of research at the intersection of computer science, artificial intelligence and linguistics. In our research, we apply NLP to a massive corpus of recent Hungarian news articles, social media content, and online forum comments. NLP makes it possible not only to examine the structure, main topics, and actors of overt antisemitism but also to identify the underlying subjects and specificities of latent antisemitism.

The layers of political public sphere in Hungary (2001–2020)

A sociological analysis of the official, media-based and lay online public sphere using automated text analytics and critical discourse analysis

A research project supported by NKFIH (National Research, Development and Innovation Office) (K-134428)

Period of support: December 2020-December 2023

Principal Investigator: Renáta Németh

Participants: Ildikó Barna, Péter Csigó, Domonkos Sik (senior researchers), Jakab Buda, Eszter Katona, Árpád Knap, Márton Rakovics, Zsófia Rakovics, Emese Tóth (junior researchers)


The public sphere is the cornerstone of modern representative democracies: it is responsible not only for providing voters with the information necessary for a deliberate vote but also for keeping the administrative system in check, from both a legal and a moral standpoint. In this sense, the prospect of averting the potential distortions and crises that may emerge in democratic systems depends on the quality of the public sphere (Habermas 1975, 1998). The emergence of the online public sphere overlaps with several waves of significant political transformation and reconfiguration of the political field in Hungary. Therefore, Hungary is a particularly rich context for this research.

The overarching aim of this research is to map the Hungarian online political public sphere since the early 2000s. The transformations of the political and public sphere outline the substantive framework of our research. We plan to analyze the different layers of political discourse, including the official channels of communication (e.g., parliamentary speech), the various types of political press (e.g., online press, news portals, tabloids), and user-generated content (online comments, forums, blogs and public Facebook posts). We intend to analyze not only the inner discursive content and dynamics of these layers but also the interactions between them. Moreover, we plan to triangulate these discursive processes with existing polling data to gain a deeper understanding of the interactions between political discourse and public opinion.

We plan to consider the content of the discourses (the topics discussed) as well as language usage and framing. We will identify discursive locations where language polarization appears, in order to describe its linguistic features and explain its mechanisms. We will scrutinize the connection between the manifest and the latent opinion climate, the former represented by political discourse and the latter by polling data. We will focus not only on their similarities but also on their discrepancies.

Digital data produced in online public spheres are primarily textual. Such data require analytical tools that became accessible only recently, with the emergence of Natural Language Processing (NLP), a field capable of processing large-scale textual data in a systematic, automated way. These innovative tools provide results of suitable depth for sociology (Németh and Koltai, 2020). Sociology will be able to exploit the potential of these changes if it renews its research culture while preserving its critical reflection. Hence our mission was to design a research project that shows how NLP can be organically integrated into the toolbox of traditional sociological methods. To this end, we plan to combine automated text analytics not only with qualitative discourse analysis but also with traditional quantitative statistical methods.


Habermas, Jürgen. 1975. Legitimation Crisis. Boston: Beacon Press.

Habermas, Jürgen. 1998. Between Facts and Norms: Contributions to a Discourse Theory of Law and Democracy. Cambridge: Polity Press.

Németh, Renáta, and Júlia Anna Koltai. 2020. “Sociological knowledge discovery through text analytics.” In Pathways between Social Science and Computational Social Science – Theories, Methods and Interpretations, edited by Tamás Rudas and Gábor Péli. Springer.


Csomor, Gábor ; Simonovits, Borbála ; Németh, Renáta: Hivatali diszkrimináció?: Egy online terepkísérlet eredményei [Discrimination at local governments? Results of an online field experiment] SZOCIOLÓGIAI SZEMLE 31 : 1 pp. 4-28. , 25 p. (2021)

Katona, Eszter ; Németh, Renáta: Automatizált szöveganalitika a korrupció kutatásában [Computational text analytics in corruption research] SOCIO.HU : TÁRSADALOMTUDOMÁNYI SZEMLE 11 : 1 pp. 108-124. , 17 p. (2021)

Related doctoral research

Sociological study of language change and polarization

Doctoral student: Zsófia Rakovics

Advisors: Renáta Németh, PhD and Domonkos Sik, PhD


Analysing the discourse of sustainability in the triad of political publicity, online media platforms and the lay public

Doctoral student: Emese Tóth

Advisor: Balázs János Kocsis, PhD