Projects – ELTE Research Center for Computational Social Science

Data Science in Social Research

One of the challenges of applying data analytics in sociology is that data science has been institutionalized outside sociology, whereas sociology has traditionally built its expertise on its own research methods. Another challenge is epistemological: it relates to the noisiness and validity of digital data and to the question of explanation and causation, which is highly important for sociology. These challenges form the background of the tension between Big Data based findings about society and the sociological skepticism that questions the potential of this form of knowledge production. The challenges can be addressed by redefining the methodological basis of sociology through the organic incorporation of data science know-how into its own methods. The solution also requires the combined application of qualitative and quantitative analysis, and the use of knowledge-driven science instead of a data-driven approach.

Previous results

Our research stream is motivated by a continuously growing social science interest in data science. Automated text analytics is one example: the following figure shows that its popularity has been growing continuously in recent years, both in general and in each discipline investigated (to access publication data we used Dimensions, https://dimensions.ai). Each trend line rises persistently even after normalizing for the total number of publications in the discipline. The topic's share of publications in sociology increased faster than in the sciences in general. In summary, automated text analytics is becoming an increasingly recognized approach in sociology.

Discursive framing of depression in online health communities

Depression is a disease of modernity, where societies impose increased responsibility on the individual, while the individual does not have the opportunity to change his or her circumstances (Sik 2018). In this sense, the problem of depression is embedded into the more general problem of the distortion of social integration.
A current question in sociology is how mental disorders are framed by health professionals and by the patients themselves. A related question is how psychotherapists transform social suffering into suffering related to the self (see e.g. Flick, 2016).

Previous research in this field has been primarily qualitative. Investigators have used qualitative content analysis of offline texts (personal diaries, letters, interviews) to investigate the framing of depression (e.g. Riskind et al., 1989). We believe that there is significant research utility in applying automated text analysis methods to investigate the framing of depression in online, patient-generated, non-clinical texts.

We investigate the potential of NLP techniques for understanding the individual framing of depression in online health communities. The framing of depression is a social construction: it defines the meaning of depression, gives a causal explanation of it and can even determine treatment preferences. The current clinical explanations of depression point to biological, psychological and social discourses (e.g. Comer, 2015).

Forum posts are classified into three framing types by applying different supervised learning algorithms. We then examine the distribution and mixture patterns of the framing types, the contextual, linguistic and topical factors that influence them, and the dynamics of these features. We address the following questions: How are the three main types of framing distributed? In what pattern are they mixed with each other? What contextual factors (type of forum, communicative behavior of the author, etc.) influence which framing type is used?

Corruption in Online Editorial Media

In recent years, members of our research group have published several studies on corruption in leading Hungarian and international journals. These studies were based on survey data. Turning to non-survey-based methods, our research team conducted two case studies using NLP methods in corruption research in 2018-2019. The first case study uses the author-topic model. Using the corpus collected by K-Monitor, we identified 25 corruption topics and analysed how corruption was thematized on different websites at different times.

In the second case study, we focused on temporal changes in corruption topics, again on Hungarian online news sites. We applied a dynamic topic model to the K-Monitor corpus. Based on 26,000 articles, we analyzed the changes in the popularity and content of typical corruption topics for the period 2007-2018. The model yielded seven well-separated topics. Our study is currently under review in a leading Hungarian sociology journal.

Our previous studies are mainly descriptive and can serve as a basis for further research. In addition to the empirical analysis, we systematically address the question of what NLP methods can contribute to corruption research.

We examine the framework for defining corruption, as well as the possibilities for the automated processing of large amounts of text in corruption research and the data analysis and data processing technologies built on it.

As part of the educational activity related to the project, K-Monitor also provided a corpus for a data hackathon organized for students in cooperation with K-Monitor and Precognox, so that students could analyze the data used by our research team.

Online Antisemitism

The level of antisemitism in Hungary has always been among the highest in Europe. Representative surveys show that approximately 33 to 40 per cent of the Hungarian population is antisemitic. Although there has been some fluctuation, the level of antisemitism has remained quite stable. Moreover, we found, based on representative surveys among Hungarian Jews, that although the proportion of those who had experienced or witnessed antisemitic acts in the year prior to the survey decreased massively, from 79 to 58 per cent between 1999 and 2017, the perception of antisemitism deteriorated severely. While in 1999, 37 per cent of Jews thought that antisemitism was strong or very strong in Hungary, in 2017, 65 per cent said the same. This large discrepancy between experience and perception is due to several factors, one of them being the spread of online hatred. This makes the analysis of online sources necessary.

Because of the vast amount of unstructured online textual data, its examination demands new tools, one of them being Natural Language Processing (NLP). NLP is an interdisciplinary field of research at the intersection of computer science, artificial intelligence and linguistics. In our research, we apply NLP to a massive corpus of recent Hungarian news articles, social media content, and online forum comments. NLP makes it possible not only to examine the structure, main topics and actors of overt antisemitism but also to identify the underlying subjects and specificities of latent antisemitism.

The layers of political public sphere in Hungary (2001–2020)

A sociological analysis of the official, media-based and lay online public sphere using automated text analytics and critical discourse analysis

A research project supported by NKFIH (National Research, Development and Innovation Office) (K-134428)

Rating of the final project report: 10 (excellent). Excerpt from the evaluation: ‘The research as a whole produced significant methodological results, with participants developing and combining text analysis solutions based on the latest NLP developments. The results of the research can be used for further specific research in a wide range of social sciences as well as in the business domain.’

Period of support: December 2020-December 2023

Date of the report: 20 December 2023

Principal Investigator: Renáta Németh

Participants: Ildikó Barna, Jakab Buda, Eszter Katona, Árpád Knap, Tibor Pólya (HUN-REN TTK), Márton Rakovics, Zsófia Rakovics, Domonkos Sik, Emese Tóth, Anna Unger

Summary

The public sphere is the cornerstone of modern representative democracies: it is responsible not only for providing voters with the information necessary for a deliberate vote but also for keeping the administrative system in check, not solely from a legal but also from a moral standpoint. In this sense, the prospect of averting the potential distortions and crises that may emerge in democratic systems depends on the quality of the public sphere (Habermas 1975, 1998). The emergence of the online public sphere overlaps with several waves of significant political transformations and reconfigurations of the political field in Hungary. Therefore, Hungary is a particularly rich context for this research.

The research provided a sociological analysis of the public discourse of the last two decades at different levels of the Hungarian public sphere – the official political sphere, the online media and the online lay public – focusing on a few key aspects, mainly based on the automated analysis of large text corpora. We have explored the linguistic representation of political polarization, the discourse of memory politics, collective identity issues and certain public policy topics.

Digital data produced in online public spheres are primarily textual. Such data require analytical tools that became accessible only recently, with the emergence of Natural Language Processing (NLP), a field capable of processing large-scale textual data in a systematic, automated way. These innovative tools provide results of suitable depth for sociology (Németh and Koltai, 2020). Sociology will exploit the potential of these changes if it can renew its research culture while preserving its critical reflections. Hence it was our mission to design a research project that shows how NLP can be integrated organically into the toolbox of traditional sociological methods. To reach this aim, we combined automated text analytics not only with qualitative discursive analysis but also with traditional quantitative statistical methods.

The project used or further developed several NLP tools (structural topic model, biterm topic model, dynamic word embedding, document embedding, keyness analysis) that had not been used, or had been used only occasionally, in Hungarian sociological research. NLP was integrated into the traditional text analytical tools of sociology and combined with qualitative tools. According to our results, these methods can successfully measure political polarization and map the dynamics of relations between actors in the public sphere, the framing of topics in public discourse, changes in framing, and changes in the meaning of certain key concepts.

We consider our research to have been successfully completed in terms of the research results and publications, the innovative methodological approaches implemented, and the new research collaborations established. Below is a summary dated December 2023.

Publications, disseminations

The output of the research is several times higher than what was committed to in the application. 48 scientific publications have been produced (with a total impact factor of 9.6), of which 11 have been published in peer-reviewed journals and 4 are under review. The 11 journal articles comprise 7 international articles (3 D1 and 3 Q2) and 4 national articles; the output also includes 10 international conference presentations. One of the D1 articles of the project (authored by Barna-Knap) won the Polányi Prize for the best sociological article of the year in 2023, awarded by the Hungarian Sociological Association.

The list at the bottom of this page provides only a selection of the four dozen publications (also identifiable in MTMT), with only the final, highest-ranking publication from each sub-project. A special issue of the journal Intersections on the topic of our research (‘Text as data – Eastern and Central European political discourses from the perspective of computational social science’), initiated and partly guest-edited by members of our research team, is scheduled to appear in 2024, with four articles from the research under review.

We have also organised a conference and several conference sessions. In the summer of 2021, Ildikó Barna and Renáta Németh organised a session at the international conference of the ISA RC33 committee (‘Natural Language Processing: a New Tool in the Methodological Tool-Box of Sociology’), where we also presented our research. In October 2023, our members (Zsófia Rakovics, Eszter Katona, Emese Tóth) organized a session at the Hungarian Sociological Society’s Annual Meeting entitled ‘Natural Language Processing in the Social Sciences’, where we also presented our research.

We have reached out to the wider public in various forums, and besides our website and Facebook page, we have held six educational presentations: we gave a presentation and participated in a roundtable at the Night of Researchers, the Mihály Táncsics Talent College and the Róbert Angelusz Social Sciences College, we participated in the ConTEXT business conference, and Márton Rakovics gave a lecture at the invitation of the University of Osijek at the Faculty of Law in September 2023.

Training of young researchers, new scientific collaborations

We were also able to use the research in the training of young researchers: 3 PhD topics were successfully advertised, with Jakab Buda, Zsófia Rakovics and Emese Tóth joining the research; details of their topics can be found below. Research supported by four doctoral and one postdoctoral New National Excellence Program grants (now University Research Scholarship Program, EKÖP), as well as theses and TDK (Scientific Student Council) papers, were linked to the project. We also integrated the research methodology and results into our taught courses.

As the project progressed, collaborations were established that allowed for deeper analysis and opened a new interdisciplinary research direction. We worked with social psychologist Bori Simonovits, political scientists Gábor Simonovits and Anna Unger, human geographer Péter Balogh, religious studies scholar András Máté-Tóth and narrative psychologist Tibor Pólya as co-authors.

Innovative methodological solutions

Another important output of the project is the testing and introduction of innovative methodological approaches. Several approaches and NLP tools were used and partly developed (structural topic model, biterm topic model, dynamic word embedding, document embedding, keyness analysis) that had not been used, or had been used only occasionally, in domestic social research. These are discussed in more detail in the scientific results below.

One of the biggest challenges of the project was the collection of the corpus. Following the basic concept of the research, three levels of the public sphere (political, media and lay public) were distinguished, and the corpora were collected accordingly. The creation of the media corpus was the most human-resource-intensive task, with four Master's students and three junior researchers working on it from the first year of the project, in professional cooperation with the Centre for Digital Humanities (ELTE DH) of Eötvös Loránd University (ELTE). The corpus was built following the methodology developed by Indig and co-authors (Indig et al., 2020), under the guidance of Árpád Knap, one of the authors of the referenced work. A distinctive feature of the corpus construction is that the corpus was carefully curated at the metadata and archival level, which required solving technical challenges such as differing site structures, the filtering of duplicates and articles spread over multiple pages. The task was completed by mid-2022, and the corpus became part of a repository maintained by ELTE DH and accessible for academic research on the Zenodo platform (https://zenodo.org).

To process the corpus, we needed a standardized cleaning and pre-processing pipeline developed for Hungarian. The stages of this process were: character standardization, filtering of the texts according to certain criteria, cleaning and filtering of the words, and formatting and standardization of the words. Several linguistic solutions exist for these tasks in Hungarian; after reviewing them, we created a standard pipeline in Python, hosted on GitHub, to which we provide access on request.
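The four stages listed above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: the function names, the tiny stop-word list, the minimum-length filter and the sample sentences are all assumptions, and a real Hungarian pipeline would add lemmatization at the word-standardization stage.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOPWORDS = {"a", "az", "és", "is", "hogy", "nem"}

# Map typographic quotes and dashes to plain ASCII equivalents.
CHAR_MAP = str.maketrans({"„": '"', "”": '"', "“": '"', "’": "'", "–": "-"})


def standardize_characters(text: str) -> str:
    """Stage 1: Unicode normalization, unified punctuation, collapsed whitespace."""
    text = unicodedata.normalize("NFC", text).translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def keep_text(text: str, min_tokens: int = 3) -> bool:
    """Stage 2: document-level filter (here: drop very short texts)."""
    return len(text.split()) >= min_tokens


def clean_tokens(text: str) -> list[str]:
    """Stage 3: keep alphabetic tokens, drop stop words and one-letter tokens."""
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if len(t) > 1 and t.lower() not in STOPWORDS]


def standardize_tokens(tokens: list[str]) -> list[str]:
    """Stage 4: word standardization (lowercasing only; lemmatization omitted)."""
    return [t.lower() for t in tokens]


def preprocess(texts: list[str]) -> list[list[str]]:
    """Run all four stages and return one token list per retained text."""
    result = []
    for raw in texts:
        text = standardize_characters(raw)
        if keep_text(text):
            result.append(standardize_tokens(clean_tokens(text)))
    return result


sample = ["A  Parlament „új”   törvényt fogadott el.", "Rövid."]
processed = preprocess(sample)
print(processed)
```

Keeping each stage as a separate function makes it easy to swap in a proper Hungarian lemmatizer or a different filter without touching the rest of the pipeline.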

Scientific results

Methodological results

Topic: Using NLP to research political polarisation in general

Related publication: Németh, Renáta (2023): A scoping review on the use of natural language processing in research on political polarization: trends and research prospects. Journal of Computational Social Science

The article provided the methodological basis for the project. It summarised studies published on the topic since 2010 to clarify how the NLP research paradigm conceptualises and operationalises political polarisation, looking for patterns to follow and trying to identify research gaps that our research might aspire to fill.

Topic: How to measure political polarisation? Proposing a linguistically grounded metric

Related publication: Buda Jakab, Németh Renáta, Simonovits Bori, Simonovits Gábor (2022): The language of discrimination: assessing attention discrimination by Hungarian local governments. Language Resources and Evaluation

In our project, we treated polarization as a supervised machine learning problem: we investigated how effectively an author's party affiliation can be predicted from texts such as the speeches of members of parliament belonging to different parties, and this predictive effectiveness also served as a general measure of polarization. In this pilot work, we used the text of municipal office emails (i.e. not yet political texts) written to (putative) Roma and non-Roma clients to show that differences in textual data can be detected automatically without human coding, and that machine learning can detect distinguishing features that human coders might not recognise. Our study also attempted a task of primary importance in polarization research: the interpretation of models, i.e. the identification of the linguistic features on which the algorithm bases the distinction.
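The idea of using classification performance as a polarization measure can be illustrated with a minimal sketch. It does not reproduce the project's models: the classifier (a small multinomial Naive Bayes written from scratch), the leave-one-out evaluation and the toy documents and party labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model from tokenized documents."""
    vocab = {w for d in docs for w in d}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    return vocab, class_docs, word_counts


def predict_nb(model, doc):
    """Return the most probable class, using add-one smoothing."""
    vocab, class_docs, word_counts = model
    n_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in sorted(class_docs):
        total = sum(word_counts[c].values())
        score = math.log(class_docs[c] / n_docs)
        for w in doc:
            if w in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


def loo_accuracy(docs, labels):
    """Leave-one-out accuracy: the share of held-out texts labeled correctly."""
    hits = 0
    for i in range(len(docs)):
        model = train_nb(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict_nb(model, docs[i]) == labels[i]
    return hits / len(docs)


# Invented toy "speeches" from two hypothetical parties with distinct vocabularies.
docs = [
    ["tax", "cut", "growth"], ["tax", "growth", "market"], ["market", "cut", "tax"],
    ["welfare", "health", "care"], ["health", "welfare", "school"],
    ["school", "care", "welfare"],
]
labels = ["A", "A", "A", "B", "B", "B"]

polarization_score = loo_accuracy(docs, labels)
print(polarization_score)
```

The more reliably held-out texts can be attributed to the right party, the more linguistically separated the parties are; accuracy close to chance level would suggest little separation. Interpreting which words drive the prediction is a separate step not shown here.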

Topic: How can changes in the meaning of political expressions be investigated? An NLP-based solution proposal

Related publication: Rakovics Zsófia (2022): Temporal Positive Pointwise Mutual Information (TPPMI) időbeli szóbeágyazási modell alkalmazásában rejlő lehetőségek demonstrálása – A miniszterelnöki beszédek szavainak jelentésváltozása. [Demonstrating the potential of applying the Temporal Positive Pointwise Mutual Information (TPPMI) temporal word-embedding model – The change in meaning of the words in the prime ministers’ speeches] In: Feledy, A. & Egle, B. (Eds.), Van új a nap alatt: Az ELTE Angelusz Róbert Társadalomtudományi Szakkollégium konferenciájának tanulmánykötete [There is something new under the sun: Proceedings of the conference of the Angelusz Róbert College for Advanced Studies in Social Sciences at ELTE].

The author is currently working with Márton Rakovics on an international publication to present the results.

The paper describes the method developed to investigate the changing meanings of political concepts, one of the main issues of our project. It proposes to investigate semantic dynamics quantitatively by means of a temporal word embedding model developed for this purpose.
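To give a feel for how PPMI-based vectors can track meaning change over time, here is a minimal generic sketch. It is not the TPPMI model of the cited paper: the window size, the fixed shared context list, the cosine comparison and the invented toy sentences are all assumptions. The only idea it illustrates is that PPMI vectors computed over the same context words in each period are directly comparable across periods.

```python
import math
from collections import Counter


def cooccurrences(sentences, window=2):
    """Count (word, context) pairs within a symmetric window."""
    pairs = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, sent[j])] += 1
    return pairs


def ppmi_vector(word, pairs, contexts):
    """Positive PMI of `word` with each context word, in a fixed order."""
    total = sum(pairs.values())
    word_tot, ctx_tot = Counter(), Counter()
    for (w, c), n in pairs.items():
        word_tot[w] += n
        ctx_tot[c] += n
    vec = []
    for c in contexts:
        n = pairs.get((word, c), 0)
        if n == 0:
            vec.append(0.0)
        else:
            pmi = math.log2(n * total / (word_tot[word] * ctx_tot[c]))
            vec.append(max(0.0, pmi))
    return vec


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Invented toy corpora for two periods (assumption for illustration).
early = [["nation", "culture", "history"], ["nation", "history", "culture"],
         ["economy", "growth", "market"]]
late = [["nation", "border", "defence"], ["nation", "defence", "border"],
        ["economy", "growth", "market"]]
contexts = ["culture", "history", "border", "defence", "growth", "market"]

early_pairs, late_pairs = cooccurrences(early), cooccurrences(late)
sim_nation = cosine(ppmi_vector("nation", early_pairs, contexts),
                    ppmi_vector("nation", late_pairs, contexts))
sim_economy = cosine(ppmi_vector("economy", early_pairs, contexts),
                     ppmi_vector("economy", late_pairs, contexts))
print(sim_nation, sim_economy)
```

In this toy setup the word whose contexts change between periods gets a low cross-period similarity, while the word used in the same contexts keeps a similarity close to 1.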

Topic: Sociological application challenges of supervised machine learning

Related publication: Németh, Renáta (2021): A felügyelt gépi tanulás kihívásai a szociológiai alkalmazásokban. [The challenges of supervised machine learning in sociological applications] Metszetek – Társadalomtudományi folyóirat, Big Data special issue.

The sociological applications of supervised machine learning, already well demonstrated in industrial/business applications, raise specific questions. The reason for this specificity is that in these applications the algorithm is responsible for learning complex concepts. This paper provides a summary of these challenges and possible solutions.

Topic: the integration of NLP into sociological methodology

Related publication: Németh, Renáta; Koltai, Júlia (2023): Natural language processing: The integration of a new methodological paradigm into sociology. Intersections: East European Journal of Society and Politics

Integrating NLP into sociology faces a number of challenges. NLP has been institutionalised outside sociology, while sociology has built its expertise on its own research methods. Another challenge is epistemological: it relates to the validity of digital data and the different perspectives associated with predictive and causal approaches. In our paper we have offered some possible solutions to these challenges.

Substantive results

In the research, we attempted to map the discourses in the official political, media and social media layers of the Hungarian public sphere between 2000 and 2020 (see figure below).