1  Computational social science (CSS)

1.1 What is computational social science?

1.1.1 The promise of computational social science

Reading accounts of computational social science, one cannot help but feel excitement about its transformational potential for the social sciences.

In one of the foundational articles sketching the outlines and potentials of the emerging subfield of computational social science, David Lazer and colleagues wrote in 2009:

“(…) a computational social science is emerging that leverages the capacity to collect and analyze data with an unprecedented breadth and depth and scale. (…) These vast, emerging data sets on how people interact surely offer qualitatively new perspectives on collective human behavior (…).”

Lazer et al. (2009), p. 722.

More recently, Anastasia Buyalskaya and colleagues reiterate this early optimism:

“Social science is entering a golden age, marked by the confluence of explosive growth in new data and analytic methods, interdisciplinary approaches, and a recognition that these ingredients are necessary to solve the more challenging problems facing our world.”

Buyalskaya et al. (2021), p. 1.

Marc Keuschnigg and colleagues expect that:

“(…) CSS has the potential to accomplish for sociology what the introduction of econometrics did for economics in the past half century, i.e., to provide the relevant analytical tools and data needed to rigorously address the core questions of the discipline. (…) The new CSS-related data sources and analytical tools provide an excellent fit with a sociological tradition interested primarily in the explanation of networked social systems and their dynamics.”

Keuschnigg et al. (2018), p. 8.

These are just three examples, but many accounts of computational social science voice similar optimism about the promise CSS holds for the study of societies and human behavior. These promises usually come in two forms: the first focuses on the increased coverage of social phenomena and human behavior through digital trace data and digital sensors; the second goes further and expects a transformation of the nature of the social sciences.

On the most fundamental level, CSS can be understood as a response to the growing availability of data. The ever more intensive and diverse use of digital technology creates a constantly growing reservoir of data that documents individual behavior and social life in ever higher resolution. Lazer and colleagues described this potential as early as 2009:

“Each of these transactions leaves digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies.”

Lazer et al. (2009), p. 721.

Digital technology provides new types of data, offers new and broader approaches to measuring the world and social life through sensors and devices, and, through digitization, makes vast amounts of previously collected data available and computable that until now could only be analyzed within the limits of their analogue form. The enthusiasm therefore seems more than justified.

First, any interaction of users with online services creates data traces, hence the term digital trace data.1 In principle, digital trace data provide a comprehensive account of user behavior with, and mediated by, digital services. This makes these data highly promising for social scientists interested in behavioral and social phenomena that happen on digital services or are mediated by them. Examples of such phenomena are interaction patterns in political talk online, public interactions with news on digital media, or digital political activism. Additionally, behavioral and social phenomena that are not primarily associated with digital media but are connected to them also become visible in digital trace data. Examples include the analysis of suspected trends in political polarization or extremism based on political talk online or interaction patterns between users on digital media, the mapping of cultural trends based on content on digital media, or discursive power in digital and traditional media.

But, careful! The operative words in the paragraph above are in principle. In practice, most digital traces remain out of reach of most researchers. While digital media companies have access to vast troves of digital traces emerging from the use of their services, researchers only have access to the highly limited slices of these data that companies choose to make available to them. This can happen either through dedicated application programming interfaces (APIs) or through exclusive agreements between companies and select researchers for privileged access. This considerably limits the realization of the promise of digital trace data in the social sciences and raises severe practical and ethical concerns about the use of these data. We will come back to this later.

Second, digital technology also extends the number and reach of sensors measuring the world. These data can emerge as a byproduct of another service, like satellite imagery, or as the output of sensors specifically designed by researchers.2 In principle, this type of data is bound to grow with the availability and wide distribution of Internet of Things devices. Yet this expected wealth of data reinforces important questions about people’s privacy in a world of all-seeing, all-sensing digital devices and about the legitimacy of data access for academics, researchers, and industry.

Finally, digital technology also provides new perspectives and opportunities for work with data available in analogue form. By digitizing existing data sets, researchers can apply new approaches and methods to them.3 This promises new perspectives on old questions by opening these formerly analogue data sets to computational analysis.

This massive increase in the number and diversity of available data sources extends the reach of social scientists. We can expect to cover more social phenomena and more of human behavior in greater detail and wider breadth. This opens a window onto new questions and phenomena and enables us to examine well-known phenomena from a different vantage point. It might also allow social scientists to get a better systems-level view of society and human behavior. These prospects have led some to expect computational social science to contribute to a transformation of the social sciences in general.

For some scholars, the availability of vast data sets documenting human behavior has inspired the hope that the social sciences might transcend their status as a “soft” science and become an “actual” scientific discipline.4 In other words, a discipline with models allowing for the confident prediction of the future. In this view, more data not only increase the coverage of social processes or human behavior but would actually allow for a “measurement revolution” (Watts, 2011) in the social sciences. Thus, social science might transcend its current state of after-the-fact explanation and evolve into a science with true predictive power. This hope rests on a view of society as being shaped by underlying context-independent laws that have mostly remained invisible to scientists because the necessary data could not be acquired until now. As with most ambitious dreams, the realization of this transformation of the social sciences seems far off.5

We can find many studies that illustrate the first promise of computational social science. Increases in data documenting social phenomena and human behavior are significantly extending the toolbox available to social scientists. Here, CSS is proving to be a success and is becoming ever more important as access to data and knowledge about computational methods increase and diffuse among social scientists. The second expectation (which you can take either as a promise or as a threat, depending on your faith-based affiliations) of a transformation of social science into a more strictly predictive science remains unfulfilled as of yet. While the faithful might be tempted to treat this as an indicator that we simply need even more data, it might be more plausible that the nature of the social sciences resists this sort of transformation. The subject of social science is the examination of context-dependent phenomena. This makes prediction in the social sciences an instrument of theory-testing and not an instrument of planning and design, as for example in engineering or physics. While CSS might increase the reach and grasp of social scientists, it does not necessarily make us into socio-physicists, nor is it a tragedy if it doesn’t.

But what is computational social science, besides it providing social scientists with new data?

1.1.2 Computational social science: A definition

While it is true that digitally induced data riches were a decisive factor in the establishment of computational social science, CSS is more than the computational analysis of digital data. Sure, early work in CSS might have spent more time and enthusiasm on counting digital metrics and charting new data sets than strictly necessary. Also, this somewhat limited activity, combined with the hardly contained exuberance of some early proponents of CSS, might have given rise to the caricature of CSS as a somewhat complicated effort at counting social media data. More generally, it is limiting to focus definitions of CSS on specific topical subfields. It is true that much early work in CSS focused on digital communication environments. But this is more an artefact of the early availability and accessibility of data sets documenting user behavior on social media, especially Facebook and Twitter, than a constitutive feature of CSS. Instead, CSS is the scientific examination of society with digital data sets and computational methods. This can extend to the examination of digitally enabled phenomena but does not have to stop there.

For one, far more and more diverse data sets are now available than in the early days of computational social science, ten years ago. As a result, current research in CSS no longer works primarily with social media data, but instead uses far more diverse data sets. Examples include large text corpora documenting news reporting or literature, historical and current parliamentary speeches, as well as image or video data. At the same time, historical data records are increasingly being made digitally accessible and provide rich opportunities in the social sciences. Also, there is growing awareness among practitioners of computational social science of the need to provide stronger connections between CSS studies and social science theory. This holds for connections to established theories as well as for the development of new theoretical accounts.6

In order to characterize computational social science, exclusively data- and method-centric definitions of CSS are therefore too one-sided and consequently outdated. In 2021 Yannis Theocharis and I suggested a definition of CSS, taking current developments into account while also foregrounding what differentiates CSS from other approaches in the social sciences.

Definition: Computational social science

“We define computational social science as an interdisciplinary scientific field in which contributions develop and test theories or provide systematic descriptions of human, organizational, and institutional behavior through the use of computational methods and practices. On the most basic level, this can mean the use of standardized computational methods on well-structured datasets (e.g., applying an off-the-shelf dictionary to calculate how often specific words are used in hundreds of political speeches), or at more advanced levels the development or extensive modification of specific software solutions dedicated to solving analytically intensive problems (e.g., from developing dedicated software solutions for the automated collection and preparation of large unstructured datasets to writing code for performing simulations).” (Theocharis & Jungherr, 2021, p. 4).

In this definition, the specific properties of new data sets take a backseat. Instead, the definition foregrounds theory-driven work with computational methods in the social sciences. At the same time, it recognizes the importance of descriptively oriented work. This is important not least because CSS opens up new types of behavior and phenomena that only arise as a result of digitization or which were previously beyond the grasp of social scientists. Accordingly, there must be room in CSS for first systematically recording and describing new behaviors or phenomena without forcing them hastily into the limits of well-known but possibly unsuited theories.7

The definition also foregrounds an important point of tension in precisely differentiating CSS from other fields in the social sciences. Nearly all contemporary work in the social sciences relies on computational methods and digital or digitized data. This includes the storage and processing of digital data (such as digital text, image, or audio files), computationally assisted data analysis (such as regression analyses), or data collection through digital sensors (such as eye tracking or internet of things enabled devices). In this work, computation is often a necessary precondition. For example, while it is possible to run multiple regressions with pen and paper, the success of this method in the social sciences depends on the digital representation of the underlying data sets and computational resources available to process the data. In the most general reading of the provided definition the use of any computational method in data handling and analysis would qualify as computational social science. One could thus argue that nearly any form of contemporary social science would constitute computational social science. Obviously, this is not helpful in identifying constituting elements of the field and subsequent potentials and challenges.

In talking about CSS specifically, it might be helpful to focus more on studies and research projects in which computational methods and practices are not used as plug-and-play solutions but instead demand varying degrees of customization with regard to data collection, preparation, analysis, or presentation. Again, this is best thought of as a distinction in degree. On one end of the scale, we find projects that require some coding, such as the sequential calling of pre-existing or slightly modified functions or basic data management. On the other end of the scale, we find research projects that demand the development of dedicated software solutions, for example in automated and continuous data collection, preparation and structuring of large unstructured raw data, or the development of dedicated non-standardized analysis procedures. Projects at different ends of this scale share issues arising from their focus on social behavior, systems, or phenomena, but they vary significantly with regard to their computational demands. Projects that use standardized computational methods might thus be basically indistinguishable from other areas in empirical social science research. Projects at the other end of the scale, in contrast, are likely to face challenges indistinguishable from software development in computer science.

Any conceptualization of computational social science should thus not be tied to a specific set of methods, data sets, or research interests. Instead, the constitutive element of CSS, differentiating it from other approaches in the social sciences, is the degree to which research projects demand the inclusion and development of computational methods and practices over the course of a project. At the same time, CSS is a specific subfield of computational research in that it focuses on social systems and phenomena. Consequently, approaches and methods have to account for the specific conditions of this research area.

Computational social science occupies a bridging position between the social sciences, computer science, and related disciplines. This enables researchers to conduct interdisciplinary research into both new and already known social phenomena by combining social science theories and methods with concepts and methods from computer science. In this bridging function, CSS gives the social sciences access to advanced computational approaches and methods, while opening up subjects of study in the social sciences to computer science and related disciplines. In the dialogue between the disciplines, CSS contributes to the institutionalized transfer of knowledge and practices and helps to overcome historically grown barriers between fields. If successful, computational social science does more than just transfer knowledge or methods. It combines theoretical and methodological approaches from related disciplines into viable concepts and research designs and applies them in order to establish scientific knowledge on social phenomena.

1.2 The computational social science project pipeline

Our discussion of computational social science and its promises and challenges has remained rather abstract. It is time to turn to CSS as a practice. For this, let’s have a look at the typical CSS project pipeline. While CSS projects come in a stunning variety of data sets used, methods employed, and questions asked, more often than not, these projects share a pipeline of tasks, problems, and decisions that is typical for CSS. Examining this pipeline allows us to think about engaging in CSS as a practice, while at the same time providing you with a blueprint for potential research projects that might lie in your future.

The typical pipeline for computational social science consists of the following steps:

  • research design,
  • data collection,
  • data preparation,
  • linking signals in data to phenomena of interest,
  • data analysis, and
  • presentation.

Let’s have a look at each of these steps in detail.

1.2.1 Research design

As with any research project in the social sciences, projects in computational social science should start with a research design.8 Researchers must ask themselves how to go about answering a specific question in a reliable, transparent, and inter-subjective way. This can include questions testing a theoretically expected causal mechanism between two phenomena, explorative questions about new phenomena for which no plausible prior theoretical expectations exist, or the systematic description of phenomena or behavior. The nature of the question then dictates the choice of data, method, and process.

To date, some of the greatest successes of computational social science lie in the description of social phenomena and characteristics of groups and individuals.9 The best of these studies showcase the impressive measurement opportunities of CSS, such as the estimation of environmental conditions in hard-to-reach areas based on satellite images, or the interconnection between online communication and outside factors, such as media coverage or external events. So far, CSS as a field has been less interested, and less successful, in connecting findings systematically to theoretical frameworks in the social sciences, advancing new systematic theories, providing explanations or causal mechanisms for identified patterns, or even connecting digital signals robustly to concepts or phenomena of interest. This gap offers interesting opportunities for new research designs.

CSS has to transition from its early stage of producing predominantly isolated empirical findings to a more mature stage in which studies are more consciously connected with theoretical frameworks, allowing the field to speak more actively to the debates in the broader social sciences trying to make sense of underlying phenomena. This might mean treating predominantly diagnostic efforts as only a first step and focusing researchers’ attention more actively on connecting digital signals to meaningful concepts and starting to work on explaining patterns found in data based on causal mechanisms. This might also mean extending concepts and theories currently in use among social scientists for the conditions found in online communication spaces while at the same time remaining mindful of relevant research interests and frameworks in traditional social science.

While most work in computational social science follows predominantly descriptive empirical approaches, such as the analysis of text, image, or behavioral data, there are other approaches that offer different types of insights. One example is experiments.10 By deliberately manipulating actual or simulated digital communication environments, researchers can identify causal effects of specific design decisions or targeted interventions. Even though this approach demands considerable effort in design and implementation, it offers great potential for knowledge.

The need for experimental research designs was recently illustrated by Burton et al. (2021). The authors show that in data-rich contexts, such as those found in work with digital trace data, many different explanatory models fit the data: some that conceivably might be true, others that are obviously meaningless. This raises the danger that, by using purely correlational research designs in CSS, researchers might fool themselves into believing that patterns support their theory of interest while in fact falling for spurious correlations emerging from large and rich data sets. The presence of large data sets makes careful research designs more important, not less.
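The danger of spurious correlations in rich data sets can be made tangible with a small simulation. The following is a minimal sketch under stated assumptions, not a reproduction of the analysis by Burton et al. (2021): the outcome and all predictors are pure random noise, and the function names, sample sizes, and seed are chosen here for illustration only.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_noise_correlation(n_obs, n_predictors, seed=0):
    """Largest absolute correlation between a random outcome and
    n_predictors random, unrelated predictors (all values are noise)."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    strongest = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        strongest = max(strongest, abs(pearson(predictor, outcome)))
    return strongest


# With the same seed, the first five predictors are identical in both runs,
# so the second value can only be as large or larger than the first.
few = strongest_noise_correlation(n_obs=30, n_predictors=5)
many = strongest_noise_correlation(n_obs=30, n_predictors=1000)
print(f"strongest |r| among 5 noise predictors:    {few:.2f}")
print(f"strongest |r| among 1000 noise predictors: {many:.2f}")
```

None of the predictors has anything to do with the outcome, yet the more candidate signals one screens, the stronger the best accidental correlation becomes. A research design that fixes the expected relationship in advance, or tests it experimentally, guards against mistaking such accidents for findings.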

Another alternative method is theory-driven simulation or modeling of social systems or individual behavior.11 This approach has lost relative influence in the course of the rapidly increasing data availability through social media services. Nevertheless, the strongly theory-driven background of this approach offers a promising alternative to the often predominantly data-driven exercises of research based on social media data.

1.2.2 Data collection

After settling on a research design and choosing the appropriate data to answer your question, the fun of getting data starts. It is no accident that the discussion about the alleged wealth of digital data in the social sciences often elegantly skips over questions of whether and how these data can be collected, processed, managed, and made available. In fact, data collection and processing are often the most time-consuming, complicated, and at the same time least visible and most thankless tasks in computational social science.12

Data collection in computational social science has become more complicated over time. This is due both to digital media data becoming more difficult to collect and to rising scientific standards for working with these data. In the early phases of CSS, the topic of data collection often took a backseat, and the procurement of social media data was often enabled by the companies running digital services. Over time, however, social media services have become significantly more restrictive in terms of the data access they allow outsiders. At the same time, there was growing awareness in academia that even the generous provision of social media data via official interfaces only provided a fraction of the data necessary for answering demanding research questions. Additionally, CSS practitioners were challenged on the grounds that their focus on a few well-researched platforms, such as Twitter, would only allow for limited statements about digital communication, human behavior, or societies. This raised the call for more cross-platform research, again raising the demands for data collection and preparation. While these challenges are valid, one cannot help but note that they are often raised by people skeptical of computational work to begin with, if not of quantitative methods in general.

Overall, this means that data collection and processing for CSS projects have become significantly more complicated. Different data sources often have to be monitored continuously over long periods of time. Some of these can be queried via official interfaces, so-called Application Programming Interfaces (APIs) (e.g. Facebook, Twitter, or Wikipedia), while access to other data sources (e.g. individual websites) demands specially adapted software solutions. Both approaches are complicated and prone to different types of errors. With long-term data collections, there is a risk, among other things, that APIs or non-standardized data sources change unnoticed.13 Accordingly, continuous quality assurance is required, which can demand a significant investment of resources and time. Overall, the increasing demands on the breadth, scale, and quality of data collection increasingly require the development of research software adapted to the respective project and can no longer be met with relatively little programming effort and access to isolated APIs.
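One simple building block of continuous quality assurance is to check every collected record against the fields you expect a source to deliver, so that silent changes are noticed early. The following is a minimal sketch, assuming records arrive as Python dictionaries. The field names and the example batch are hypothetical and do not describe the interface of any actual platform.

```python
# Hypothetical set of fields a project expects from a data source.
EXPECTED_FIELDS = frozenset({"id", "created_at", "text", "author_id"})


def schema_deviation(record, expected_fields=EXPECTED_FIELDS):
    """Compare the keys of one record with the expected fields."""
    present = set(record)
    return {
        "missing": sorted(expected_fields - present),
        "unexpected": sorted(present - expected_fields),
    }


def audit_batch(records, expected_fields=EXPECTED_FIELDS):
    """Map the index of each deviating record to its deviation report."""
    flagged = {}
    for index, record in enumerate(records):
        deviation = schema_deviation(record, expected_fields)
        if deviation["missing"] or deviation["unexpected"]:
            flagged[index] = deviation
    return flagged


# Invented batch: the second record simulates a silently renamed field.
batch = [
    {"id": 1, "created_at": "2021-05-01", "text": "hello", "author_id": 7},
    {"id": 2, "created": "2021-05-02", "text": "hi", "author_id": 8},
]
flagged = audit_batch(batch)
print(flagged)  # {1: {'missing': ['created_at'], 'unexpected': ['created']}}
```

In a long-running collection, a check like this would run on every batch and alert the research team as soon as records start to deviate, instead of leaving the problem to be discovered months later during analysis.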

1.2.3 Data preparation

Even less well discussed than data collection are the issues that arise in computational social science projects from the preparation of data for analysis. While APIs provide clearly structured data, unstructured data from less standardized sources must first be structured after collection. This usually requires the transfer of raw data into database structures developed for the research project. Most research projects also require semi- or fully automated labeling steps in which individual data points are supplemented with metadata (e.g. by coding text according to interpretative categories). In the case of extensive projects, these annotations must be secured and stored together with the originally collected data and made available for further analysis. The use of different software in the various steps of data handling, such as collection, structuring, and annotation, complicates this aspect. The design of database structures and work processes that ensure a consistent and high-performance infrastructure for the analysis of complex data sets is not trivial and often requires more than rudimentary knowledge of modeling the corresponding database structures.14 Additional knowledge of software development using various libraries and technologies is often required as well.
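To illustrate what storing annotations together with the originally collected data can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and example rows are assumptions made for illustration. A real project would design its schema around its own sources and coding scheme.

```python
import sqlite3


def build_database(connection):
    """Create one table for raw documents and one for their annotations."""
    connection.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (
            doc_id       INTEGER PRIMARY KEY,
            source       TEXT NOT NULL,
            collected_at TEXT NOT NULL,
            raw_text     TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS annotations (
            annotation_id INTEGER PRIMARY KEY,
            doc_id        INTEGER NOT NULL REFERENCES documents(doc_id),
            coder         TEXT NOT NULL,
            category      TEXT NOT NULL
        );
        """
    )


conn = sqlite3.connect(":memory:")  # in-memory database for the example
build_database(conn)

# Store one invented raw document exactly as collected.
cursor = conn.execute(
    "INSERT INTO documents (source, collected_at, raw_text) VALUES (?, ?, ?)",
    ("forum.example", "2021-05-01T12:00:00", "We need lower taxes."),
)
doc_id = cursor.lastrowid

# Annotations live in their own table, so human and automated codes for the
# same document can coexist without overwriting the raw data.
for coder, category in [("coder_a", "economy"), ("classifier_v1", "economy")]:
    conn.execute(
        "INSERT INTO annotations (doc_id, coder, category) VALUES (?, ?, ?)",
        (doc_id, coder, category),
    )

rows = conn.execute(
    """
    SELECT d.raw_text, a.coder, a.category
    FROM documents AS d
    JOIN annotations AS a ON a.doc_id = d.doc_id
    ORDER BY a.annotation_id
    """
).fetchall()
print(rows)
```

Keeping raw data and annotations in separate but linked tables is one possible design choice. It leaves the collected material untouched and makes it possible to trace every label back to the coder or software version that produced it.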

1.2.4 Linking signals in data to phenomena of interest

The next step in computational social science projects follows the research design and runs parallel to data collection and preparation for analysis: establishing the connection between signals visible in data and phenomena of interest. Examples of this might be specific interaction patterns between Reddit users as an expression of political polarization in society, or mentions of politicians on Twitter as being indicative of their subsequent electoral fortunes. It is important for researchers to critically interrogate their data on whether these signals are actually connected with the phenomenon of interest.

Data emerge based on different data generating processes.15 For example, publicly available Twitter messages are the result of a complicated filtering process leading a user to post a tweet referring to specific topics or persons. Twitter is a performative medium documenting objects of attention or opinions the specific subset of people active on Twitter want to publicly be seen as interacting with or referring to. This makes Twitter a powerful tool to understand dynamics of public attention of politically vocal Twitter users but probably not a tool to understand public opinion in society overall. Data collected on other digital media come with different data generating processes that need to be reflected in the interpretation of identified patterns.

This also means CSS needs to get serious about indicator validation.16 Today, much of CSS relies on face validity. If a digital signal seems to reasonably reflect a phenomenon of interest, no systematic validation tests are undertaken. This allows the quick production of seemingly meaningful findings, speaking to contemporary concerns in public debate. Yet there is a serious danger of mistaking digital signals for phenomena they are not actually documenting, for example mistaking signs of attention to politics for political support or predicting the flu by looking for signs of winter. Measurement in the social sciences often means searching for evidence of latent variables that have no direct, objectively measurable expression. This demands active reflection on, theorizing about, and testing of whether identifiable signals can be reasonably expected to express a concept of interest. This makes reliance on face validity in the social sciences dangerous and prone to error.

1.2.5 Data analysis

The next step in typical computational social science projects, data analysis, is much better documented and well discussed than the previous stages. There is a wide variety of methods available in CSS. The use of methods naturally follows choices in research design and the demands and opportunities connected with the available data. Later, we will be examining some typical analytical approaches within CSS in greater detail, so here I will only briefly mention some analytical approaches and choices available to you.

One typical approach is automated or semi-automated content analysis of different digital corpora. For example, the computationally assisted analysis of text in the social sciences is already very well established and is increasingly complemented by the use of more advanced machine learning methods and the (semi-)automated analysis of image data.17 These analyses can be very rudimentary, for example by identifying and counting the occurrences of specific words in text. Or they can be more demanding, for example by looking for expressions of latent concepts (such as political ideology) in speech or interaction patterns. Analyses can be performed by human coders or automatically. Still, independent of the choice between a simple or demanding analytical target, or between automated and human coding, these studies have to address fundamental questions of coding validity and reliability that have been well established in the literature on content analysis.
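The most rudimentary version of this approach, counting dictionary words, fits into a few lines of code, and so does a basic check of whether automated codes agree with human ones. The following is a minimal sketch. The word list, the example sentences, and the human codes are invented for illustration, and Cohen's kappa is used here as one common agreement measure among several discussed in the content analysis literature.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; a real project would use a validated one.
ECONOMY_TERMS = {"tax", "taxes", "jobs", "economy", "inflation"}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def dictionary_hits(text, dictionary):
    """Count how many tokens of a text appear in the dictionary."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in dictionary)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one and the same category throughout
    return (observed - expected) / (1 - expected)


speeches = [
    "We will cut taxes and create jobs.",
    "Our schools need more teachers.",
    "Inflation hurts the economy and jobs.",
    "The climate crisis demands action.",
]
# Automated coding: 1 if a speech contains any dictionary term, else 0.
automated = [int(dictionary_hits(s, ECONOMY_TERMS) > 0) for s in speeches]
human = [1, 0, 1, 0]  # invented human codes, for illustration only

print(automated)                       # [1, 0, 1, 0]
print(cohens_kappa(automated, human))  # 1.0
```

Perfect agreement on four hand-picked sentences says nothing about a real corpus, of course. The point is only that a reliability check of automated against human coding is cheap to run and should accompany even the simplest dictionary analysis.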

Another approach closely associated with CSS is network analysis.18 Social network analysis is a long-established research practice in social science with a rich body of theories and methods. Methods of network analysis allow the investigation of different relational structures between human (or non-human) actors, often with the aim of understanding the meaning and effects of these structures in different application areas. Instead of an “atomistic” research perspective that sees people primarily as isolated individuals, network analysis pursues a “relational” perspective that takes people’s relationship structures seriously and, by mapping them, tries to identify their impact. Network analysis is a very prominent approach in CSS. This is partly because digital communication as such is closely linked to the concept of networking and networks. Since the corresponding societal processes and individual usage behavior are responsible for much of the digital trace data used in CSS, it is no wonder that network analysis is an obvious choice in the analysis of these data. However, this seemingly intuitive proximity often obscures the necessary interpretative steps involved in analyzing networks based on digital trace data.
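To give a first impression of the relational perspective, the following minimal sketch computes two simple network descriptives, in- and out-degree and the share of reciprocated ties, from a directed list of reply relations. The user names and edges are invented, and the two measures are chosen here merely as easy entry points, not as the measures any particular study relies on.

```python
from collections import defaultdict

# Invented reply edges (source replied to target), for illustration only.
replies = [
    ("ana", "ben"), ("ben", "ana"), ("cem", "ana"),
    ("dia", "ana"), ("dia", "cem"),
]


def degree_counts(edges):
    """Return (in-degree, out-degree) per node of a directed edge list."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    nodes = set(in_degree) | set(out_degree)
    return {node: (in_degree[node], out_degree[node]) for node in nodes}


def reciprocity(edges):
    """Share of distinct directed edges whose reverse edge also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum((target, source) in edge_set for source, target in edge_set)
    return mutual / len(edge_set)


degrees = degree_counts(replies)
print(degrees["ana"])        # (3, 1): three replies received, one sent
print(reciprocity(replies))  # 0.4: two of five edges are reciprocated
```

What a high in-degree means substantively, whether popularity, controversy, or simply being a frequent target of criticism, is exactly the kind of interpretative step that the numbers alone cannot settle.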

Increasingly, there are also studies that connect different data types.19 For example, some studies connect people’s survey responses to their digital traces (such as web tracking data). The benefit of studies following this approach is the opportunity to offset some of the limitations of using only one data type. For example, simply relying on people’s survey responses on what type of news media they claim to have consumed is prone to error. People forget, misremember, or might not admit to consuming specific media. On the other hand, inferring people’s political leaning or opinions simply based on digital traces is also fraught. Combining both data types might in principle provide a broader picture of their behavior and effects of their online behavior or information exposure. Other studies combine data collected on different digital media platforms, following a similar research logic. But while offering a broader view into some questions, these combined approaches bring other drawbacks that need to be critically reflected and accounted for.
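The basic logic of linking survey responses to digital traces can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: respondent IDs, field names, domains, and all values are invented, and a real study would additionally have to handle consent, identity matching, and the classification of domains as news.

```python
from collections import Counter

# Invented toy records; IDs, field names, and values are hypothetical.
survey = {
    "r1": {"reported_news_visits": 5},
    "r2": {"reported_news_visits": 0},
    "r3": {"reported_news_visits": 2},
}
tracking = [
    {"respondent_id": "r1", "domain": "news.example", "is_news": True},
    {"respondent_id": "r1", "domain": "shop.example", "is_news": False},
    {"respondent_id": "r2", "domain": "news.example", "is_news": True},
    {"respondent_id": "r2", "domain": "news.example", "is_news": True},
    {"respondent_id": "r4", "domain": "news.example", "is_news": True},
]


def link_survey_and_tracking(survey, tracking):
    """Join self-reported and tracked news visits per survey respondent.

    Returns the linked records and the IDs of tracked persons for whom
    no survey response exists.
    """
    tracked_news = Counter(
        row["respondent_id"] for row in tracking if row["is_news"]
    )
    linked = {}
    for respondent_id, answers in survey.items():
        reported = answers["reported_news_visits"]
        tracked = tracked_news.get(respondent_id, 0)
        linked[respondent_id] = {
            "reported": reported,
            "tracked": tracked,
            "gap": reported - tracked,  # positive: more reported than tracked
        }
    tracked_ids = {row["respondent_id"] for row in tracking}
    unmatched = sorted(tracked_ids - set(survey))
    return linked, unmatched


linked, unmatched = link_survey_and_tracking(survey, tracking)
print(linked["r1"])  # {'reported': 5, 'tracked': 1, 'gap': 4}
print(unmatched)     # ['r4']
```

Even this toy example surfaces typical problems of combined designs: gaps between reported and tracked behavior that need interpretation, and records that cannot be matched across data sources.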

1.2.6 Presentation: Ensuring transparency and replicability

The final step of any computational social science project is the presentation of its findings. I will not bore you with generalities about the writing and publication process. Instead, let us focus on one crucial element in finalizing a project: providing transparency about your choices and making sure it is replicable by other researchers.

In many social sciences we find important movements that push for the development and institutional adoption of more transparent research practices allowing for a more reliable interrogation of research findings by third parties while at the same time limiting primary researchers’ degrees of freedom in adjusting research questions and designs after knowing the outcomes of data analyses. Proposed remedies include systematically providing public access to data sets, the publication of code underlying data preparation and analysis, and pre-registration of planned research designs and analytical protocols.20 While the importance of this program is recognized in fields such as economics, political science, or psychology, it is largely lacking within CSS.

There are two areas systematically introducing opaqueness into computational social science:

  • data underlying research projects, and
  • transparency with regard to the robustness and inner workings of advanced methods.

One of the central selling propositions of CSS is its use of large and rich data sets. These data sets often stem from commercial online platforms. Accordingly, they come with various concerns regarding the privacy of users whose behavior is documented in them and the intellectual property rights of the companies providing researchers with access to them. This brings two challenges: first, how do we ensure access to relevant data for researchers in the first place; and second, once access has been granted, how can researchers give others access to said data so that they can double-check the findings? In these cases, rules set by platforms governing access to proprietary data can serve as a cloaking device, rendering the data underlying highly visible CSS research opaque. Here, the field has to become more invested in developing data transparency standards and processes. This might mean pushing back against some of the often arbitrary rules and standards of data access set by platform providers. Those are often designed with commercial uses in mind and serve primarily to protect the business interests of platforms and their public image instead of serving the interests of their users or society at large by enabling reliable and valid scientific work.

Another area of opaqueness in CSS arises from the use of advanced computational methods in an interdisciplinary context. The different disciplines at the intersection of CSS come with different strengths and sensibilities. While computer scientists are typically comfortable with and skilled in software development and the use of quantitative methods, social scientists are typically more interested in addressing actual social instead of predominantly technical questions. This brings the danger of scientists driven primarily by an interest in social problems using analytical tools provided by computationally minded colleagues without critically reflecting on these tools’ inner workings and boundary conditions. In the worst case, this can lead to social scientists misdiagnosing social phenomena based on an unreflective use of computational tools and quantitative methods.

At the same time, the development of robust methods in CSS is hampered by a prototype-publication culture. Researchers are incentivized to publish innovative methods which once published are treated as proven by the field. Critical testing of methods and their implementations in code across varying contexts is currently not encouraged by publication practices of the leading conferences and journals in the field. This inhibits the development of a robust collective validation effort of methods and measures.

Already this brief sketch of the typical CSS project pipeline shows the diversity and richness of computational social science. The field is defined neither by specific data types nor by specific analytical methods. Rather, CSS is a broad research approach embracing different methods and perspectives. Individual researchers or even most mono-disciplinary teams cannot convincingly represent this diversity. The future of CSS lies in the interdisciplinary merger of the various social sciences, computer science, and natural sciences. This is easier said than done. As anyone who has tried it will tell you, interdisciplinary research is easy to talk about but difficult to practice. To get better at this, it is important to collect and document the specific experiences of different projects and research teams. Some such accounts are starting to be published.21 But this can only be the beginning of a systematic reflection.

1.3 Text analysis: Typical approaches in computational social science 1

1.3.1 Text analysis in political science

One approach typically associated with computational social science that is very prominently used in political science is text analysis.22 One reason for this is the long tradition of using text corpora in various subfields of political science. Examples include the measurement of positions and preferences of parties based on manifestos, the analysis of positions and tactics of parties and politicians based on parliamentary debates, the automated identification of events and actors in text corpora for the development of event and actor databases in international relations and conflict studies, or the analysis of news coverage in agenda research and discourse studies. These subfields identified early on the potential of computational approaches in the pursuit of their research goals using large text corpora. Accordingly, we find rich traditions and practices of working with text in these subfields going back thirty years or more. This includes standardized approaches to the collection, preparation, analysis, reporting, and provision of large text corpora, starting with human-centered approaches and today reflecting deeply on the uses of and standards for work with computational methods.

Text is a rich medium reflecting cultural and political concepts, ideas, agendas, tactics, events, and power structures of the time. Large text corpora open windows and allow comparisons across countries, cultures, and time. Different types of text contain representations of different slices of culture, politics, and social life. They are therefore of interest to various subfields in the social sciences and humanities.

For example, political texts (such as parliamentary speeches or party manifestos) express the political ideas of the time and the chosen tactics of politicians or parties.23 At the same time, they are also expressions of public performances of these ideas and tactics. They are therefore the product of choices by political actors about how they want to be seen by journalists, their constituents, party members, or the public at large. This makes them not direct expressions of actors’ true nature or positions but instead expressions mediated through conventions associated with the type of text under examination. These and other potentially relevant mediating factors need to be accounted for in subsequent analyses and in the interpretation of any identified patterns. As long as researchers remain aware of these mediating factors, the analysis of corpora of political texts between countries and over time allows for the identification of various interesting political phenomena. These include shifts in the positions of parties, the introduction of new ideas into political discourse, agenda shifts, or tactical choices in language or rhetoric. This and their relatively codified and structured format make political texts popular in political science.

Another important text source in political science is news text.24 News contains a variety of signals that are of great interest to social scientists. This includes the reporting of key political events and figures in politics and society. News covers the what, when, where, who, and sometimes even why of important political or societal events. Extracting these features from journalistic accounts allows the establishment of standardized, large-scale databases of international events and actors. Approaches like these have been successfully used in conflict studies. News texts are also a prominent basis for the analysis of political agenda setting and agenda shifts. By identifying the frequency and timing of the coverage of selected topics, researchers can identify the relative importance events have in press coverage and compare that with their importance in political speech, public opinion surveys, or digital communication environments. Finally, the analysis of news coverage also allows for the analysis of discourse dynamics over time. How are currently important topics discussed in the media, what are the aspects different sides emphasize, what are the arguments, and which prominent speakers are given voice in the media? These questions can be answered based on the analysis of news coverage and provide important insights into the way societies negotiate contested topics, such as foreign policy, immigration, or reproductive rights.

By collecting large text corpora and preparing them for analysis, scholars can therefore access and make available vast troves of knowledge on various questions and in different subfields. The tremendous collective efforts in digitizing text corpora and making them available are a massive accelerating factor in this effort.

To get a better sense of computer-assisted text analysis in action, let’s have a look at three recent studies, using different text types and different methods in answering their respective questions.25

1.3.2 Making sense of party competition during the 2015 refugee crisis with a bag of words

In their 2021 article How the refugee crisis and radical right parties shape party competition on immigration26 Theresa Gessler and Sophia Hunger study a corpus of 120,000 press releases by parties from Austria, Germany, and Switzerland. The authors are interested in whether parties changed their emphasis on the topic of immigration and their position on immigration in their press releases between 2013 and 2017. They ask whether the attention parties paid to immigration in their press releases followed a long-term trend of politicizing the topic, driven by the emergence of far-right parties, or whether attention shifts were instead driven by the heightened levels of public attention to immigration during the events of 2015. With their study, they contribute to the scientific debate about party competition and agenda setting and position their findings with regard to theories in both areas. At the same time, they provide an instructive example of how to anchor an empirical study within theory, creatively establish and justify new data sources, and use an intuitive and comparatively accessible computational method in the analysis of text.

In order to answer their questions, the authors introduce a new data source: monthly press releases by parties. In their research design section, they justify this choice. The predominant data source for work in comparable fields is party manifestos. Those have proved valuable in the study of political competition and agenda shifts, but their characteristics limit their applicability for the authors’ purposes. Due to their sparse publication rhythm, which follows the electoral calendar, they lend themselves to the analysis of long-term trends but not to the identification of short-term shifts driven by current events and sudden shifts in public opinion. Press releases, due to their higher frequency and their connection to current events, are more promising in this regard. This reasoning opens the intriguing possibility that theorizing about long-term trends within party positioning reflects less that the subject is necessarily shaped primarily by long-term trends and more an artifact of the available data sources, which allow only for the analysis of this type of question. Mobilizing new data sources thereby potentially opens up aspects of the phenomenon that remained invisible before.

The authors collected 120,000 press releases from major parties in Austria, Germany, and Switzerland published between 2013 and 2018. To identify press releases referring to immigration, the authors developed a dictionary containing words referring to immigration and integration. To evaluate the performance of their dictionary, the authors hand-coded 750 randomly selected press releases and tested the quality of different dictionary approaches and a specifically trained support vector machine classifier (SVM). Their dictionary outperformed the other dictionaries in the identification of immigration-related press releases and performed on par with the SVM. Accordingly, they chose their dictionary, which is computationally less demanding and at the same time interpretatively more accessible, over the SVM to classify the remaining press releases. The share of press releases identified in this way as referring to immigration among all press releases in a given month allows the authors to identify the relative salience of the topic and its shifts over time in comparatively high temporal resolution.27
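To make the logic of this step concrete, here is a minimal sketch of dictionary-based classification followed by a monthly salience measure. It is not the authors' code: the term list, the example press releases, and the function names (`mentions_immigration`, `monthly_salience`) are hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary; the authors' actual term list is not reproduced here.
IMMIGRATION_TERMS = {
    "immigration", "migrant", "migrants", "asylum",
    "refugee", "refugees", "integration",
}


def mentions_immigration(text, terms=IMMIGRATION_TERMS):
    """Flag a text if at least one of its tokens is a dictionary term."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return any(tok in terms for tok in tokens)


def monthly_salience(releases):
    """Share of flagged press releases per month.

    `releases` is an iterable of (month, text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, text in releases:
        totals[month] += 1
        if mentions_immigration(text):
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in totals}


# Invented toy data standing in for a real corpus of press releases.
releases = [
    ("2015-08", "New budget plan for rural schools"),
    ("2015-09", "Our position on asylum procedures"),
    ("2015-09", "Refugees need fast integration courses"),
    ("2015-09", "Statement on pension reform"),
]
salience = monthly_salience(releases)
```

In a real project, the flags produced by `mentions_immigration` would first be compared against a hand-coded sample, as the authors did, before trusting the resulting salience series.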

In order to identify the relative position of parties on immigration, the authors use Wordscores.28 First developed by Michael Laver, Kenneth Benoit, and John Garry, Wordscores try to identify the relative topical positions of parties based on the similarity and distinctiveness of the words they use in text. The more similar the words, the more similar the positions. The more distinctive the words, the further apart the positions. Simplifying their approach somewhat, Wordscores allow Gessler and Hunger to identify whether parties meaningfully diverge from their original word use regarding immigration and whether over time they converge toward or diverge from the words used by parties of the radical right in their sample. They take this as a proxy for position shifts of parties with regard to immigration, either accommodating or confronting positions of the radical right.
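The core idea of Wordscores can be sketched in a few lines. The following is a simplified illustration of the basic scoring step as commonly described for Laver, Benoit, and Garry's method: words inherit scores from reference texts with known positions, and a new text is scored by the average score of its words. It deliberately omits the rescaling and uncertainty estimates of the full method, and the reference texts, the positions (-1 and +1), and the function names are invented for this example.

```python
from collections import Counter


def rel_freq(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Score each word by the positions of the reference texts using it.

    A word's score is the average of the reference positions, weighted by
    the probability that we are reading a given reference text when we
    see that word.
    """
    freqs = {name: rel_freq(toks) for name, toks in reference_texts.items()}
    vocab = set().union(*freqs.values())
    scores = {}
    for word in vocab:
        total = sum(f.get(word, 0.0) for f in freqs.values())
        scores[word] = sum(
            f.get(word, 0.0) / total * reference_positions[name]
            for name, f in freqs.items()
        )
    return scores


def text_score(tokens, scores):
    """Mean word score over all tokens that appear in the reference texts."""
    scored = [tok for tok in tokens if tok in scores]
    if not scored:
        return None  # no overlap with the reference vocabulary
    return sum(scores[tok] for tok in scored) / len(scored)


# Invented reference texts with assumed positions on a -1 to +1 scale.
reference_texts = {
    "party_a": "open borders welcome welcome".split(),
    "party_b": "closed borders control control".split(),
}
reference_positions = {"party_a": -1.0, "party_b": 1.0}

scores = word_scores(reference_texts, reference_positions)
new_text_position = text_score("borders control now".split(), scores)
```

Words used equally by both reference texts (here "borders") receive a score in the middle of the scale, while words unique to one text inherit that text's position. Words that never occur in the reference texts (here "now") do not contribute at all, which is one reason the choice of reference texts matters so much.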

Using these approaches, Gessler and Hunger find that, controlling for other factors, mainstream parties did seem to react to the greater attention paid to immigration by radical right parties during the refugee crisis by increasing their own attention to the topic in their press releases. But after the crisis subsided, they decreased their attention to the topic back to its original level. In contrast, with regard to positions, hardly any party converged toward the positions taken by the far right.

With their study, Gessler and Hunger not only provide compelling evidence on political competition between European mainstream and radical right parties during the 2015 refugee crisis. They also show how mobilizing a new data source can provide new evidence, allowing new perspectives in the analysis of scientifically long established subfields. By capitalizing on the greater temporal resolution provided by press releases, the authors open a window into short term patterns in political competition, otherwise invisible to researchers depending on established data sources only available in much lower frequency. The study is also an interesting case in attempting to identify shifts in two latent concepts (i.e. relative topic salience and position of parties) based on the analysis of text.

1.3.3 Who lives in the past, the present, or the future? A supervised learning approach

In his 2022 article The Temporal Focus of Campaign Communication29 Stefan Müller analyses the degree to which parties refer to the past, the present, and the future in their party manifestos before upcoming elections. He anchors this question in voting behavior theory, which considers retrospective and prospective considerations of voters. Temporal considerations thus matter to voters in elections, but do they also matter for campaign communication by parties? To answer this question, Müller analyses 621 party manifestos published between 1949 and 2017 in nine countries. Unlike in Gessler & Hunger (2022), this time the established data source is clearly up to the task. Party manifestos are directly connected with elections and should reflect tactical considerations by parties regarding their self-presentation toward partisans, constituents, journalists, and the public at large.

To answer his question, Müller collected all machine-readable manifestos in English or German from the Manifesto Corpus, leaving him with 621 manifestos from nine countries.30 He then had human coders label sentences as referring to the past, present, or future. Either by direct labelling or by using pre-labeled data sets, Müller ended up with an annotated sample of 5,858 English and 12,084 German sentences. This allowed him to train and validate several different computational approaches for the classification of the remaining sentences in the data set. He trained and validated a Support Vector Machine (SVM), a Multilayer Perceptron Network, and a Naive Bayes Classifier. Since all classifiers performed comparably well, the author chose the SVM, as it provided the best trade-off between performance and computational efficiency.
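The general logic of supervised classification (learn from labeled sentences, then label the rest) can be illustrated with a small example. The sketch below implements a multinomial Naive Bayes classifier with add-one smoothing in plain Python, one of the classifier families Müller compared. It is not his implementation: the six training sentences, the labels, and the function names are invented, and a real application would use thousands of annotated sentences and a held-out set for validation.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled):
    """Count classes and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens, model):
    """Return the label with the highest (log) posterior probability."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in class_counts:
        logp = math.log(class_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                logp += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented toy training data standing in for human-coded sentences.
training = [
    ("we will invest in schools".split(), "future"),
    ("we will build new housing".split(), "future"),
    ("we have reduced unemployment".split(), "past"),
    ("the last government failed".split(), "past"),
    ("our country is strong today".split(), "present"),
    ("families are struggling now".split(), "present"),
]
model = train_nb(training)
label_a = predict_nb("we will reduce taxes".split(), model)
label_b = predict_nb("the government has failed".split(), model)
```

With this toy data, the first sentence is classified as referring to the future and the second as referring to the past, driven by words such as "will" and "failed". Comparing such predictions against human labels on sentences not used for training is what allows a researcher to judge whether a classifier is good enough to label the remaining corpus.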

Through this approach, Müller finds that 54% of sentences refer to the future, 37% to the present, and 9% to the past. But there is some variation between countries. In general, though, it appears that incumbent parties focus somewhat more on the past than opposition parties. This makes sense considering the different roles of incumbent and opposition parties in political competition. Incumbents run at least in part on their supposedly positive record of the past, and opposition parties naturally challenge said record.

To get a better sense of how parties refer to the past, present, or future, the author uses German and English versions of the Linguistic Inquiry and Word Count (LIWC) sentiment dictionary.31 The dictionary lists terms that for test-subjects carried positive or negative emotional associations. By calculating the emotional loading of words used in sentences referring to the past, present, or future, Müller infers whether parties spoke positively or negatively about different temporal targets.
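A dictionary-based sentiment score per temporal category can be sketched as follows. Note the assumptions: the LIWC word lists are not reproduced here, so the two short word lists below are hypothetical, and the scoring rule (positive minus negative terms, divided by sentence length) is one simple option among several, not necessarily the formula used in the paper.

```python
from collections import defaultdict

# Hypothetical word lists for illustration only.
POSITIVE = {"success", "improved", "strong", "secure"}
NEGATIVE = {"failed", "crisis", "decline", "unfair"}


def sentiment(tokens):
    """(positive terms - negative terms) / number of tokens."""
    if not tokens:
        return 0.0
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / len(tokens)


def mean_sentiment_by_class(sentences):
    """Average sentence sentiment for each temporal label.

    `sentences` is an iterable of (label, tokens) pairs, where the label
    would come from the temporal classifier.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, tokens in sentences:
        sums[label] += sentiment(tokens)
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Invented example sentences with assumed temporal labels.
sentences = [
    ("past", "the last government failed in the crisis".split()),
    ("future", "we will build a strong and secure country".split()),
]
tone = mean_sentiment_by_class(sentences)
```

Splitting such averages further by party role (incumbent or opposition) would then allow the kind of comparison reported in the study.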

Using this approach, Müller finds that opposition parties tend to speak more negatively about the past than incumbents. Again, this finding is in line with the different roles of incumbent and opposition parties in political competition.

By using a pre-existing data set, the Manifesto Corpus, and further annotating it, Müller can show that parties indeed use temporal references differently according to their structural roles in political competition. The paper offers an interesting example for the use and evaluation of various supervised computational classifiers in the analysis of large data sets, enabling classification efforts for data sets whose size would make manual classification infeasible.

1.3.4 Political innovation in the French Revolution

And now, let’s have a little bit of fun.

In their 2018 article Individuals, institutions, and innovation in the debates of the French Revolution32 Alexander T. J. Barron, Jenny Huang, Rebecca L. Spang, and Simon DeDeo present an analysis of speeches held during the first parliament of the French Revolution, the National Constituent Assembly (NCA), sitting from July 1789 to September 1791. They have access to a corpus provided by the French Revolution Digital Archive containing 40,000 speeches by roughly a thousand speakers. The speeches held during this time frame are of great interest not only to historians but also parliamentary and democracy scholars, since they open a window into the process of epistemic and political sense making and innovation processes within one of the first modern parliamentary bodies that provided the template for many subsequent parliamentary institutions and democratic discourse in general.

Barron and colleagues approach the text corpus through the lens of information theory. They are interested in the determinants of the emergence of new ideas in parliamentary discourse and of their persistence. For this they identify distinct word combinations through latent Dirichlet allocation (LDA).33 LDA is a popular automated approach for reducing the dimensionality of text to a set of automatically identified topics that are characterized by the frequent clustered occurrence of words. Barron and colleagues use LDA to identify clusters of co-occurring words (topics) and assign them two metrics that they calculate based on ideas from information theory: novelty and transience.

With novelty, the authors refer to the degree to which a distinct word pattern in a speech, as identified through LDA, differs from patterns in prior speeches. With transience, the authors refer to how the same word pattern differs from those in future speeches.34 The higher the transience, the greater said difference. To measure these metrics for each distinct word pattern, they calculate a measure called Kullback-Leibler Divergence (KLD). Doing so allows the authors to quantify two important but interpretatively demanding concepts: How frequently are new ideas introduced in parliamentary discourse and how long do they persist? Once you have quantified these features of distinct word packages, you can start looking for determinants of either outcome. Who is responsible for the introduction of new ideas? And what are the contextual conditions for these ideas to survive and thrive?
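The two metrics can be illustrated with a short sketch. It assumes that each speech has already been reduced to a topic mixture (a list of probabilities summing to one, as LDA would provide) and that novelty and transience are computed as the average KLD between a speech and a window of preceding or following speeches. The topic mixtures, the window size, and the function names are invented for illustration; the authors' exact implementation may differ.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence of p from q, in bits.

    Assumes q is strictly positive wherever p is positive, which holds
    for typical LDA topic mixtures.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty(speeches, j, window):
    """Average divergence of speech j from the `window` speeches before it."""
    past = speeches[max(0, j - window):j]
    if not past:
        return None
    return sum(kld(speeches[j], s) for s in past) / len(past)


def transience(speeches, j, window):
    """Average divergence of speech j from the `window` speeches after it."""
    future = speeches[j + 1:j + 1 + window]
    if not future:
        return None
    return sum(kld(speeches[j], s) for s in future) / len(future)


# Invented topic mixtures over three topics for four consecutive speeches.
speeches = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],  # breaks with what came before ...
    [0.1, 0.2, 0.7],  # ... and the next speech follows its lead
]
n = novelty(speeches, 2, window=2)
t = transience(speeches, 2, window=2)
```

In this toy sequence, the third speech differs sharply from the two before it (high novelty) but is identical to the one that follows (zero transience): a new pattern that sticks.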

Barron and colleagues show that in general the National Constituent Assembly (NCA) was a parliamentary environment clearly open to the introduction of novel ideas, but many of these new ideas did not persist long. Still, many speeches were at once highly novel and only weakly transient, a condition the authors call resonance. Looking closely, the authors show that individuals differ with regard to their tendency to introduce new and resonant ideas. In fact, among the top 40 orators, high-novelty speakers are usually associated with the political left and the bourgeoisie. In contrast, low-novelty speakers are on the political right and belong to the nobility. Going into even further detail, the authors show that high-profile individuals can deviate from these patterns, such as the left-wing radicals Maximilien Robespierre and Jérôme Pétion de Villeneuve, whose speeches showed exceptionally high values of novelty and resonance. They consistently introduced new ideas in their speeches that were picked up by others and persisted over time. On the other side of the political spectrum, speakers like Jean-Siffrein Maury and Jacques de Cazalès exhibited low novelty and high resonance. The authors take this as evidence for their role in keeping the conversation in parliament coherent (low novelty) while at the same time being able to influence its future course (high resonance).

The authors point out that their results correspond with previous findings by historians taking more traditional analytical routes. But by translating meaningful and interpretatively demanding concepts into a small set of elegant quantitative metrics, Barron and colleagues provide a systems-level view of innovation and persistence in parliamentary debate. This further allows them to quantitatively identify the impact of features on individual, structural, and institutional levels for the introduction and subsequent fate of new ideas.

The study by Barron and colleagues shows powerfully how the use of computational methods combined with innovative theoretical concepts can open up new insights not only into our present and digital life but also into the past. Studies like these are bound to grow both in frequency and resonance with the continued digitization of ever more historical data sets and archives, and they offer promising perspectives for interdisciplinary research.

Why not start by checking which other historical parliamentary records are available to you?

As with any set of examples, I could have chosen different studies that would have been just as interesting or would have provided insights into different approaches to text analysis. But already these three brief examples illustrate the breadth in available data sets, methods, and questions computational text analysis can be employed to answer. It is no wonder then to find this approach to be a highly prominent pursuit in computational social science and beyond.

In reading studies like these, or in CSS in general, always make sure to check for online appendices. Often, this is where you find the actual information about the ins and outs of doing the analysis, sometimes get instructions for replicating reported findings, and get a much better sense in general of how a specific method was implemented.

1.4 Digital trace data: Typical approaches in computational social science 2

1.4.1 Digital trace data

As we have seen, in computational social science there are great hopes and enthusiasms connected with the availability of new data sources. Particularly one new data source features in these accounts: digital trace data.35

Once people interact with digital devices (such as smart phones and smart devices) and services (such as Facebook or Twitter), their digitally mediated interactions leave traces on devices and services. Some of those are discarded, some are stored. Some are available only to the device maker or service provider, some are available to researchers. This last category of digital trace data, those that are stored and available to researchers, has spawned a lot of research activity and enthusiasm over a new measurement revolution in the social sciences. But somewhat more than ten years into this “revolution”, the limits of digital trace data for social science research are becoming just as clear as their promises. Before we look at studies using digital trace data, it is therefore necessary that we look a little more closely at what they are, what characteristics they share, and how this impacts scientific work with them.

In their 2011 article Validity Issues in the Use of Social Network Analysis with Digital Trace Data James Howison, Andrea Wiggins, and Kevin Crowston define digital trace data as:

“(…) records of activity (trace data) undertaken through an online information system (thus, digital). A trace is a mark left as a sign of passage; it is recorded evidence that something has occurred in the past. For trace data, the system acts as a data collection tool, providing both advantages and limitations. The task for using this evidence in network analysis is to turn these recorded traces of activity into measures of theoretically interesting constructs.”

Howison et al. (2011), p. 769.

Replace the term network analysis in the last sentence with social science and you have the crucial task in working with digital trace data before you. This translation of “traces” into “theoretically interesting constructs” demands accounting for specific characteristics of digital trace data. Back to Howison and colleagues:

“1) it is found data (rather than produced for research), 2) it is event-based data (rather than summary data), and 3) as events occur over a period of time, it is longitudinal data. In each aspect, such data contrasts with data traditionally collected through social network surveys and interviews.”

Howison et al. (2011), p. 769.

Again, replace social network with social science and you are good to go. It is probably best you go, find the article, and read this section of the text yourself, but let me briefly explicate the concerns expressed by Howison and colleagues as they relate to social science more generally.

First, digital trace data are found data. This makes them different from data specifically designed for research purposes. Usually, social scientists approach a question through a research design specifically developed to answer it. You are interested in what people think about a politician? You ask them. You are interested in how a news article shifts opinions? You design an experiment in which you expose some people to the article but not others and later ask both groups for their opinion on the issue discussed in the article. With digital trace data, you do not have that luxury. Instead, you often start from the data and try to connect it back to your interests. What do people think about a politician? Well, maybe look on Twitter and count her mentions. If you want to get really fancy, maybe run the messages through a sentiment detection method and count the mentions identified as positive and those identified as negative. Want to identify the effects of a news article? Check if people’s traces change after exposure. Already these two examples show that found data can be used in interesting ways. At the same time, you often have to compromise in working with them. Thinking purely from the perspectives of research design and identification approach, digital trace data will often leave you frustrated as they simply might not cover what you need. On the other hand, once you come to terms with the signals that are available to you in found data, you might land on new questions and insights that you might have missed following a purely deductive approach. This is especially true for questions regarding the behavior of users in digital communication environments or the inner workings of said environments.

Second, digital trace data are event data. They document interactions and behavior in collections of single instances: liking a Facebook page, writing an @-message on Twitter, commenting on a post on Reddit, editing a Wikipedia page, clicking on a site, and so on. These events can carry a lot of information. For example, measuring the impact of an ad through clicks on featured links is a perfectly good approach, as far as this goes. But in social science, we are often interested not only in specific interaction or behavior events. Instead, we ask for the reasons underlying these events, such as attitudes or psychological traits. To get at these, we need to understand how users interpret the action that led to a data trace. Take one of the examples from earlier: Researchers who want to understand public opinion on a politician or a political topic often look for mentions of an actor or topic in digital trace data. But the motives for mentioning a politician or topic on a digital service vary. One could express support or critique, neutrally point to a topically connected event or quote, or try to be funny in front of friends and imagined audiences. Some of these motives might be identifiable by linguistic features around the term, others might not. Connecting the event visible in digital trace data (mentions of actors or topics) with the concept of interest (attitudes toward them) means taking into account the data generating process linking the documented event to the concept of interest. This step is crucial in the work with digital trace data but often neglected in favor of a naive positivism, de facto positing that digital traces do not lie and speak to everything we happen to be interested in.36

Third, Howison et al. (2011) point out that digital traces demand a longitudinal perspective. Data documenting singular events need to be aggregated in order to speak to a larger phenomenon of interest. But which aggregation rule is appropriate? That is open to question. For example, many sociologists are interested in the effects of friendship relations between people. Friendship is an interpretatively demanding concept. This raises many measurement problems. Some people interact often with people they consider friends. Others interact only seldom with people they consider friends but hold them in deep affection and trust. Just looking at interactions in person, on the phone, or online will therefore not necessarily tell us whom our subjects consider friends. Traditionally, sociologists would survey people to identify those they themselves consider friends. They therefore have access to the result of the personal calculus of respondents over all interactions and experiences with a person, resulting in their assessment of the person as a friend or not. Simply looking at digital trace data, such as email exchanges, public interactions on Twitter, or co-presence in space measured by mobile phones or sensors, only provides us with single slices of this calculus. This leaves us to guess the aggregation rule translating single events visible in digital trace data into the latent concept of interest. This is true for social relationships, such as friendship, but also for other concepts of interest, such as attitudes or traits. Researchers need to be very careful and transparent in how they choose to aggregate the events visible to them in digital trace data and take them as an expression of their concept of interest, especially if it is an interpretatively demanding concept.
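How much the choice of aggregation rule matters can be shown with a toy example. The sketch below turns the same list of directed interaction events into undirected ties under three different rules. The events, the names, the threshold, and the three rules themselves are invented for illustration and are not taken from Howison and colleagues.

```python
from collections import Counter


def ties_any(events):
    """Rule 1: any single interaction counts as a tie."""
    return {frozenset(pair) for pair in events}


def ties_min_count(events, k):
    """Rule 2: a tie requires at least k interactions, in any direction."""
    counts = Counter(frozenset(pair) for pair in events)
    return {pair for pair, n in counts.items() if n >= k}


def ties_reciprocated(events):
    """Rule 3: a tie requires interaction in both directions."""
    directed = set(events)
    return {frozenset((a, b)) for a, b in directed if (b, a) in directed}


# Invented (sender, receiver) events, e.g. @-messages on a platform.
events = [
    ("ann", "bob"), ("bob", "ann"), ("ann", "bob"),
    ("ann", "cem"), ("ann", "cem"),
    ("dia", "ann"),
]

network_any = ties_any(events)            # ann-bob, ann-cem, ann-dia
network_min2 = ties_min_count(events, 2)  # ann-bob, ann-cem
network_recip = ties_reciprocated(events) # ann-bob only
```

The same six events yield three, two, or one tie depending on the rule. None of these networks is the "true" friendship network; each embodies a different assumption about how events relate to the underlying relationship, which is why the chosen rule needs to be reported and justified.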

Finally, Howison et al. (2011) emphasize that digital trace data are “both produced through and stored by an information system” (p. 770). This is important to remember. It means that both the recording of the data and access to it depend on the workings of said information system and the organization running it. For one, this means that it is important to differentiate between social and individual factors contributing to an event documented in a data trace and features of the underlying information system. An example of an individual factor leading to a data trace could be my support for a given candidate that makes me favorite a tweet posted on her Twitter feed. Alternatively, a system-level feature could be an algorithm showing me a tweet by said candidate prominently in my Twitter feed, in reaction to which I favorite it in order to read it later. These are two different data generating processes driven by different motives, but they are not discernible by simply looking at the digital trace of the event.

The second consequence of the prominent mediating role of the information system is our dependence on its internal reasons for recording and providing access to data traces. Researchers depend on information systems and their providers for access and for the rules governing it. Access can be comparatively rich, as is currently the case with Twitter, or comparatively sparse, as is currently the case with Facebook. In any case, shifts in access possibilities and rules are always possible and do not need to follow coherent strategies. This makes research highly dependent on the organizations collecting and providing access to data and introduces a highly troubling set of concerns regarding the ethics of data access, conflicts of interest between researchers, organizations, and users, and the transparency and replicability of research findings.

While these challenges persist in the work with digital trace data, this new data type has found a prominent place in social and political science. Unfortunately, the degree to which these challenges are reflected in actual research varies considerably. Nevertheless, let’s take a closer look.

1.4.2 Digital trace data in political science

Digital media have led to far-reaching changes in social life and human behavior.37 These new phenomena lead to new research questions, which digital trace data have been used to address. Areas in political science that have seen the strongest impact of digital media and research using digital trace data include the practice of politics, political communication, structures and dynamics in the public arena, and discourses.

Various studies using digital trace data focusing on politics and political communication have addressed the behavior of political elites, partisans, and the public.38 This includes the use of digital media in political campaigns, protests, or the coordination of citizens and civil society.

Other studies examine structures of the public arena, media use, and discourses in digital communication environments. This can mean, for example, investigating how people use the media. While the traditional approach would exclusively ask people in surveys for their media usage patterns, digital trace data offer powerful additional perspectives through greater reliability and greater resolution.39 An example of this is web tracking data that document website visits by respondents. In addition to new perspectives on the use of news sources, digital trace data also offer new perspectives on the influence of different media types and sources. Here, the agendas of the most important issues in different digital and traditional media and their mutual influence dynamics are promising research areas.

More specific to digital media still are studies focusing on usage dynamics and behavioral patterns of people in their use of digital services. This focus area lends itself especially well to study through digital trace data.40 Examples include studies focusing on Facebook, Reddit, Twitter, or YouTube. Beyond the study of commercial services, digital trace data have also been used successfully in studying the behavior of people on e-government services, such as online petitions.

Other studies are using digital trace data to examine the ways governments react to the challenge of digital media.41 In particular, authoritarian states see themselves increasingly challenged in their control of the public by digital media and the associated new possibilities for information and coordination of their population. Digital trace data have provided researchers with promising instruments for documenting and examining digital media provision and attempts at government control in different countries.

Beyond the study of phenomena arising directly from the use of digital devices and services, researchers are also trying to infer general phenomena based on signals in digital trace data.42 Examples include the estimation of political alignments of social media users or the prediction of public opinion or election results. While these studies are often original and sophisticated in their use of methods, the validity of their findings is contested since they often risk misattributing meaning to spurious correlations between digital traces and larger societal phenomena.

Now, let’s look at two studies a little more closely to get a better sense of how work with digital trace data actually looks.

1.4.3 Making sense of online censorship decisions

The most direct way to use digital trace data is to learn about the digital communication environment they were collected in. But before your eyes glaze over in expectation of another starry-eyed discussion of hashtag trends on Twitter, there is more to this than the good, the bad, and the ugly of social media. For example, looking at what happens on digital media can tell us a lot about how states regulate speech or try to control their public. One important example of this is China.

By now, there are a number of highly creative and instructive studies available that use data collected on digital media to understand the degree, intensity, and determinants of Chinese censorship activity. In their 2020 paper Specificity, Conflict, and Focal Point: A Systematic Investigation into Social Media Censorship in China43 the authors Yun Tai and King-Wa Fu examine censorship mechanisms on WeChat.

WeChat is an important social media platform in China, which in December 2019, was reported to have more than 1.1 billion monthly active users. It is an umbrella application that bundles many different functions for which Western users would have to use different applications. For example, WeChat allows, among other functions, blogging, private messaging, group chat, or e-payment. Users and companies can publish dedicated pages on which they can post messages and interact with others.

WeChat provides no standardized access to its data through an API. So the authors developed dedicated software to crawl the app, which they termed WeChatscope. The software uses a set of dummy accounts that subscribe to WeChat pages of interest. New URLs posted on these pages are saved and then revisited hourly for 48 hours. At each visit, the crawler scrapes the page behind the URL and saves meta-data and media content in a database. If a page disappears, the software saves the official reason for removal given by the platform. The reason “content violation” is given in cases where content is deemed to violate related laws and regulations.

For their study, Yun Tai and King-Wa Fu collected 818,393 public articles on WeChat that were published between 1 March and 31 October 2018. These articles were posted by 2,560 public accounts. Of the collected articles, 2,345 were removed for “content violation”. These articles are what the authors are interested in. More precisely, they are interested in how these articles censored by Chinese regulators differed from others. To find out, they first paired each censored article with a non-censored article published on the same account that was topically as similar as possible to the censored article. To identify those pairs of censored and non-censored articles, the authors ran correlated topic models (CTM). This left them with 2,280 pairs of articles published on 751 accounts. To identify the potentially minute differences between articles that led to censorship, the authors used a random forest model. The approach is well suited for identifying meaningful signals in large numbers of input variables.44
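The logic of the pairing step can be sketched in a few lines of Python. This is an illustrative stand-in under loud assumptions, not the authors' procedure: it swaps the correlated topic model for a plain cosine similarity on word counts, the toy articles are invented, and it lets several censored articles share the same match, which the study may have handled differently.

```python
import math
from collections import Counter

# Hypothetical toy articles: (article_id, account, censored, text).
articles = [
    ("a1", "acct1", True,  "protest planned at city square on friday"),
    ("a2", "acct1", False, "protest movements in history and theory"),
    ("a3", "acct1", False, "recipe for noodles with sesame sauce"),
    ("b1", "acct2", True,  "factory workers strike over unpaid wages"),
    ("b2", "acct2", False, "workers wages rise across factory sector"),
]

def cosine(c1, c2):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0

def match_pairs(articles):
    """Pair each censored article with the most similar non-censored
    article published on the same account."""
    vectors = {aid: Counter(text.split()) for aid, _, _, text in articles}
    pairs = {}
    for aid, account, censored, _ in articles:
        if not censored:
            continue
        candidates = [
            other for other, acc, cens, _ in articles
            if acc == account and not cens
        ]
        if candidates:
            pairs[aid] = max(
                candidates, key=lambda other: cosine(vectors[aid], vectors[other])
            )
    return pairs

pairs = match_pairs(articles)
print(pairs)  # {'a1': 'a2', 'b1': 'b2'}
```

Matching within accounts holds the publisher constant, so that remaining differences between the two articles in a pair are more plausibly related to the censorship decision than to who posted them.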

Using textual terms to predict censorship decisions, the authors identified “perilous” words, terms whose appearance was more frequent in censored articles than remaining articles. They further differentiated between general terms and those that were unique identifiers of entities (such as place names or organizations), times, or quantities. They found that these “specific” terms were especially perilous, increasing the probability for the censorship of articles considerably.
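As a rough illustration of what makes a term “perilous”, the following Python sketch compares how often a term appears in censored versus non-censored documents. It is a deliberately simple stand-in and an assumption on my part: the authors relied on a random forest model, not on this difference in document shares, and the toy documents are invented.

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration.
censored_docs = [
    "rally at river street on friday",
    "meet at river street tonight",
    "opinion on the new policy",
]
uncensored_docs = [
    "opinion on the rally last year",
    "history of the new policy",
    "walk along the river tonight",
]

def document_share(docs):
    """Share of documents in which each term appears at least once."""
    counts = Counter(term for doc in docs for term in set(doc.split()))
    return {term: n / len(docs) for term, n in counts.items()}

def share_difference(censored_docs, uncensored_docs):
    """Share in censored documents minus share in non-censored documents."""
    cens = document_share(censored_docs)
    unc = document_share(uncensored_docs)
    return {t: cens.get(t, 0.0) - unc.get(t, 0.0) for t in set(cens) | set(unc)}

diff = share_difference(censored_docs, uncensored_docs)

# Rank terms from most to least over-represented in censored documents.
ranked = sorted(diff.items(), key=lambda item: (-item[1], item[0]))
for term, d in ranked[:5]:
    print(term, round(d, 2))
```

In this toy example the specific place marker "street" ranks near the top, while the general word "the" is more common in the non-censored documents. A function word such as "at" also ranks highly, which shows why a raw frequency comparison like this is no substitute for the modeling and validation steps in the actual study.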

The authors go on and add to this analysis in further steps. But for our purposes, we have seen enough. So, let’s stop here.

The authors connect their very specific findings to literature on conflict, multi-party games, and coordination. Based on the considerations from these theoretical literatures, they conclude that Chinese censorship reacts strongly and negatively to “specific terms” as those might serve as focal points for subsequent coordination or mobilization of users. Thus not only ideas are suppressed but also the linguistic signifiers allowing for coordination of people around specific causes or places.

The study of government censorship is a continuously moving target, as it remains a hare and hedgehog race between the censor and the censored. The study by Yun Tai and King-Wa Fu is therefore surely not the final word on internet censorship or Chinese censorship. Still, their study provides an important puzzle piece in this debate. More important for us, their study provides a highly creative and instructive example of how to use data collected on digital media for the study of speech and government control. Further, it is also an interesting case of how to connect the highly specific and often abstract results from computational text analysis with more general theoretical debates in the social sciences.

All the more reason for you to read the study yourself.

1.4.4 It’s attention, not support!

Sometimes, we do not only want to learn about what happens in digital communication environments. Sometimes, we want to learn about the world beyond. Digital trace data can also help us in the pursuit of these questions. But we need to be a little more careful in reading them, in order not to be misled. A look at studies using digital trace data to learn about public opinion is instructive.

Many people use social media to comment publicly about their views about politics or comment on current events. Taking these messages and trying to learn about public opinion could be a good idea. Right from the start, there are some obvious concerns.

First, not everybody uses social media and those who do differ from the public at large. Also, not everybody who uses social media posts publicly about politics or news.45 So we are left with an even smaller, potentially even more skewed section of the public on whose public messages we base our estimate of public opinion.

Second, public posts on politics and the news are public. This might seem like a truism but is a problem for studying public opinion. Many people hold opinions about politics and the news, but only very politically active and dedicated people will post those publicly, especially in the case of political controversy. By publicly commenting about politics on social media, people demonstrate their political allegiances and convictions for all the world to see. This includes family, friends, colleagues, competitors, and political opponents. Everyone can see what they think about politics and is invited to comment, silently judge, or screenshot.

Third, not only academics are turning to social media to learn about public opinion. For example, journalists, politicians, and campaign organizations are all watching trends and dynamics on social media closely to get a better sense of what the public thinks or is worried about.46 We might worry about the power they give social media to influence their actions or thinking, given the biases listed above, but this will not make them stop doing it. So anyone publicly posting about politics might not be doing this to express their honest and true opinion. Instead, they might be doing it tactically to influence the way journalists or politicians see the world and evaluate which topics to emphasize or which positions to give up.

As a consequence, studying public opinion based on public social media messages means studying comments, links, and interactions by a highly involved, potentially partisan, non-representative group of people, a portion of whom might be posting tactically, in order to influence news coverage or power dynamics within political factions.

Anyone looking at these obstacles and still thinking that social media posts might be a good way to learn about political opinion truly must be fearless. But as it turns out, there are many who try to do just that. Could it be that they are right?

In joint work, Harald Schoen, Oliver Posegga, Pascal Jürgens, and I decided to check what public Twitter messages can tell us about public opinion and what they can’t. In the 2017 paper Digital Trace Data in the Study of Public Opinion: An Indicator of Attention Toward Politics Rather Than Political Support47 we compared Twitter-based metrics with results from public opinion polls.

To get access to relevant Twitter messages, we worked with the social media data vendor Gnip. We bought access to all public Twitter messages posted during a time span of three months preceding Germany’s federal election of 2013 containing mentions of eight prominent parties. Given Twitter’s current data access policy, we probably would have simply used the official Twitter API to identify and access relevant messages. Back then, this was not possible. As we were only interested in public opinion in Germany, we only considered messages by users who had chosen German as interface language in interacting with Twitter. This choice might underestimate the total number of messages referring to the parties in question but the resulting error should not systematically bias our findings.

We then calculated a number of Twitter-based metrics for each party following prior choices by authors trying to infer public opinion during election campaigns based on Twitter. This included the mention count of parties in keywords and hashtags, the number of users posting about parties in keyword or hashtag mentions, the number of positive and negative mentions, and the number of users posting positively or negatively about a party. To identify the sentiment of messages mentioning a party, we used a Twitter convention prevalent in Germany at the time. Users identified a message as being in support or opposition to a party by using its name in a hashtag followed by a + or - sign (e.g. #cdu+ or #csu-). This choice might not be replicable across other countries or time, but it is a robust approach to check whether for our case explicitly positively or negatively tagged messages are more strongly connected with public opinion than normal mentions.
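The hashtag convention lends itself to a simple pattern match. The following Python sketch counts hashtag mentions, explicitly positive and negative mentions, and distinct users per party. It is a minimal illustration under stated assumptions, not the code used in the paper: the messages and user ids are invented, only three party tags are covered, and keyword mentions outside hashtags are ignored.

```python
import re
from collections import defaultdict

# Hypothetical messages: (user_id, text).
messages = [
    ("u1", "Strong speech today #cdu+"),
    ("u2", "Not convinced at all #cdu-"),
    ("u1", "Watching the debate #cdu #spd"),
    ("u3", "#spd+ all the way"),
    ("u3", "still #spd+"),
]

# A party hashtag optionally followed by + or -, e.g. #cdu, #cdu+, #csu-.
# The lookahead keeps "#cdu" from matching inside longer tags such as "#cduwatch".
TAG = re.compile(r"#(cdu|csu|spd)([+-]?)(?!\w)", re.IGNORECASE)

def party_metrics(messages):
    """Count hashtag mentions, signed mentions, and distinct users per party."""
    counts = defaultdict(lambda: {"mentions": 0, "positive": 0, "negative": 0})
    users = defaultdict(set)
    for user, text in messages:
        for party, sign in TAG.findall(text):
            party = party.lower()
            counts[party]["mentions"] += 1
            if sign == "+":
                counts[party]["positive"] += 1
            elif sign == "-":
                counts[party]["negative"] += 1
            users[party].add(user)
    return {p: {**c, "users": len(users[p])} for p, c in counts.items()}

metrics = party_metrics(messages)
print(metrics["cdu"])  # {'mentions': 3, 'positive': 1, 'negative': 1, 'users': 2}
print(metrics["spd"])  # {'mentions': 3, 'positive': 2, 'negative': 0, 'users': 2}
```

Even in this toy case, the metrics disagree about which party is "ahead": both have the same number of mentions and users, but differ in explicitly positive mentions. Which of these counts, if any, reflects support is exactly the interpretive question the text raises.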

This variety of considered metrics reflects the challenge in working with digital trace data discussed by Howison et al. (2011) under the term event data. We might have objective ways to count the events in which users mention political parties in a specific way but we have to interpret what this event tells us about their intentions or attitudes.

We then compared our Twitter-based metrics with the vote share of each party on election day and compared the resulting error with that of opinion polls. All chosen Twitter-based metrics performed far worse than estimates based either on the results of the previous federal election in 2009 or on polling results. Calculating metrics per day instead of aggregating them over the whole time period shows this error fluctuating strongly over the course of the campaign, with no apparent improvement.
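A comparison of this kind can be sketched as follows. All numbers below are invented placeholders for three fictional parties and are not results from the paper. The mean absolute error is used here as one common way to summarize prediction error, which is an assumption on my part since the text above does not name the error measure.

```python
def shares(counts):
    """Convert raw counts into percentage shares."""
    total = sum(counts.values())
    return {k: 100 * v / total for k, v in counts.items()}

def mean_absolute_error(predicted, actual):
    """Average absolute gap, in percentage points, between prediction and outcome."""
    return sum(abs(predicted[k] - actual[k]) for k in actual) / len(actual)

# Hypothetical values for three fictional parties (not data from the study).
mention_counts = {"party_a": 500, "party_b": 300, "party_c": 200}
poll_shares = {"party_a": 41.0, "party_b": 26.0, "party_c": 33.0}
vote_shares = {"party_a": 40.0, "party_b": 25.0, "party_c": 35.0}

twitter_mae = mean_absolute_error(shares(mention_counts), vote_shares)
poll_mae = mean_absolute_error(poll_shares, vote_shares)

print(round(twitter_mae, 2))  # 10.0
print(round(poll_mae, 2))     # 1.33
```

The same two functions can be applied to each Twitter-based metric in turn, and to daily instead of aggregated counts, to see how the error behaves across metrics and over time.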

Again, this finding connects back to the concerns raised by Howison et al. (2011) with the term longitudinal data. Mentions of parties happen over a long time period before an election. There are no fixed rules by which to decide over which time period to aggregate the mentions to calculate a metric to predict election outcomes. Any choices in this regard are therefore arbitrary.

We get a better sense of the nature of party mentions on Twitter by looking at their temporal distribution. The number of party mentions spiked in reaction to highly publicized campaign events. People commented on politics in reaction to campaign activities, candidate statements, and media coverage. Twitter mentions were therefore indicative of attention paid to politics by politically vocal Twitter users but did not allow us to infer the public’s attitudes or voting intentions.

The paper thus shows that there is indeed something we can learn about the world by looking at social media. It just might not be everything we tend to be interested in. In the case of Twitter and public opinion, it looks like Twitter data can show us what politically vocal people on Twitter were paying attention to and reacting to. But they did not allow for the inference of public opinion at large or the prediction of parties’ electoral fortunes. It is important to get the distinction right between what we wish digital trace data to tell us (or what other people claim they tell us) and what they actually can tell us, given the underlying data generating processes and their inherent characteristics. Only by addressing this challenge explicitly and transparently will work based on digital trace data mature and achieve greater recognition in the social science mainstream.

The two studies chosen here offer only a small spotlight on the possibilities in working with digital trace data. As was the case with the examples chosen to illustrate work based on text analysis, there are many other interesting and instructive studies out there. So please do not stop with the studies discussed here, but read broadly to get a sense of the variety of approaches and opportunities open to you in working with digital trace data.

1.5 Learning about the world with computational social science

Computational social science offers us new ways to learn about the world. New data sources emerge, either made available through digitization or bursting into existence through digitalization. New computational methods become available to social scientists. And people coming from different disciplinary backgrounds develop new interests in social phenomena and human behavior. This makes CSS a promising interdisciplinary area: a crossroads where people from different scientific backgrounds who share interests in social phenomena and human behavior meet.

Thus it comes as no surprise that CSS goes by many names and has many relations: Some think of it in topical subfields, such as Computational Communication Science (CCS), data science, or sociophysics. Others think of it not as a field but primarily see it through associated methods, such as agent-based modeling, network analysis, or computational text analysis. For others still, it is simply a subfield of applied computer science concerned with social systems and phenomena. Each of these perspectives comes with specific insights, strengths, and contributions, too many in fact to be given appropriate space in this and the preceding episodes.

Still, there are shared concerns among researchers “who develop and test theories or provide systematic descriptions of human, organizational, and institutional behavior through the use of computational methods and practices” (Theocharis & Jungherr, 2021, p. 4). Those include facing the conceptual challenges of translating social science theories and interests into computational concepts and operationalizations, connecting new data sources with established theories or interests while remaining open to new phenomena and behavioral patterns, and integrating practices and workflows from different disciplines in order to capitalize on the new opportunities emerging from interdisciplinary efforts. Every scholar and every interdisciplinary team across the multitude of CSS subfields faces these concerns and challenges to different degrees. There is value in remaining aware of the shared roots, concerns, and challenges of computational social science instead of splintering too early into subfields driven by topical interests or methods. This splintering would risk diluting the collective attention and effort devoted to discussing and working through the challenges of CSS.

Computational social science in all its facets offers scientists new ways to learn about the world and the impact of digital media on politics and society in particular. True, the emotional response to the incessant interdisciplinary efforts in shaping, contesting, and improving goals, methods, and practices of CSS can suddenly shift from exhilaration to exhaustion. For anyone, working in CSS means moving out of their comfort zone of field-specific theories, methods, practices, and workflows. Truly lived scientific interdisciplinarity is challenging, but at the same time also exciting and promising. As in every new area of research, work in CSS is characterized by uncertainties. At the same time, however, it is an area that, precisely because of its freshness and the open questions associated with it, is highly dynamic thematically, theoretically, and methodologically. This makes it, without question, one of the most exciting and rewarding areas of social science today.

1.6 Review questions

  1. Please define computational social science following Theocharis & Jungherr (2021).

  2. Please discuss critically the promise and challenges of computational social science.

  3. Please sketch the typical steps of the project pipeline in computational social science and discuss critically some of the decisions researchers face along the way.

  4. Please discuss the data generating process of Twitter and its consequences for what type of question Twitter data is suited and what type it is not.

  5. Please discuss why we can expect to find insight in social and political phenomena by the computationally assisted analysis of text? What are potentials? What are limitations?

  6. Please define the term digital trace data following Howison et al. (2011).

  7. Please discuss along the lines of Howison et al. (2011) for what type of research project digital trace data are suited and for what type they are not?

  8. What questions do you have to answer to ensure your work with digital trace data is valid according to Howison et al. (2011)?

  1. For more on digital trace data see Howison et al. (2011); Golder & Macy (2014); Jungherr (2015); Salganik (2018).↩︎

  2. For more on sensor data see Pentland (2008); Stopczynski et al. (2014).↩︎

  3. For more on the potential of digitized data corpora see Piper (2018); Underwood (2019); Cirone & Spirling (2021).↩︎

  4. For accounts of coming transformations see Watts (2011); Hofman et al. (2017); González-Bailón (2017).↩︎

  5. For why context-dependency is not a bug but a feature of the social sciences see Flyvbjerg (2001); Elster (2007/2015); Gerring (2012).↩︎

  6. For more examples for the analysis of news articles see Barberá et al. (2021). For literature corpora see Piper (2018); Underwood (2019). For television news casts see Jürgens et al. (2022). For political ads see Schmøkel & Bossetta (2022). For parliamentary speeches see Rauh & Schwalbach (2020). For digitized historical data see Cirone & Spirling (2021). For the theory-CSS disconnect see Jungherr & Theocharis (2017); Jungherr (2019).↩︎

  7. The following three paragraphs follow Theocharis & Jungherr (2021), p. 4 f.↩︎

  8. For a helpful introduction to doing research in computational social science see Salganik (2018).↩︎

  9. For the estimation of environmental conditions based on satellite images see Brandt et al. (2020). For the connection between online communication and news coverage see Jungherr (2014); Wells et al. (2016). For the reaction to external events on digital media see Zhang et al. (2019).↩︎

  10. For experimental evidence on causal effects of specific design decisions in digital environments see Salganik et al. (2006); Salganik & Watts (2009). For evidence on causal effects of interventions see Bail et al. (2018); Munger (2017). For dangers of purely correlative designs in data rich environments see Burton et al. (2021). For simulation see Epstein (2006); Macy & Willer (2002); Miller & Page (2007).↩︎

  11. For more on simulations see Macy & Willer (2002); Epstein (2006); Miller & Page (2007).↩︎

  12. For more on data and data quality in CSS see Posegga (2023).↩︎

  13. For an account of shifting rules of API see van der Vlist et al. (2022).↩︎

  14. For more on data management for social scientists see Weidmann (2023).↩︎

  15. For a discussion of different data generating processes of different sources of digital media see Jungherr & Jürgens (2013). For a detailed discussion of Twitter’s data generating process and its consequences for work based on Twitter-data see Jungherr et al. (2017).↩︎

  16. For the need of indicator validation in computational science and the need to theorize the connection between digital signals and phenomena of interest see Howison et al. (2011); Jungherr et al. (2016); Jungherr (2019). For mistakes driven by face-validity see Lazer et al. (2014); Jungherr et al. (2017); Burton et al. (2021).↩︎

  17. For more on the analysis of text in CSS see Benoit (2020); Gentzkow et al. (2019); Grimmer et al. (2022). For the analysis of images see Williams et al. (2020); Schwemmer et al. (2023).↩︎

  18. For more on network analysis see Wasserman & Faust (1994); Easley & Kleinberg (2010); Howison et al. (2011); DellaPosta et al. (2015); Granovetter (2017).↩︎

  19. For more on the potential of combined data sets of different types see Stier et al. (2020). For challenges and limits to this approach see Jürgens et al. (2020).↩︎

  20. On replication and open science see Angrist & Pischke (2010); Open Science Collaboration (2015); Christensen et al. (2019); Wuttke (2019).↩︎

  21. On the practice of computational social science and the establishment of research teams see King (2011); Gilardi et al. (2022); Windsor (2021).↩︎

  22. For general accounts of computationally-assisted text analysis in the political sciences see Benoit (2020); Grimmer et al. (2022). For the collection and analysis of parliamentary debates see Schwalbach & Rauh (2021). For text analysis and event detection see Beieler et al. (2016). For text analysis of news coverage see Barberá et al. (2021). For text analysis in literature or philosophy corpora see Piper (2018); Underwood (2019).↩︎

  23. For a large collection of party manifestos across countries and time see Merz et al. (2016). For collections of parliamentary debates see Rauh & Schwalbach (2020).↩︎

  24. Since news content is protected under copyright law, establishing big publicly available text corpora of news coverage is difficult. Luckily, various news organizations have started providing standardized access to their archives, which enables the reliable collection and analysis of their respective coverage. See for example New York Times. Content of other news organizations can be accessed through third-party providers or specific licensing deals. For event data see https://www.gdeltproject.org. For an overview of agenda setting research see McCombs & Valenzuela (2004/2021). For examples of discourse analyses based on news coverage see Ferree et al. (2002); Entman (2004); Baumgartner et al. (2008); Benson (2014).↩︎

  25. For an overview of computer-assisted text analysis in political science see Grimmer et al. (2022). For practical advice on how to do computational text analysis in R see Silge & Robinson (2017); Hvitfeldt (2022), for quanteda an R package of high popularity with political scientists see Benoit et al. (2018), for Python see Bengfort et al. (2018); Lane & Dyshel (2022).↩︎

  26. Gessler & Hunger (2022).↩︎

  27. To get a better sense of how to evaluate the quality of competing text analysis methods see the online appendix of Gessler & Hunger (2022).↩︎

  28. For more on Wordscores see Laver et al. (2003); Lowe (2008).↩︎

  29. See Müller (2022).↩︎

  30. For the Manifesto Corpus see Merz et al. (2016). For detailed information of the validation and comparison between the different classification approaches see the online appendix to Müller (2022).↩︎

  31. For more on the Linguistic Inquiry and Word Count (LIWC) sentiment dictionary see Tausczik & Pennebaker (2010).↩︎

  32. See Barron et al. (2018). For the corpus see The French Revolution Digital Archive: https://frda.stanford.edu.↩︎

  33. For latent Dirichlet allocation (LDA) see Blei et al. (2003).↩︎

  34. For details on the operationalization of novelty and transience see the online appendix to Barron et al. (2018). For details on the Kullback-Leibler Divergence (KLD) see Kullback & Leibler (1951).↩︎

  35. For research in digital communication environments in general see Salganik (2018). For digital trace data in particular see Howison et al. (2011); Golder & Macy (2014); Jungherr (2015).↩︎

  36. For more on accounting for data generating processes linking available signals to concepts of interest see Jungherr & Jürgens (2013); Jungherr et al. (2016); Jungherr (2015).↩︎

  37. For accounts of how digital media impact politics see Jungherr et al. (2020), the public arena see Jungherr & Schroeder (2022), for their impact on discourses see Jungherr et al. (2019).↩︎

  38. For examples of the study of campaigns with digital trace data see Jungherr (2015); Jungherr et al. (2022), protest see González-Bailón et al. (2011); Jungherr & Jürgens (2014); Theocharis et al. (2015), for civil society see Theocharis et al. (2017).↩︎

  39. For examples of the integration of survey and digital trace data see Subhayan Mukerjee (2018); Jürgens et al. (2020); Scharkow et al. (2020), for agenda dynamics see Neuman et al. (2014); Posegga & Jungherr (2019); Gilardi et al. (2021).↩︎

  40. For Facebook see Stier et al. (2017), for Instagram see Kargar & Rauchfleisch (2019), for Reddit see An et al. (2019), for Sina Weibo see Chen et al. (2013), for Twitter see Gaisbauer et al. (2021), for WeChat see Knockel et al. (2020), for YouTube see Rauchfleisch & Kaiser (2020), for online petitions see Jungherr & Jürgens (2010).↩︎

  41. For examples for studies examining government control with digital trace data see Chen et al. (2013); Lutscher et al. (2020); Tai & Fu (2020); Lu & Pan (2021).↩︎

  42. For estimating political alignment of social media users based on digital trace data see Barberá (2015). For the use of digital trace data to infer public opinion and predict elections see Beauchamp (2017). For critical accounts see Jungherr et al. (2016); Jungherr et al. (2017); Rivero (2019).↩︎

  43. See Tai & Fu (2020).↩︎

  44. For correlated topic models see Blei & Lafferty (2007). For random forests see Breiman (2001).↩︎

  45. See Auxier & Anderson (2021).↩︎

  46. See Anstead & O’Loughlin (2015); McGregor (2019); McGregor (2020).↩︎

  47. See Jungherr et al. (2017).↩︎