2.2. The computational social science project pipeline

Our discussion of computational social science and its promises and challenges has remained rather abstract. It is time to turn to CSS as a practice. For this, let's have a look at the typical CSS project pipeline. While CSS projects come in a stunning variety of data sets used, methods employed, and questions asked, more often than not, these projects share a pipeline of tasks, problems, and decisions that is typical for CSS. Examining this pipeline allows us to think about engaging in CSS as a practice, while at the same time providing you with a blueprint for potential research projects that might lie in your future.

The typical pipeline for computational social science consists of the following steps:

  • research design,

  • data collection,

  • data preparation,

  • linking signals in data to phenomena of interest,

  • data analysis, and

  • presentation.

Let's have a look at each of these steps in detail.

2.2.1. Research design

As with any research project in the social sciences, projects in computational social science should start with a research design. Researchers must ask themselves how to go about answering a specific question in a reliable, transparent, and intersubjective way. This can include questions testing a theoretically expected causal mechanism between two phenomena, exploratory questions about new phenomena for which no plausible prior theoretical expectations exist, or the systematic description of phenomena or behavior. The nature of the question then dictates the choice of data, method, and process.

To date, some of the greatest successes of computational social science lie in the description of social phenomena and characteristics of groups and individuals. The best of these studies showcase the impressive measurement opportunities of CSS, such as the estimation of environmental conditions in hard-to-reach areas based on satellite images, or the interconnection between online communication and outside factors, such as media coverage or external events. CSS as a field has been less interested, and so far less successful, in connecting findings systematically to theoretical frameworks in the social sciences, providing explanations or causal mechanisms for patterns identified, advancing new systematic theories, or even connecting digital signals robustly to concepts or phenomena of interest. This gap offers interesting perspectives for new research designs.

CSS has to transition from its early stage of producing predominantly isolated empirical findings to a more mature stage in which studies are more consciously connected with theoretical frameworks, allowing the field to speak more actively to the debates in the broader social sciences trying to make sense of underlying phenomena. This might mean treating predominantly diagnostic efforts as only a first step and focusing researchers' attention more actively on connecting digital signals to meaningful concepts and starting to work on explaining patterns found in data based on causal mechanisms. This might also mean extending concepts and theories currently in use among social scientists for the conditions found in online communication spaces while at the same time remaining mindful of relevant research interests and frameworks in traditional social science.

While most work in computational social science follows predominantly descriptive empirical approaches, such as the analysis of text, image, or behavioral data, there are other approaches that offer different types of insights. Experiments are one example. By deliberately manipulating actual or simulated digital communication environments, researchers can identify causal effects of specific design decisions or targeted interventions. Even if this approach is effortful in terms of design and implementation, it offers great potential for gaining knowledge.

The need for experimental research designs has recently been illustrated by Burton, Cruz, and Hahn [2021]. The authors show that in data-rich contexts, such as those found in work with digital trace data, many different explanatory models fit the data: some that conceivably might be true, others that are obviously meaningless. This raises the danger that by using purely correlative research designs in CSS, researchers might fool themselves into believing patterns support their theory of interest while in fact falling for spurious correlations emerging from large and rich data sets. The presence of large data sets makes careful research designs more important, not less.
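To get a feeling for how easily rich data produce spurious patterns, consider the following small simulation. It is a minimal sketch and not a reproduction of the analysis by Burton, Cruz, and Hahn: it only illustrates the simpler, related point that an outcome consisting of pure noise will correlate noticeably with some signal once enough unrelated candidate signals are screened. The function names, the number of cases, and the number of signals are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_cases, n_signals, seed=0):
    """Largest absolute correlation between a random outcome and any of
    n_signals random signals that are unrelated to it by construction."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    best = 0.0
    for _ in range(n_signals):
        signal = [rng.gauss(0, 1) for _ in range(n_cases)]
        best = max(best, abs(pearson(signal, outcome)))
    return best


# Same outcome and same first signal in both runs (identical seed);
# the second run simply screens many more candidate signals.
few = strongest_chance_correlation(n_cases=30, n_signals=1)
many = strongest_chance_correlation(n_cases=30, n_signals=1000)
print(f"1 candidate signal:     max |r| = {few:.2f}")
print(f"1000 candidate signals: max |r| = {many:.2f}")
```

None of the signals has anything to do with the outcome, yet the best of a thousand will look like a respectable finding. A research design that fixes the signal of interest before looking at the data is protected against this; one that searches the data for whatever correlates is not.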

Another alternative method is theory-driven simulation or modeling of social systems or individual behavior. This approach has lost relative influence in the course of the rapidly increasing data availability through social media services. Nevertheless, the strongly theory-driven background of this approach offers a promising alternative to the often predominantly data-driven exercises of research based on social media data.

2.2.2. Data collection

After settling on a research design and choosing the appropriate data to answer your question, the fun of getting data starts. It is no accident that the discussion about the alleged wealth of digital data in the social sciences often elegantly skips over questions of whether and how these data can be collected, processed, managed, and made available. In fact, data collection and processing are often the most time-consuming, complicated, and at the same time least visible and most thankless tasks in computational social science.

Data collection in computational social science has become more complicated over time. This is due to data from digital media becoming more difficult to collect and to rising scientific standards in working with said data. In the early phases of CSS, the topic of data collection often took a backseat and the procurement of social media data was often enabled by companies running digital services. Over time, however, social media services have become significantly more restrictive in terms of the data access they allow outsiders. At the same time, there was growing awareness in academia that even the generous provision of social media data via official interfaces only provided a fraction of the data necessary for answering demanding research questions. Additionally, CSS practitioners found themselves challenged that their focus on a few well-researched platforms, such as Twitter, would allow for only limited statements about digital communication, human behavior, or societies. This raised the call for more cross-platform research, again raising the demands for data collection and preparation. While these challenges are valid, one cannot help but note that they are often raised by people skeptical of computational work to begin with, if not of quantitative methods in general.

Overall, this means that data collection and processing for CSS projects have become significantly more complicated. Different data sources often have to be monitored continuously over long periods of time. Some of these can be queried via official interfaces, so-called Application Programming Interfaces (APIs) (e.g. Facebook, Twitter, or Wikipedia), while access to other data sources (e.g. individual websites) demands specially adapted software solutions. Both approaches are complicated and prone to different types of errors. With long-term data collections, there is a risk, among other things, that APIs or non-standardized data sources change unnoticed. Accordingly, continuous quality assurance is necessary, which can demand a significant investment of resources and time. Overall, the increasing demands on the breadth, scale, and quality of data collection increasingly require the development of research software adapted to the respective project. They can no longer be met with relatively little programming effort and access to isolated APIs.
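What continuous quality assurance can look like in practice is sketched below. The idea is to compare every collected batch against the record structure you expect and to count deviations, so that a silent change in a source shows up in a report instead of months later in your analysis. The field names in EXPECTED_FIELDS and the example records are hypothetical and do not describe the format of any actual platform.

```python
# Hypothetical schema: the fields and types we expect in every collected record.
EXPECTED_FIELDS = {"id": str, "created_at": str, "text": str}


def check_record(record, expected=EXPECTED_FIELDS):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            found = type(record[field]).__name__
            problems.append(f"unexpected type for {field}: {found}")
    for field in record:
        if field not in expected:
            problems.append(f"new field: {field}")
    return problems


def drift_report(records, expected=EXPECTED_FIELDS):
    """Count how often each kind of problem occurs in a batch of records."""
    counts = {}
    for record in records:
        for problem in check_record(record, expected):
            counts[problem] = counts.get(problem, 0) + 1
    return counts


# Invented example batch: one clean record and two that deviate.
batch = [
    {"id": "1", "created_at": "2021-03-01T10:00:00Z", "text": "hello"},
    {"id": 2, "created_at": "2021-03-01T10:05:00Z", "text": "id is now a number"},
    {"id": "3", "text": "timestamp missing", "lang": "en"},
]
report = drift_report(batch)
for problem, count in report.items():
    print(f"{count} x {problem}")
```

Run after every collection cycle, a check like this turns "the source changed unnoticed" into "the source changed and we were told the same day". In a real project you would likely also track the volume of collected records over time, since a sudden drop is often the first sign of trouble.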

2.2.3. Data preparation

Even less well discussed than data collection are issues for computational social science projects arising from the preparation of data for analysis. While APIs provide clearly structured data, unstructured data from less standardized sources must first be structured after collection. This usually requires the transfer of raw data into database structures developed for the research project. Most research projects also require semi- or fully automated labeling steps in which individual data points are supplemented with metadata (e.g. by coding text according to interpretative categories). In the case of extensive projects, these labels must be secured and stored together with the originally collected data and made available for further analysis. The use of different software in various steps of data preparation, such as collection, structuring, and annotation, complicates this aspect. The design of database structures and work processes that ensure a consistent and high-performance infrastructure for the analysis of complex data sets is not trivial and often requires more than rudimentary knowledge of modeling the corresponding database structures. Additional knowledge of software development using various libraries and technologies is often required as well.
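To make this more concrete, here is a minimal sketch of one possible database layout using SQLite, which ships with Python. The raw record is stored unchanged, and labels from human coders or automated classifiers live in a separate table that points back to it. The table names, column names, and example values are assumptions made for illustration; a real project will need a richer design.

```python
import json
import sqlite3

# In-memory database for illustration; a real project would use a file or server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_items (
        item_id      TEXT PRIMARY KEY,
        source       TEXT NOT NULL,
        collected_at TEXT NOT NULL,
        payload      TEXT NOT NULL   -- original record, stored unchanged as JSON
    );
    CREATE TABLE annotations (
        item_id  TEXT NOT NULL REFERENCES raw_items(item_id),
        coder    TEXT NOT NULL,      -- human coder or name of a classifier
        category TEXT NOT NULL,
        PRIMARY KEY (item_id, coder)
    );
""")


def store_item(item_id, source, collected_at, record):
    """Store one raw record exactly as collected."""
    conn.execute(
        "INSERT INTO raw_items VALUES (?, ?, ?, ?)",
        (item_id, source, collected_at, json.dumps(record)),
    )


def annotate(item_id, coder, category):
    """Attach one label to a stored item without touching the raw data."""
    conn.execute(
        "INSERT INTO annotations VALUES (?, ?, ?)", (item_id, coder, category)
    )


# Invented example item with one human and one automated label.
store_item("a1", "example_forum", "2021-03-01T10:00:00Z",
           {"text": "Taxes are too high"})
annotate("a1", "coder_1", "economy")
annotate("a1", "classifier_v1", "economy")

rows = conn.execute("""
    SELECT r.item_id, r.payload, a.coder, a.category
    FROM raw_items r JOIN annotations a ON a.item_id = r.item_id
    ORDER BY a.coder
""").fetchall()
for row in rows:
    print(row)
```

Keeping raw data and annotations apart has a practical benefit: you can re-run or replace a labeling step at any time without risking the originally collected material, and you can always trace which coder or tool produced which label.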

2.2.4. Linking signals in data to phenomena of interest

The next step in computational social science projects follows the research design and runs parallel to data collection and preparation for analysis. It consists of establishing the connection between signals visible in data and phenomena of interest. Examples might be specific interaction patterns between Reddit users as an expression of political polarization in society, or mentions of politicians on Twitter as being indicative of their subsequent electoral fortunes. It is important for researchers to critically interrogate their data on whether these signals are actually connected with the phenomenon of interest.

Data emerge based on different data generating processes. For example, publicly available Twitter messages are the result of a complicated filtering process leading a user to post a tweet referring to specific topics or persons. Twitter is a performative medium documenting objects of attention or opinions the specific subset of people active on Twitter want to publicly be seen as interacting with or referring to. This makes Twitter a powerful tool to understand dynamics of public attention of politically vocal Twitter users but probably not a tool to understand public opinion in society overall. Data collected on other digital media come with different data generating processes that need to be reflected in the interpretation of identified patterns.

This also means CSS needs to get serious about indicator validation. Today, much of CSS relies on face validity. If a digital signal seems to reasonably reflect a phenomenon of interest, no systematic validation tests are undertaken. This allows the quick production of seemingly meaningful findings, speaking to contemporary concerns in public debate. Yet, there is the serious danger of mistaking digital signals for phenomena they are not actually documenting, for example mistaking signs of attention to politics for political support or predicting the flu by looking for signs of winter. Measurement in the social sciences often means searching for evidence of latent variables that have no direct, objectively measurable expression. This demands active reflection on, theorizing about, and testing of whether identifiable signals can be reasonably expected to express a concept of interest. This makes reliance on face validity in the social sciences dangerous and prone to error.
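One simple way to move beyond face validity is to hand-code a sample and check how often the digital signal and the human judgment coincide. The sketch below does this for the example of mentions and support. The eight records are invented purely to show the mechanics; they are not findings, and the field names are hypothetical.

```python
def precision_recall(items, signal_key, truth_key):
    """Compare a digital signal against human coding of the same items.

    precision: share of items flagged by the signal that coders confirmed
    recall:    share of coder-confirmed items that the signal flagged
    Returns None for a measure whose denominator is zero.
    """
    flagged = [item for item in items if item[signal_key]]
    confirmed = [item for item in items if item[truth_key]]
    hits = [item for item in flagged if item[truth_key]]
    precision = len(hits) / len(flagged) if flagged else None
    recall = len(hits) / len(confirmed) if confirmed else None
    return precision, recall


# Invented validation sample: does a message mention the politician (signal),
# and did a human coder read it as expressing support (coding)?
validation_sample = [
    {"mentions_politician": True, "coded_support": True},
    {"mentions_politician": True, "coded_support": True},
    {"mentions_politician": True, "coded_support": False},
    {"mentions_politician": True, "coded_support": False},
    {"mentions_politician": True, "coded_support": False},
    {"mentions_politician": True, "coded_support": False},
    {"mentions_politician": False, "coded_support": True},
    {"mentions_politician": False, "coded_support": False},
]

precision, recall = precision_recall(
    validation_sample, "mentions_politician", "coded_support"
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy sample, most mentions do not express support. A mention count would therefore measure attention, not approval. The point is not the specific numbers but the habit: before interpreting a signal, test it against an independent assessment of the concept you care about.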

2.2.5. Data analysis

The next step in typical computational social science projects, data analysis, is much better documented and well discussed than the previous stages. There is a wide variety of methods available in CSS. The use of methods naturally follows choices in research design and the demands and opportunities connected with the available data. Later, we will be examining some typical analytical approaches within CSS in greater detail, so here I will only briefly mention some analytical approaches and choices available to you.

One typical approach is automated or semi-automated content analysis of different digital corpora. For example, the computationally assisted analysis of text in the social sciences is already very well established and is increasingly complemented by the use of more advanced machine learning methods and (semi-)automated analysis of image data. These analyses can be very rudimentary, for example by identifying and counting the occurrences of specific words in text. Or they can be more demanding, for example by looking for expressions of latent concepts (such as political ideology) in speech or interaction patterns. Analyses can be performed by human coders or automatically. Still, independent of the choice of a simple or demanding analytical target or of automated versus human coding, these studies have to address fundamental questions of coding validity and reliability that have been well established in the literature on content analysis.
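Both ends of this range can be sketched in a few lines. The first function below is the rudimentary case: counting occurrences of chosen words. The second computes Cohen's kappa, a standard measure of agreement between two coders that corrects for agreement expected by chance. The example text, keywords, and codes are invented, and the simple tokenizer is an assumption that will not suit every language or corpus.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Count how often each keyword occurs as a whole word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {keyword: counts[keyword] for keyword in keywords}


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty item list")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    # Agreement expected if both coders assigned categories independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented example text and keyword list.
text = "Taxes, taxes, and more taxes: the budget debate goes on."
keyword_counts = count_keywords(text, ["taxes", "budget", "climate"])
print(keyword_counts)

# Invented codes from two coders for the same ten items.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

The two coders agree on eight of ten items, but since half of that agreement would be expected by chance alone, kappa is noticeably lower than the raw agreement rate. The same check applies when one of the "coders" is an automated classifier.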

Another approach closely associated with CSS is network analysis. Social network analysis is a long-established research practice in social science with a rich body of theories and methods. Methods of network analysis allow the investigation of different relational structures between human (or non-human) actors, often with the aim of understanding the meaning and effects of these structures in different application areas. Instead of an "atomistic" research perspective that sees people primarily as isolated individuals, network analysis pursues a "relational" perspective that takes people's relationship structures seriously and by mapping them tries to identify their impact. Network analysis is a very prominent approach in CSS. On the one hand, this is due to the fact that digital communication as such is fundamentally closely linked to the concept of networking and networks. Since corresponding societal processes and individual usage behavior are responsible for much of the digital trace data used in CSS, it is no wonder that network analysis is an obvious choice in the analysis of these data. However, this seemingly intuitive proximity often obscures the necessary interpretative steps involved in analyzing networks based on digital trace data.
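A minimal sketch shows how quickly digital traces turn into a network, and where the interpretative steps hide. Here, invented reply records are turned into weighted directed edges, from which we compute how many distinct users reply to each user and the share of ties that are reciprocated. The user names and function names are hypothetical.

```python
from collections import defaultdict

# Invented trace data: (author of the reply, user being replied to).
replies = [
    ("ana", "ben"),
    ("ben", "ana"),
    ("cam", "ana"),
    ("dee", "ana"),
    ("dee", "cam"),
    ("cam", "ana"),
]


def build_weighted_edges(interactions):
    """Collapse repeated interactions into directed edges with counts."""
    weights = defaultdict(int)
    for source, target in interactions:
        weights[(source, target)] += 1
    return dict(weights)


def in_degree(edges):
    """Number of distinct users who replied to each user."""
    degree = defaultdict(int)
    for _source, target in edges:
        degree[target] += 1
    return dict(degree)


def reciprocity(edges):
    """Share of directed edges whose reverse edge also exists."""
    if not edges:
        return 0.0
    mutual = sum(1 for source, target in edges if (target, source) in edges)
    return mutual / len(edges)


edges = build_weighted_edges(replies)
print(in_degree(edges))
print(f"reciprocity = {reciprocity(edges):.2f}")
```

Note what the code quietly decided: that a reply constitutes a tie, that two replies count as one tie with a higher weight, and that receiving many replies means something. Whether a high in-degree indicates influence, controversy, or merely visibility is not answered by the computation. That is the interpretative work.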

Increasingly, there are also studies that connect different data types. For example, some studies connect people's survey responses to their digital traces (such as web tracking data). The benefit of studies following this approach is the opportunity to offset some of the limitations of using only one data type. For example, simply relying on people's survey responses on what type of news media they claim to have consumed is prone to error. People forget, misremember, or might not admit to consuming specific media. On the other hand, inferring people's political leaning or opinions simply based on digital traces is also fraught. Combining both data types might in principle provide a broader picture of people's online behavior, their information exposure, and the effects of both. Other studies combine data collected on different digital media platforms, following a similar research logic. But while offering a broader view into some questions, these combined approaches bring other drawbacks that need to be critically reflected on and accounted for.
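The logic of such a linkage can be sketched as follows. Survey answers and tracking records are matched by a respondent identifier, and self-reported news use is compared with observed visits to news sites. Everything here is invented for illustration: the respondent IDs, the domain names, and the list of what counts as a news domain.

```python
# Hypothetical classification of domains as news sources.
NEWS_DOMAINS = {"news.example"}

# Invented survey answers: did the respondent report reading online news?
survey = {
    "r1": {"reports_news_use": True},
    "r2": {"reports_news_use": True},
    "r3": {"reports_news_use": False},
    "r4": {"reports_news_use": False},
}

# Invented tracking data: domains visited by each tracked respondent.
tracking = {
    "r1": ["news.example", "shop.example"],
    "r2": ["shop.example"],
    "r3": ["news.example"],
    "r5": ["news.example"],
}


def compare_sources(survey_data, tracking_data, news_domains=NEWS_DOMAINS):
    """Compare self-reported and observed news use for linkable respondents.

    Returns the comparison per respondent and the set of respondents
    that appear in only one of the two data sources.
    """
    linked = {}
    for respondent in sorted(survey_data.keys() & tracking_data.keys()):
        reported = survey_data[respondent]["reports_news_use"]
        observed = any(d in news_domains for d in tracking_data[respondent])
        if reported == observed:
            linked[respondent] = "consistent"
        elif reported:
            linked[respondent] = "reported but not observed"
        else:
            linked[respondent] = "observed but not reported"
    unmatched = survey_data.keys() ^ tracking_data.keys()
    return linked, unmatched


linked, unmatched = compare_sources(survey, tracking)
print(linked)
print("only in one source:", sorted(unmatched))
```

The sketch also hints at the drawbacks. Respondents who appear in only one source drop out of the analysis, and "not observed" may simply mean that the news was read on an untracked device. Neither data type is the ground truth for the other.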

2.2.6. Presentation: Ensuring transparency and replicability

The final step of any computational social science project is the presentation of its findings. I will not bore you with generalities about the writing and publication process. Instead, let us focus on one crucial element in finalizing a project: providing transparency about your choices and making sure it is replicable by other researchers.

In many social sciences we find important movements that push for the development and institutional adoption of more transparent research practices allowing for a more reliable interrogation of research findings by third parties while at the same time limiting primary researchers' degrees of freedom in adjusting research questions and designs after knowing the outcomes of data analyses. Proposed remedies include systematically providing public access to data sets, the publication of code underlying data preparation and analysis, and pre-registration of planned research designs and analytical protocols. While the importance of this program is recognized in fields such as economics, political science, or psychology, it is largely lacking within CSS.

There are two areas systematically introducing opaqueness into computational social science:

  • data underlying research projects, and

  • transparency with regard to the robustness and inner workings of advanced methods.

One of the central selling propositions of CSS is its use of large and rich data sets. These data sets often stem from commercial online platforms. Accordingly, they come with various concerns regarding the privacy of users whose behavior is documented in them and the intellectual property rights of the companies providing researchers with access to them. This brings two challenges: first, how do we ensure access to relevant data for researchers in the first place; and second, once access has been granted, how can researchers provide others with access to said data to double-check their findings? In these cases, rules set by platforms governing access to proprietary data can serve as a cloaking device, rendering the data underlying highly visible CSS research opaque. Here, the field has to become more invested in developing data transparency standards and processes. This might mean pushing back against some of the often arbitrary rules and standards of data access set by platform providers. Those are often designed with commercial uses in mind and serve primarily to protect the business interests of platforms and their public image instead of serving the interests of their users or society at large by enabling reliable and valid scientific work.

Another area of opaqueness in CSS arises from the use of advanced computational methods in an interdisciplinary context. The different disciplines at the intersection of CSS come with different strengths and sensibilities. While computer scientists are typically comfortable with and skilled in software development and the use of quantitative methods, social scientists are typically more interested in addressing actual social questions instead of predominantly technical ones. This brings the danger of scientists primarily driven by interests and sensibilities in social problems using analytical tools provided by computationally minded colleagues without critically reflecting on these tools' inner workings and boundary conditions. In the worst case, this can lead to social scientists misdiagnosing social phenomena based on an uncritical and unreflective use of computational tools and quantitative methods.

At the same time, the development of robust methods in CSS is hampered by a prototype-publication culture. Researchers are incentivized to publish innovative methods which, once published, are treated as proven by the field. Critical testing of methods and their implementations in code across varying contexts is currently not encouraged by the publication practices of the leading conferences and journals in the field. This inhibits the development of a robust collective effort to validate methods and measures.

Already this brief sketch of the typical CSS project pipeline shows the diversity and richness of computational social science. The field is defined neither by specific data types nor by specific analytical methods. Rather, CSS is a broad research approach embracing different methods and perspectives. Individual researchers or even most mono-disciplinary teams cannot convincingly represent this diversity. The future of CSS lies in the interdisciplinary merger of the various social sciences, computer science, and natural sciences. This is easier said than done. As anyone who has tried it will tell you, interdisciplinary research is easy to talk about but difficult to practice. To get better at this, it is important to collect and document specific experiences of different projects or research teams. Some such accounts are starting to be published. But this can only be the beginning of a systematic reflection.