2.4. Digital trace data: Typical approaches in computational social science 2

2.4.1. Digital trace data

As we have seen, in computational social science there are great hopes and much enthusiasm connected with the availability of new data sources. One new data source features particularly prominently in these accounts: digital trace data.

Once people interact with digital devices (such as smartphones and other smart devices) and services (such as Facebook or Twitter), their digitally mediated interactions leave traces on these devices and services. Some of those traces are discarded, some are stored. Some are available only to the device maker or service provider, some are available to researchers. This last category of digital trace data, those that are stored and available to researchers, has spawned a lot of research activity and enthusiasm about a new measurement revolution in the social sciences. But somewhat more than ten years into this "revolution", the limits of digital trace data for social science research are becoming just as clear as their promises. Before we look at studies using digital trace data, it is therefore necessary to look a little more closely at what they are, what characteristics they share, and how this impacts scientific work with them.

In their 2011 article "Validity Issues in the Use of Social Network Analysis with Digital Trace Data", James Howison, Andrea Wiggins, and Kevin Crowston define digital trace data as:

"(...) records of activity (trace data) undertaken through an online information system (thus, digital). A trace is a mark left as a sign of passage; it is recorded evidence that something has occurred in the past. For trace data, the system acts as a data collection tool, providing both advantages and limitations. The task for using this evidence in network analysis is to turn these recorded traces of activity into measures of theoretically interesting constructs."

[Howison, Wiggins, and Crowston, 2011], p. 769.

Replace the term network analysis in the last sentence with social science and you have the crucial task in working with digital trace data before you. This translation of "traces" into "theoretically interesting constructs" requires accounting for specific characteristics of digital trace data. Back to Howison and colleagues:

"1) it is found data (rather than produced for research), 2) it is event-based data (rather than summary data), and 3) as events occur over a period of time, it is longitudinal data. In each aspect, such data contrasts with data traditionally collected through social network surveys and interviews."

[Howison, Wiggins, and Crowston, 2011], p. 769.

Again, replace social network with social science and you are good to go. It is probably best you go, find the article, and read this section of the text yourself, but let me briefly explicate the concerns expressed by Howison and colleagues as they relate to social science more generally.

First, digital trace data are found data. This makes them different from data specifically designed for research purposes. Usually, social scientists approach a question through a research design specifically developed to answer it. You are interested in what people think about a politician? You ask them. You are interested in how a news article shifts opinions? You design an experiment in which you expose some people to the article but not others and later ask both groups for their opinion on the issue discussed in the article. With digital trace data, you do not have that luxury. Instead, you often start from the data and try to connect it back to your interests. What do people think about a politician? Well, maybe look on Twitter and count her mentions. If you really get fancy, maybe run the messages through a sentiment detection method and count the mentions identified as positive and those identified as negative. Want to identify the effects of a news article? Check if people's traces change after exposure. Already these two examples show that found data can be used in interesting ways. At the same time, you often have to compromise in working with them. Thinking purely from the perspective of research design and identification strategy, digital trace data will often leave you frustrated as they simply might not cover what you need. On the other hand, once you come to terms with the signals available to you in found data, you might land on new questions and insights that you might have missed following a purely deductive approach. This is especially true for questions regarding the behavior of users in digital communication environments or the inner workings of said environments.

Second, digital trace data are event data. They document interactions and behavior in collections of single instances: liking a Facebook page, writing an @-message on Twitter, commenting on a post on Reddit, editing a Wikipedia page, clicking on a site, and so on. These events can carry a lot of information. For example, measuring the impact of an ad through clicks on featured links is a perfectly good approach, as far as it goes. But in social science, we often are interested not only in specific interaction or behavior events. Instead, we ask for the reasons underlying these events, such as attitudes or psychological traits. To get at these, we need to understand how users interpret the action that left a data trace. Take one of the examples from earlier: if we want to understand public opinion on a politician or a political topic, researchers often look for mentions of an actor or topic in digital trace data. But the motives for mentioning a politician or topic on a digital service vary. One could express support, critique, neutrally point to a topically connected event or quote, or one could try to be funny in front of friends and imagined audiences. Some of these motives might be identifiable by linguistic features around the term, others might not. Connecting the event visible in digital trace data, mentions of actors or topics, with the concept of interest, attitudes toward them, means taking into account the data generating process linking the documented event to the concept of interest. This step is crucial in the work with digital trace data but often neglected in favor of a naive positivism, de facto positing that digital traces do not lie and speak to everything we happen to be interested in.

Third, Howison, Wiggins, and Crowston [2011] point to digital traces demanding a longitudinal perspective. Data documenting singular events need to be aggregated in order to speak to a larger phenomenon of interest. But which aggregation rule is appropriate is open to question. For example, many sociologists are interested in the effects of friendship relations between people. Friendship is an interpretatively demanding concept, which raises many measurement problems. Some people interact often with people they consider friends. Others interact with people they consider friends only seldom but hold deep affection for and trust in them. Just looking at interactions in person, on the phone, or online will therefore not necessarily tell us who the people are that our subjects consider friends. Traditionally, sociologists would survey people and ask them directly whom they consider friends. They thereby have access to the result of the respondents' personal calculus over all interactions and experiences with a person, which results in their assessment of that person as a friend or not. Simply looking at digital trace data, such as email exchanges, public interactions on Twitter, or co-presence in space measured by mobile phones or sensors, only provides us with single slices of this calculus. This leaves us to guess the aggregation rule translating single events visible in digital trace data into the latent concept of interest. This is true for social relationships, such as friendship, but also for other concepts of interest, such as attitudes or traits. Researchers need to be very careful and transparent in how they choose to aggregate the events visible to them in digital trace data and take them as expressions of their concept of interest, especially if it is an interpretatively demanding concept.
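To make the role of the aggregation rule concrete, here is a minimal sketch, assuming a purely hypothetical log of contact events around one focal user ("ego"). Two equally defensible rules, a frequency threshold and a reciprocity requirement, end up nominating different alters as "friends":

```python
from collections import Counter

# Hypothetical event log: (date, sender, receiver) of digitally mediated contacts.
events = [
    ("2023-01-03", "ego", "anna"), ("2023-01-04", "anna", "ego"),
    ("2023-01-05", "ego", "ben"),  ("2023-01-20", "ego", "ben"),
    ("2023-02-01", "ego", "ben"),  ("2023-03-15", "ego", "clara"),
]

# Rule 1: nominate as "friend" every alter ego contacted at least twice.
outgoing = Counter(receiver for _, sender, receiver in events if sender == "ego")
friends_rule_1 = {alter for alter, n in outgoing.items() if n >= 2}

# Rule 2: nominate only alters with reciprocated contact (both directions observed).
contacted = {receiver for _, sender, receiver in events if sender == "ego"}
contacted_back = {sender for _, sender, receiver in events if receiver == "ego"}
friends_rule_2 = contacted & contacted_back

print(friends_rule_1)  # {'ben'}
print(friends_rule_2)  # {'anna'}
```

Both rules can be justified, yet they pick out different people. This is exactly the kind of consequential choice that needs to be made explicit and reported transparently.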

Finally, Howison, Wiggins, and Crowston [2011] emphasize that digital trace data are "both produced through and stored by an information system" (p. 770). This is important to remember. It means that both the recording of the data as well as access to it depend on the workings of said information system and the organization running it. For one, this means that it is important to differentiate between social and individual factors contributing to an event documented in a data trace and features of the underlying information system. An example of an individual factor leading to a data trace could be my support for a given candidate that makes me favorite a tweet posted on her Twitter feed. Alternatively, a system-level feature could be an algorithm showing me a tweet by said candidate prominently in my Twitter feed, in reaction to which I favorite it in order to read it later. These are two different data generating processes, driven by different motives, but they are not discernible by simply looking at the digital trace of the event.

The second consequence of the prominent mediating role of the information system is our dependence on its internal reasons for recording and providing access to data traces. Researchers depend on information systems and their providers for access and for the rules governing it. This access can be comparatively rich, as is currently the case with Twitter, or comparatively sparse, as is currently the case with Facebook. In any case, shifts in access possibilities and rules are always possible and do not need to follow coherent strategies. This makes research highly dependent on the organizations collecting and providing access to data and introduces a highly troubling set of concerns regarding the ethics of data access, conflicts of interest between researchers, organizations, and users, and the transparency and replicability of research findings.

While these challenges persist in the work with digital trace data, this new data type has found a prominent place in social and political science. Unfortunately, the degree to which these challenges are reflected in actual research varies considerably. Nevertheless, let's take a closer look.

2.4.2. Digital trace data in political science

Digital media have led to far-reaching changes in social life and human behavior. These new phenomena lead to new research questions, which digital trace data have been used to address. Areas in political science that have seen the strongest impact of digital media and research using digital trace data include the practice of politics, political communication, structures and dynamics in the public arena, and discourses.

Various studies using digital trace data focusing on politics and political communication have addressed the behavior of political elites, partisans, and the public. This includes the use of digital media in political campaigns, protests, or the coordination of citizens and civil society.

Other studies examine structures of the public arena, media use, and discourses in digital communication environments. This can mean, for example, investigating how people use the media. While the traditional approach would exclusively ask people in surveys about their media usage patterns, digital trace data offer powerful additional perspectives through greater reliability and greater resolution. An example of this are web tracking data that document website visits by respondents. In addition to new perspectives on the use of news sources, digital trace data also offer new perspectives on the influence of different media types and sources. Here, examining the agendas of the most important issues in different digital and traditional media and their mutual influence dynamics is a promising research area.

More specific to digital media still are studies focusing on usage dynamics and behavioral patterns of people in their use of digital services. This focus area lends itself especially well to study through digital trace data. Examples include studies focusing on Facebook, Reddit, Twitter, or YouTube. Beyond the study of commercial services, digital trace data have also successfully been used in studying the behavior of people on e-government services, such as online petitions.

Other studies are using digital trace data to examine the ways governments react to the challenge of digital media. In particular, authoritarian states see themselves increasingly challenged in their control of the public by digital media and the associated new possibilities for information and coordination of their population. Digital trace data have provided researchers with promising instruments for documenting and examining digital media provision and attempts at government control in different countries.

Beyond the study of phenomena directly arising from the use of digital devices and services, researchers are also trying to infer general phenomena based on signals in digital trace data. Examples include the estimation of the political alignment of social media users or the prediction of public opinion or election results. While often original and sophisticated in their use of methods, the validity of the resulting findings is contested since they often risk misattributing meaning to spurious correlations between digital traces and larger societal phenomena.

Now, let's look at two studies a little more closely to get a better sense of how work with digital trace data actually looks.

2.4.3. Making sense of online censorship decisions

The most direct way to use digital trace data is to learn about the digital communication environment they were collected in. But before your eyes glaze over in expectation of another starry-eyed discussion of hashtag trends on Twitter, there is more to this than the good, the bad, and the ugly of social media. For example, looking at what happens on digital media can tell us a lot about how states regulate speech or try to control their public. One important example for this is China.

By now, there are a number of highly creative and instructive studies available that use data collected on digital media to understand the degree, intensity, and determinants of Chinese censorship activity. In their 2020 paper "Specificity, Conflict, and Focal Point: A Systematic Investigation into Social Media Censorship in China" the authors Yun Tai and King-Wa Fu examine censorship mechanisms on WeChat.

WeChat is an important social media platform in China, which, in December 2019, was reported to have more than 1.1 billion monthly active users. It is an umbrella application that bundles many different functions for which Western users would have to use different applications. For example, WeChat allows, among other functions, blogging, private messaging, group chat, and e-payment. Users and companies can publish dedicated pages on which they can post messages and interact with others.

WeChat provides no standardized access to its data through an API, so the authors developed dedicated software to crawl the app, which they termed WeChatscope. The software uses a set of dummy accounts that subscribe to WeChat pages of interest. New URLs posted on these pages are saved and then visited and scraped hourly for 48 hours. At each visit by the crawler, the sites pointed to by the URLs are scraped, and meta-data and media content are downloaded and saved in a database. If a page disappears, the software saves the official reason for removal given by the platform. The reason "content violation" is given in cases where content is deemed a violation of related law and regulation.
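To give a rough idea of what such a collection pipeline involves, here is a minimal sketch. It is not the authors' actual WeChatscope code: the URL list, the revisit schedule implementation, and the removal detection are all simplifying assumptions made for illustration.

```python
import time
import requests

# Hypothetical list of article URLs harvested from subscribed public accounts.
tracked_urls = ["https://mp.weixin.qq.com/s/EXAMPLE_ARTICLE_ID"]

REVISIT_HOURS = 48  # each URL is revisited hourly for 48 hours


def snapshot(url):
    """Fetch an article and record whether it is still available."""
    response = requests.get(url, timeout=30)
    # The real removal notice is a Chinese-language platform message; checking
    # for a placeholder string here is a deliberate simplification.
    if response.status_code == 200 and "REMOVAL_NOTICE" not in response.text:
        return {"url": url, "status": "available", "html": response.text}
    return {"url": url, "status": "removed", "reason": response.text[:200]}


snapshots = []
for hour in range(REVISIT_HOURS):
    for url in tracked_urls:
        snapshots.append({"hour": hour, **snapshot(url)})
    time.sleep(60 * 60)  # wait one hour before the next crawl cycle
```

The key design idea is the repeated revisiting: only by comparing snapshots over time can the crawler observe that an article which was once available has since been removed, and record the stated reason.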

For their study, Yun Tai and King-Wa Fu collected 818,393 public articles on WeChat that were published between 1 March and 31 October 2018. These articles were posted by 2,560 public accounts. Of those, 2,345 articles were removed for "content violation". These articles are what the authors are interested in. More precisely, they are interested in how these articles censored by Chinese regulators differed from others. To find out, they first decided to pair each censored article with a non-censored article published on the same account that was topically as similar as possible to the censored article. To identify those pairs of censored/not-censored articles, the authors ran correlated topic models (CTM). This left them with 2,280 pairs of articles published on 751 accounts. To identify the potentially minute differences between articles that led to censorship, the authors used a random forest model. The approach is well suited for identifying meaningful signals in large numbers of input variables.
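The pairing step can be illustrated with a small sketch. The authors used correlated topic models; the sketch below substitutes scikit-learn's plain LDA as a stand-in, works on invented toy texts, and pairs each censored article with the uncensored article whose topic proportions are most similar.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in corpus; in the study, articles from the same public account are paired.
censored = ["protest gathering tomorrow at the square",
            "petition against the new regulation"]
uncensored = ["weather report for the weekend",
              "gathering of fans at the stadium square",
              "new regulation on food safety explained"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(censored + uncensored)

# The paper uses correlated topic models (CTM); LDA is used here only as a stand-in.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(dtm)                      # document-topic proportions
theta_cens, theta_unc = theta[:len(censored)], theta[len(censored):]

# Pair each censored article with its topically most similar uncensored article.
similarity = cosine_similarity(theta_cens, theta_unc)
pairs = [(i, int(similarity[i].argmax())) for i in range(len(censored))]
print(pairs)
```

Matching on topical similarity within the same account is what allows the later model to attribute differences between paired articles to censorship-relevant features rather than to topic or publisher.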

Using textual terms to predict censorship decisions, the authors identified "perilous" words, terms that appeared more frequently in censored articles than in the remaining articles. They further differentiated between general terms and those that were unique identifiers of entities (such as place names or organizations), times, or quantities. They found that these "specific" terms were especially perilous, increasing the probability of an article being censored considerably.
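The logic of the random forest step can likewise be sketched: fit the model on term counts with censorship status as the label and inspect which terms carry the most predictive weight. The data below are invented toy examples, and the code is a simplified stand-in for the authors' actual modeling pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy data: 1 = censored article, 0 = matched uncensored article (hypothetical).
texts = ["protest at the central square on 5 march",
         "weekend market at the square",
         "petition by the workers union in the capital",
         "recipe collection by the cooking club"]
labels = np.array([1, 0, 1, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, labels)

# Terms with the largest importances are candidates for "perilous" words; the study
# additionally checks whether they are specific identifiers (places, times, quantities).
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(forest.feature_importances_)[::-1][:5]
print(list(zip(terms[top], forest.feature_importances_[top].round(3))))
```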

The authors go on and add to this analysis in further steps. But for our purposes, we have seen enough. So, let's stop here.

The authors connect their very specific findings to literature on conflict, multi-party games, and coordination. Based on considerations from these theoretical literatures, they conclude that Chinese censorship reacts strongly and negatively to "specific terms" as those might serve as focal points for subsequent coordination or mobilization of users. Thus, not only are ideas suppressed but also the linguistic signifiers that allow people to coordinate around specific causes or places.

The study of government censorship is a continuously moving target, as censors and censored remain locked in a cat-and-mouse game. The study by Yun Tai and King-Wa Fu is therefore surely not the final word on internet censorship or Chinese censorship. Still, their study provides an important puzzle piece in this debate. More important for us, their study provides a highly creative and instructive example of how to use data collected on digital media for the study of speech and government control. Further, it is also an interesting case of how to connect the highly specific and often abstract results from computational text analysis with more general theoretical debates in the social sciences.

All the more reason for you to read the study yourself.

2.4.4. It's attention, not support!

Sometimes, we do not only want to learn about what happens in digital communication environments. Sometimes, we want to learn about the world beyond. Digital trace data can also help us in the pursuit of these questions. But we need to be a little more careful in reading them, in order not to be misled. A look at studies using digital trace data to learn about public opinion is instructive.

Many people use social media to comment publicly on their views about politics or on current events. Taking these messages and trying to learn about public opinion from them could be a good idea. But right from the start, there are some obvious concerns.

First, not everybody uses social media, and those who do differ from the public at large. Also, not everybody who uses social media posts publicly about politics or news. So we are left with an even smaller, potentially even more skewed section of the public on whose public messages we base our estimate of public opinion.

Second, public posts on politics and the news are public. This might seem like a truism but it is a problem for studying public opinion. Many people hold opinions about politics and the news, but only a very politically active and dedicated person will post them publicly, especially in the case of political controversy. By publicly commenting on politics on social media, people demonstrate their political allegiances and convictions for all the world to see. This includes family, friends, colleagues, competitors, and political opponents. Everyone can see what they think about politics and is invited to comment, silently judge, or take screenshots.

Third, not only academics are turning to social media to learn about public opinion. For example, journalists, politicians, and campaign organizations are all watching trends and dynamics on social media closely to get a better sense of what the public thinks or is worried about. We might worry about the power they give social media to influence their actions or thinking, given the biases listed above, but this will not make them stop doing it. So anyone publicly posting about politics might not be doing so to express their honest and true opinion. Instead, they might be doing it tactically to influence the way journalists or politicians see the world and evaluate which topics to emphasize or which positions to give up.

As a consequence, studying public opinion based on public social media messages means studying comments, links, and interactions by a highly involved, potentially partisan, non-representative group of people, a portion of whom might be posting tactically, in order to influence news coverage or power dynamics within political factions.

Anyone looking at these obstacles and still thinking that social media posts might be a good way to learn about political opinion truly must be fearless. But as it turns out, there are many who try to do just that. Could it be that they are right?

In joint work Harald Schoen, Oliver Posegga, Pascal Jürgens and I decided to check what public Twitter messages can tell us about public opinion and what they can't. In the 2017 paper "Digital Trace Data in the Study of Public Opinion: An Indicator of Attention Toward Politics Rather Than Political Support" we compared Twitter-based metrics with results from public opinion polls.

To get access to relevant Twitter messages, we worked with the social media data vendor Gnip. We bought access to all public Twitter messages posted during the three months preceding Germany's 2013 federal election that contained mentions of eight prominent parties. Given Twitter's current data access policy, we probably would have simply used the official Twitter API to identify and access relevant messages. Back then, this was not possible. As we were only interested in public opinion in Germany, we only considered messages by users who had chosen German as their interface language when interacting with Twitter. This choice might underestimate the total number of messages referring to the parties in question, but the resulting error should not systematically bias our findings.

We then calculated a number of Twitter-based metrics for each party, following choices by prior authors who tried to infer public opinion during election campaigns from Twitter. These included the mention count of parties in keywords and hashtags, the number of users posting about parties in keyword or hashtag mentions, the number of positive and negative mentions, and the number of users posting positively or negatively about a party. To identify the sentiment of messages mentioning a party, we used a Twitter convention prevalent in Germany at the time. Users identified a message as being in support of or in opposition to a party by using its name in a hashtag followed by a + or - sign (e.g. #cdu+ or #csu-). This choice might not be replicable across other countries or time periods, but it is a robust approach to check whether, in our case, explicitly positively or negatively tagged messages are more strongly connected with public opinion than plain mentions.
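As an illustration of how such metrics can be computed, here is a minimal sketch on a handful of invented tweets. It is not the code used in the paper; it simply counts mentions and mentioning users per party and uses the hashtag convention described above to tag positive and negative mentions.

```python
import re
from collections import defaultdict

# Hypothetical sample of party-related tweets: (user, text).
tweets = [
    ("u1", "Good speech tonight #cdu+"),
    ("u2", "Cannot believe this proposal #csu-"),
    ("u3", "The cdu press conference starts at 8"),
    ("u1", "#cdu+ again, strong debate performance"),
]

parties = ["cdu", "csu"]
users = {p: defaultdict(set) for p in parties}
counts = {p: defaultdict(int) for p in parties}

for user, text in tweets:
    lowered = text.lower()
    for party in parties:
        if party in lowered:
            counts[party]["mentions"] += 1
            users[party]["mentioning_users"].add(user)
        # Sentiment via the German hashtag convention of the time: #party+ / #party-
        if re.search(rf"#{party}\+", lowered):
            counts[party]["positive"] += 1
            users[party]["positive_users"].add(user)
        if re.search(rf"#{party}-", lowered):
            counts[party]["negative"] += 1
            users[party]["negative_users"].add(user)

for party in parties:
    print(party, dict(counts[party]), {k: len(v) for k, v in users[party].items()})
```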

This variety of considered metrics reflects the challenge in working with digital trace data discussed by Howison, Wiggins, and Crowston [2011] under the term event data. We might have objective ways to count the events in which users mention political parties in a specific way but we have to interpret what this event tells us about their intentions or attitudes.

We then compared our Twitter-based metrics with the vote share of each party on election day and compared the resulting error with that of opinion polls. All chosen Twitter-based metrics performed massively worse than estimates based either on the results of the previous federal election in 2009 or on polling results. Calculating metrics per day instead of aggregating them over the whole time period showed this error fluctuating strongly over the course of the campaign, with no apparent improvement.
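The comparison itself boils down to a simple error calculation. The sketch below uses invented numbers purely for illustration (see the paper for the actual figures and metrics): mention counts are normalized to shares and compared with vote shares via the mean absolute error, one common error measure, alongside a final pre-election poll.

```python
import numpy as np

# Hypothetical numbers for illustration only.
parties    = ["A", "B", "C", "D"]
vote_share = np.array([41.5, 25.7, 8.6, 8.4])            # election result (%)
poll_share = np.array([40.0, 27.0, 9.0, 9.5])            # final pre-election poll (%)
mentions   = np.array([120_000, 80_000, 95_000, 60_000])  # Twitter mentions per party

# Normalize mention counts to shares so they are comparable to vote shares.
mention_share = mentions / mentions.sum() * 100


def mae(estimate):
    """Mean absolute error against the actual vote shares, in percentage points."""
    return np.mean(np.abs(estimate - vote_share))


print(f"MAE polls:    {mae(poll_share):.2f} percentage points")
print(f"MAE mentions: {mae(mention_share):.2f} percentage points")
```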

Again, this finding connects back to the concerns raised by Howison, Wiggins, and Crowston [2011] with the term longitudinal data. Mentions of parties happen over a long time period before an election. There is no fixed rule by which to decide over which time period to aggregate mentions when calculating a metric to predict election outcomes. Any choice in this regard is therefore arbitrary.

We get a better sense of the nature of party mentions on Twitter by looking at their temporal distribution. The number of party mentions spiked in reaction to highly publicized campaign events. People commented on politics in reaction to campaign activities, candidate statements, and media coverage. Twitter mentions were therefore indicative of attention paid to politics by politically vocal Twitter users but did not allow us to infer the public's attitudes or voting intentions.

The paper thus shows that there is indeed something we can learn about the world by looking at social media, it just might not be everything we tend to be interested in. In the case of Twitter and public opinion, it looks like Twitter data can show us what politically vocal people on Twitter were paying attention to and reacting to. But it did not allow for the inference of public opinion at large or the prediction of parties' electoral fortunes. Getting this distinction right between what we wish digital trace data to tell us (or what other people claim they tell us) and what they actually can tell us, given the underlying data generating processes and their inherent characteristics, is important. Only by addressing this challenge explicitly and transparently will work based on digital trace data mature and achieve greater recognition in the social science mainstream.

The two studies chosen here offer only a small spotlight on the possibilities in working with digital trace data. As was the case with the examples chosen to illustrate work based on text analysis, there are many other interesting and instructive studies out there. So please do not stop with the studies discussed here, but read broadly to get a sense of the variety of approaches and opportunities open to you in working with digital trace data.