3 Data
Data are crucial for the discussion of politics and digital media. Understanding the core concepts and issues arising from the quantification of social and political life and the resulting data is important for engaging in many of the subsequent controversies over the uses of digital media in politics. Digital media, devices, and sensors collect data documenting the world, society, and human behavior. This has been seen by some as a measurement revolution, providing many new avenues for the social sciences as well as new business opportunities in the economy. Perceived potentials and dangers in the increases in the volume and breadth of coverage of digital data are broadly discussed, but it is also important to examine how these new data sources relate to the social or behavioral phenomena they supposedly cover. New data riches have to be translated into meaningful measures of society and the phenomena of interest.
Digital data are the newest step in the quantification of reality and social life. This leads to three important questions that need to be addressed in research: How do we turn the world into numbers? What do these numbers enable us to do? And how do we as a society structure this process and define rules and regulations for what is and is not allowed? These are powerful questions for framing research projects, and they will help us structure this chapter.
This chapter will start by presenting core issues arising from the quantification of social and political life. Following this, it will introduce readers to hopes and limits connected with the term big data. This will be followed by the discussion of fundamental questions of scientific measurement. Once these foundational concepts and questions are discussed, the chapter will turn to how political organizations and other actors are translating the world into data and are using data to better understand their environment and the effects of their actions, control their members, and improve their work. This then will form the basis of discussing the trade-offs between increasing the capacity of organizations and states through data and the considerable privacy concerns of people whose data are collected, processed, and potentially shared with others.
The chapter will introduce students to core issues of data collection, use, and governance through digital media and devices. They will learn core concepts in the discussion about data and measurement and will encounter key examples and trade-offs that need to be considered in the examination and discussion of data uses by political organizations and actors and the governance of data collection and data use through regulators and the state.
3.1 Data and quantification
Computation and digital media have reinforced interest in the opportunities and challenges of data and quantification in various societal fields. Data promise the objective observation of the world, insights into hidden patterns and causes, and foresight into future developments or the outcomes of specific choices or interventions. Data are crucial for making the world legible and changing it through targeted interventions. In this sense, data provide the basis for the modern world and its scientific understanding.
Digital media and computation have increased the availability of data in ever more fields. They have also extended analytical opportunities. It is no surprise then to find digital data to be a topic of both enthusiasm and fear. But before we can take a closer look at both hopes and fears, we first have to be clear about what exactly we mean by the term data and what the preconditions are for data providing a true representation of the world.
Data are symbolic representations of entities in the world and their relationships. They are the result of some measurement process that maps entities’ properties to numerical values of a variable. The numerical relationship between variables represents the relationship between entities in the real world.1
This definition points to important features of data that make them useful but also constitute limits to their use. Data provide reduced symbolic representations of specific characteristics of entities of interest. Numeric symbolic representation allows for the documentation of entities, events, or behaviors. Mathematical calculation allows the identification of causal or correlative connections between recorded entities. Models developed based on these connections allow the prediction of likely outcomes given specific inputs. Data are therefore an important resource: they allow for a deeper understanding of the past, control of the present, and even provide a glimpse of the future. But to do so, we first must translate entities of interest in the world into reduced symbolic, numerical representations. This process is called quantification.
Quantification refers to the process of translating entities, or selected characteristics of entities, into numbers.2
Quantification, the reduction of entities to numbers, is powerful. It allows us to document past and present, as well as to plan for the future without having to account for the richness of entities in the world. The translation of entities of interest into numbers provides a model of entities, their relationships, and potentially the world. It also allows for the performance of mathematical operations. These operations do not only speak to the numeric representation, the model, but also allow for inferences on the underlying entities and phenomena, potentially uncovering otherwise hidden patterns and causal relations, or the prediction of future developments or effects of interventions. Data not only represent the world but also allow for targeted interventions.
While quantification offers impressive new opportunities, it also has limits. In reducing entities in the world to numbers, quantification reduces the world to countable signals. This means disregarding much of the world. This can be useful, but it can also be dangerous. In the process of quantification, unimportant features of entities might be counted, while important ones might be missed or disregarded. Or in the process of reducing entities to numbers, their essence might be lost. Accordingly, anyone working with data will have to consider the underlying data generating process and, if necessary, critically interrogate it. We will return to this when we look at measurement.
Also, quantification not only represents the world; in some sense it recreates it. Quantification assigns entities, or selected features, to categories. These categories then become the structuring devices people use to learn about the world and shape it through interventions. The definition of categories and the interpretative mapping of entities and features become important constitutive features of data-driven reasoning. While the analysis of data follows objective rules of probability and mathematics, the process of defining categories and assigning entities is constructive and interpretative. Accordingly, these aspects of quantification need to be subject to critical interrogation.3
Quantification and resulting data promise insights about past, present, and future. But to distill insight from data, we must not only analyze data and their inherent relationships. We also need to actively account for limits of quantification and measurement approaches in relation to our interests. Otherwise even the most impressive efforts in quantification and quantitative analysis will lead to mistaken views about the world.
3.2 Big data
Digital media and digital technology more broadly have provided society and researchers with new data sources that appear to be cheap and ubiquitous. In the past, collecting data was difficult and expensive. People had to actively go out, count, and record their objects of interest. In social science, they had to run expensive surveys in which people visited or called respondents and had them answer questionnaires. This made data expensive to collect.
Digital technology changes this. Existing, analogue data sets are digitized and thereby made broadly available and accessible for computational analysis. At the same time, digital technology collects original data continuously. Examples include digitally enabled sensors that collect information about their environment and translate signals into data. Sensors like this can be found in cars, mobile phones, and many internet enabled smart devices. Additionally, people’s interactions with digital services create data traces, documenting their behavior.
These data are collected and used by device makers and service providers and can serve as a basis for the improvement of existing devices and services or the creation of new ones. While these data sources are primarily available to commercial actors, academics can gain access to selective slices of these data through standardized open protocols – such as application programming interfaces (APIs) defining rules and limits of public data access. Alternatively, academics can pursue privileged collaboration with companies. Digital technology thus promises a new data abundance for business, society, and academia. Data are suddenly cheap. This created a lot of enthusiasm among business consultants, journalists, and some scientists about the supposed potentials of unknown data riches. The term big data has become the focal point of this enthusiasm.4
Originally, the term big data referred to large data sets that technically could not be held or processed in one database or on one machine. But soon it gained popularity as a term covering the new data riches provided through digital technology and the associated economic benefits. One of the earlier characterizations of big data was proposed by the business consultancy Meta Group and – after its subsequent acquisition by the consultancy Gartner – has become known as the Gartner definition.
“‘Big data’ is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Laney, 2001)
Since the original formulation of this definition in 2001, many additional Vs have been suggested to cover new or neglected characteristics of big data. But the original three Vs – volume, velocity, and variety – should suffice for our purposes. They illustrate the suspected promises of big data very clearly, while also pointing to one of the crucial shortcomings of subsequent efforts and debates. The definition points to its origin, being interested in the technical issues arising from handling data sets made available through digital technology. Data come in great volume, surpassing the capabilities of standard computational set-ups and statistical methods. Data come in great velocity, on the one hand allowing for real-time analysis of unfolding phenomena, but at the same time also providing challenges by shifting features within the data and data generating processes. Finally, data come in great variety, such as text, image, video, audio, or meta-data. As a consequence, data need to be structured in order to allow for subsequent analysis.
The definition puts its focus clearly on the technical features of big data, and not on their characteristics as representations of objects in the world. This can be forgiven, given the definition’s origin with a technical consultancy trying to prepare its clients for future opportunities or challenges. Unfortunately, the same focus on technical features, simplistic fascination with size and volume, and an overwhelming disregard for issues arising from the translation of entities in the world into symbolic representations have dominated the subsequent debate and use of the term. This is less forgivable.
Nearly all discussions of big data treat them as true representations of whatever happens to be of interest to the speaker. This could be buying intentions, psychological traits, or political affiliations. Whatever happens to be the desire of the inquirer, big data shall provide. If we were to follow the big data boosters, we could expect to find anything and everything in big data, no matter what we are looking for.
This is of course not the case. What goes for other types of data also holds for big data. The translation of entities into symbols entails the construction of their representation. Quantifying the world depends on creatively translating objects of interest into measurable signals and translating those into data. These interpretative steps are just as important in the age of big data as before. Arguably they are more important now than ever, since people now have access to data not primarily collected with their analytical goals in mind. Realizing the potential of these found data for research demands very active and creative steps in quantification. Let’s take a look at one of the most prominent and talked about categories of big data: digital trace data.
“(…) records of activity (trace data) undertaken through an online information system (thus, digital). A trace is a mark left as a sign of passage; it is recorded evidence that something has occurred in the past. For trace data, the system acts as a data collection tool, providing both advantages and limitations.” (Howison et al., 2011, p. 769)
Examples of digital trace data include messages and metadata documenting contributions, behavior, and interactions of users of social media services, like X, Reddit, or YouTube. Due to their comparatively easy accessibility through application programming interfaces (APIs), data like these have become prominent in the work of social and computer scientists. Additionally, they provide the basis for many consultancy and media analysis services. Digital trace data have therefore come to provide representations of social systems and human behavior for academic work as well as for professionals in media and other businesses.
But there are specific challenges to symbolic representations of the world based on digital trace data. While early enthusiasts expected digital data sources like these to provide all-encompassing insights into human behavior and social systems, time has surfaced severe limitations and sources of bias. Two sources of bias are especially relevant in working with big data built from digital trace data: biased coverage and biased behavior.
A bias in a data set refers to a systematic error or deviation from a representative sample. It occurs when a data set doesn’t accurately represent the population or domain it claims to, leading to skewed results. For example, a dataset claiming to represent the general population would be biased if it primarily consisted of individuals between the ages of 18 and 36.
Data collected by digital services, while high in volume, are still limited by who is using these services to begin with. This means that a service’s active user base determines the share of social life that is covered by data collected on it and that can be meaningfully quantified and modeled. The active user bases of even the biggest social media services are highly concentrated within specific demographics and systematically exclude specific socio-demographic groups. This is true for age, by systematically underrepresenting older people, but potentially also for political leaning, with different services catering to partisans of different political stripes. Some of these biases in the coverage of these data sets can in principle be identified; others remain hidden, be it for lack of information about users or for unobserved shifts in the composition of the user base contributing to a given data source.
There is also the risk of measuring biased behavior through digital trace data. Digital trace data only offer us access to behavior of people mediated through the affordances, interfaces, and code of the providers of digital services. For example, we might be interested in the attitudes of X users. But the only thing we get is their public posts. These might be true expressions of opinions, thoughts, and reactions. But they might just as well be the results of a strategic public performance, in which users play a public role and post fitting messages. Or they might be the result of behavioral incentives provided by the services themselves, such as algorithmic content selection.
These are just a few examples of how the information system recording and providing data also shapes the data. So in the end, we cannot be sure that we measure human behavior beyond the digital environment it was collected in. For the purposes of using the data for inferences beyond the confines of the system generating them, the data are therefore clearly biased and unsuited.5
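Biased coverage of this kind can be made visible by comparing the composition of a service's user base with that of the population it is meant to stand in for. The following is a minimal sketch in Python. The age groups and all shares are invented for illustration and do not describe any real service or population; `coverage_gap` is a hypothetical helper name.

```python
# Hypothetical age-group shares; all numbers are invented for illustration.
population_share = {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22}
platform_share = {"18-29": 0.42, "30-49": 0.38, "50-64": 0.15, "65+": 0.05}


def coverage_gap(sample, population):
    """Return sample share minus population share for each group."""
    return {group: round(sample[group] - population[group], 2) for group in population}


gaps = coverage_gap(platform_share, population_share)
for group, gap in gaps.items():
    # A positive gap means the group is overrepresented among users.
    label = "over" if gap > 0 else "under"
    print(f"{group}: {label}represented by {abs(gap):.2f}")
```

A comparison like this only works for characteristics that are actually observed for both users and the population. Biases along unobserved characteristics stay hidden, which is the harder problem described above.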
These limits of digital trace data, and big data more generally, are beginning to feature more clearly in the debate. Early on, these concerns were largely ignored in a spirit of daring can-do. But over time, the limits of work based on data like these became more apparent. Also, the more these data came to matter beyond the confines of academia, the more the accuracy of diagnoses and prediction based on these data came to matter. An important factor in this was the growing use of large data sets in applications and services enabled through artificial intelligence (AI).6
The richness and volume of newly available digital data are a core contributor to recent advances in artificial intelligence (AI). With the growing awareness of AI-enabled opportunities as well as associated risks, biases in data sets have received increasing attention. At the core of associated concerns are fears of biased data sets leading to biased outcomes of decision making based on them. Especially automated algorithmic learning and decision making have received much attention in this discussion.7
Questions of who and what gets counted in big data are thus receiving increasing attention. Still, there remains much to do to overcome the current naive positivism prevalent in the work with big data. One core issue in the work with big data, and in quantification in general, is how the entities in the world get translated into symbols. This brings us to the question of measurement.
3.3 Measurement
By translating observations into numbers, quantification provides opportunities for new and important insights about the world and objects of interest. But as with any translation, making entities and phenomena countable also means losing some of their features. Quantification makes some things visible, while hiding others. To better understand this process, we have to examine how things become numbers: we have to examine measurement.8
“The assignment of numbers to represent the magnitude of attributes of a system we are studying or which we wish to describe.” (Hand, 2004, p. 3)
Today, measurement is pervasive and underlies much of contemporary life. The economy runs on data, models, and prognoses allowing for the optimization of production, the planning for inventories, or the setting of prices. States use them for policy planning, designing interventions, and the allocation of state resources. Scientists depend on them for the explanation of the world and the prediction of trends. We ourselves carry personal trackers that measure our movements and heartbeat to determine our fitness and track our training progress. This pervasiveness of measurement makes it difficult to imagine that this might have been different.
The historical roots of measurement and data are very prosaic. They lie in the late Middle Ages and start with accounting. International merchants started to rely on numbers to keep track of orders and inventory, allowing them to run intricate and far-reaching international trade networks. Soon states started to adopt these innovations in order to increase their ability to collect taxes and raise armies. The foundations of measurement are therefore very practical and lie at the heart of the processes that gave rise to modern society.9
Looking more closely at measurement, we can differentiate between two measurement approaches: representational measurement and pragmatic measurement.
On the most fundamental level, measurement can be the mapping of empirical relationships between distinct objects by quantifying specific observable attributes. The resulting numerical relationships represent the empirical relationships between the objects. This is representational measurement.10
For example, we can examine the relative strength of a protest movement over time by counting participants in protest events at different points in time. If we count more participants at events over time, we can conclude that the movement gains strength. If we count fewer participants over time, we can infer the opposite. This type of measurement is straightforward. Numerical values assigned during the measurement process are constrained by the relationship of empirically observable characteristics. But as we will quickly find out, most phenomena of interest – especially in the social sciences – do not lend themselves to this direct measurement approach.
The assignment of numerical values to variables in such a manner that the numerical relationships between variables correspond to empirically observed relationships between measured entities. This assignment is not arbitrary; instead, it’s directly informed and constrained by the patterns and properties observable in the actual entities being studied.
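The protest example can be written down as a short sketch in Python. It shows how, in representational measurement, the numerical relationship between counts mirrors the empirical relationship between events. The participant counts are invented for illustration, and `trend` is a hypothetical helper, not an established measure.

```python
# Hypothetical participant counts at four successive protest events.
participants = [1200, 1850, 2400, 3100]


def trend(counts):
    """Classify a series of counts as growing, shrinking, or mixed."""
    # Differences between each event and the one before it.
    changes = [later - earlier for earlier, later in zip(counts, counts[1:])]
    if all(change > 0 for change in changes):
        return "growing"
    if all(change < 0 for change in changes):
        return "shrinking"
    return "mixed"


print(trend(participants))  # prints "growing"
```

The ordering of the numbers carries the meaning here: more counted participants corresponds directly to a larger observed crowd. Whether a larger crowd in turn means a stronger movement is already an interpretative step.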
Things become a little more difficult in the measurement of phenomena that do not lend themselves to direct observation but that still merit quantification. The state of the economy, collective happiness, public opinion, people’s psychological traits or attitudes: none of these concepts can be observed directly but all are subject of measurement. This requires pragmatic measurement.11
Concepts like these do not directly map to empirically observable and directly comparable characteristics. So to measure them, scientists first have to decide how the concept of interest manifests in measurable signals. They have to construct the measurement, which means deciding on what the target concept is and how it should manifest indirectly in observable objects. While representational measurement is directly constrained by empirically observable properties of entities, pragmatic measurement is constrained by shared conventions among those doing the measuring about what constitutes valid measurement approaches and their practicality in use.
The assignment of numerical values to variables to approximate non-directly observable concepts. This is achieved through the theoretical mapping of these concepts to empirically observable indicators. While these measurements are grounded in shared conventions and are valued for their practical utility, their validity is contingent upon the robustness of the link between the theoretical concepts and their empirical proxies. Such connections, and the inferences drawn from them, remain subject to interpretation and critique.
One example of pragmatic measurement from political psychology is the measurement of latent attitudes. For example, libertarianism is an abstract concept, capturing a set of ideas about protecting the rights of individuals against the state.12 If we want to find out whether people agree or disagree with libertarianism, we cannot observe this directly. Libertarians share no physical characteristics that we could empirically observe and that would make them different from others, say egalitarians. Instead, scientists have proposed a set of statements, each of which corresponds to some aspect of libertarian ideas or convictions.13 Those survey respondents who agree with these statements more strongly or consistently than others, we can label libertarians. Of course, the statements used to measure libertarianism are open to interpretation and critique as to whether they truly capture the concept of interest or not. Also, we could use other ways to construct a pragmatic measurement of libertarianism. For example, we could look at digital media posts and classify them as being in accordance or conflict with libertarian ideas. Doing so would allow us to classify users as libertarians based on their propensity to post content expressing connected statements or ideas.
Pragmatic measurement is a powerful approach to examine and structure the world, especially with regard to concepts or phenomena that do not lend themselves easily to direct empirical observation. At the same time, pragmatic measurement does not only represent reality, it also constructs it by defining how to measure concepts and phenomena which cannot be directly observed empirically. Accordingly, the process of defining these measures warrants close observation and critical interrogation, especially with regard to whether the proposed measures correctly and fairly represent the supposed target concepts or whether they unfairly represent institutional, economic, or political interests.
The quantification of the relative strength of factions in political competition offers examples of both representational and pragmatic measurement. The most direct way to determine the relative strength of political factions is counting votes after an election. This is a case of representational measurement. The vote count is a direct representation of the empirically determined votes in favor of either party.
But although this measure has a clear and easily identifiable empirical counterpart, the interpretations of what these votes mean diverge strongly. Some will claim that the winning party has a clear mandate to implement its platform. Why would people have voted for it otherwise? Others will claim that being successful at the polls does not provide any indicator that people actually support the positions of a party. They could merely have voted for it out of protest given their opposition toward the political status quo. Still others might point out that a vote for a party does not speak of policy support or protest. It might simply be a tactical choice to create conditions under which parties might be forced into a coalition government.
These are just three possible interpretations among many others. These examples are not meant to be exhaustive. They simply illustrate that even for data that are the result of a direct and simple representational approach, interpretations of their meaning diverge widely, always depending on the interests and intentions of those doing the interpreting. This is even more so with measures of public opinion.
In between elections political parties’ fortunes might shift. But without votes to count, their relative strengths are hard to quantify. It is here that public opinion research comes in. Public opinion is an abstract concept with no direct empirical expression. This makes public opinion an object for pragmatic measurement. To determine public opinion at any time, people are surveyed, be it for their support of parties, assessment of politicians, or opinions on policies or political topics. While answers in these surveys can be counted, the reading of these numbers is open to interpretation and construction. Accordingly, public opinion surveys serve interested parties as instruments in the pursuit of their goals. The specifics in the measurement of public opinion – the samples drawn, questions asked, weighting procedures applied, and interpretations – are subject to critical reflection and interrogation. The same goes for the impact of the quantification of public opinion and political will, in general.14
It is important to keep the distinction between representational and pragmatic measurement in mind. Both representational and pragmatic measurement translate phenomena into seemingly objective numbers. But while in the case of representational measurement these numbers are constrained by directly empirically observable characteristics, pragmatic measurement defines what empirically observable signals are seen as an expression of underlying empirically unobservable concepts or phenomena. In the best case, this makes invisible but important aspects of the world visible and opens them up for documentation, tracking over time, prediction, and targeted interventions. In the worst case, pragmatic measurement hides its interpretative and constructive characteristics behind the seeming numerical objectivity borrowed from the more narrowly empirically constrained representational measurement approach. In interpreting and critically interrogating data, we therefore need to take into account by which approach they were measured.
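A survey-based version of this measurement can be sketched in a few lines of Python. Everything specific in the sketch is an assumption made for illustration: the four items, the five-point agreement scale, the respondents' answers, and the cutoff are invented and do not reproduce any published scale.

```python
from statistics import mean

# Hypothetical agreement ratings (1 = strongly disagree, 5 = strongly agree)
# on four invented statements expressing libertarian ideas.
responses = {
    "respondent_a": [5, 4, 5, 4],
    "respondent_b": [2, 1, 2, 3],
    "respondent_c": [4, 4, 3, 5],
}

# The cutoff is a convention chosen by the researcher, not an observed fact.
CUTOFF = 4.0

# Index score: average agreement across all items.
scores = {name: mean(ratings) for name, ratings in responses.items()}

# Label respondents whose average agreement reaches the cutoff.
labels = {name: score >= CUTOFF for name, score in scores.items()}

for name in responses:
    print(name, scores[name], "libertarian" if labels[name] else "not libertarian")
```

Moving the cutoff from 4.0 to 4.5 would relabel `respondent_c` without any change in the underlying answers. This illustrates why the choices made in constructing a pragmatic measure need to be open to critique.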
Additionally, measurement might fail on a practical level. On a very foundational level, there is the question of a measure’s validity.15 A measure is valid if it captures the phenomenon of interest correctly. For example, asking people to state the frequency of media use in the recent past and treating respective answers as objective would represent an invalid measurement.16 Personal recollections of something as ephemeral as media use cannot serve as documentation of actual media use.
A measure also needs to be reliable.17 It needs to produce consistent quantified representations of the empirical features of an entity. A simple example of an unreliable measure is a scale that returns wildly fluctuating weight assessments at repeated weighing of the same object. An example of the reliability of a more complicated measurement approach would be two people tasked with content analysis returning, independently of one another, the same content classification. Conversely, if they failed to do so, the measure would not be reliable.
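Agreement between two coders can be quantified. The sketch below, in Python, computes simple percent agreement and Cohen's kappa, a common statistic that corrects agreement for what would be expected by chance. The category labels and codings are invented for illustration, and kappa is only one of several reliability statistics in use.

```python
from collections import Counter

# Hypothetical classifications of the same eight posts by two coders.
coder_1 = ["political", "other", "other", "political", "other", "other", "political", "other"]
coder_2 = ["political", "other", "political", "political", "other", "other", "other", "other"]


def percent_agreement(a, b):
    """Share of items on which both coders chose the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance, based on each coder's category shares."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both coders pick the same category
    # if each assigned categories independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    # Note: undefined (division by zero) if expected agreement equals 1.
    return (observed - expected) / (1 - expected)


print(percent_agreement(coder_1, coder_2))
print(round(cohens_kappa(coder_1, coder_2), 2))
```

In this invented example the coders agree on six of eight posts, yet kappa is markedly lower than the raw agreement because much of that agreement could arise by chance. Which value counts as acceptable is itself a matter of convention.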
There is also the issue of bias of measures.18 Measures are biased if their results contain systematic errors. To stay with the example of a scale, a scale would be biased if it systematically adds a specific value to, or subtracts it from, each measurement. For an example of a more elaborate measurement approach we can turn to public opinion surveys. A biased sampling process might lead to specific parts of the population being underrepresented among survey respondents. Accordingly, their preferences cannot be registered, which in turn leads to the return of a biased measure of public opinion. Questions of biased datasets and results of subsequent analyses have gained high relevance in the discussions about artificial intelligence and algorithmic decision making. Relying on datasets that systematically over- or underrepresent marginal or historically discriminated groups in society risks the continuation of discriminatory or marginalizing practices into present and future through reliance on the results of seemingly objective data analyses. We will return to this issue later.
Finally, we also need to consider that in some cases the measured objects react to measurements and change their characteristics. This is especially true for the social sciences. Publishing prognoses on expected economic developments will lead to people adjusting their behavior in order to profit from these developments, thereby potentially reinforcing the predicted trends or rendering them moot. Similarly, publishing data on the prevalence of hate speech on digital platforms can lead to the platforms adjusting their policies. This might mean filtering potential hate speech at the point of publication, deleting it following moderation after publication, or shadow banning it, making it invisible to all users but the author. These adjustments change the prevalence of hate speech but also invalidate the previously used measurement approach. Quantifying entities that have the ability to react to measurement approaches and their results can therefore limit the long-term use of these approaches and present non-trivial challenges to the validity of identified patterns or predictions over time.
These limitations and challenges are well understood for traditional measurement approaches with clearly defined measurement targets and instruments. Here, we have clearly defined quality control procedures that allow us to assess potential biases or reliability issues. These issues are less well understood – or even acknowledged – for newer measurement approaches, as in the work with big data.
In light of this discussion about measurement, its characteristics, limitations, and challenges, we can conclude that quantification, the symbolic representation of the world in data, does not necessarily lead to an objective and correct model of the world. Instead, measurement can both on a conceptual and on a practical level introduce errors and biases leading data to misrepresent the world and the relationship between objects of interest within it. The importance of data in contemporary societies means that we need to account for this by actively and transparently constructing measures as well as critically interrogating those presented by others. This is especially important in the work with new and innovative measurement approaches and data types, as for example those found on and produced by digital media.
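The effect of a biased sample on a survey estimate can be illustrated with a short sketch in Python. All numbers are invented: two age groups, their sizes in a hypothetical sample, their support for a party, and their assumed shares in the population. The correction shown is a simple reweighting by known population shares, one of several weighting procedures used in practice.

```python
# Hypothetical survey: share supporting a party within each group,
# number of respondents per group, and each group's population share.
support = {"young": 0.60, "old": 0.30}
sample_n = {"young": 800, "old": 200}
population_share = {"young": 0.40, "old": 0.60}

# Unweighted estimate: every respondent counts equally, so the
# overrepresented group dominates the result.
total_n = sum(sample_n.values())
raw_estimate = sum(support[g] * sample_n[g] for g in support) / total_n

# Weighted estimate: each group counts according to its population share.
weighted_estimate = sum(support[g] * population_share[g] for g in support)

print(f"unweighted: {raw_estimate:.2f}")
print(f"weighted:   {weighted_estimate:.2f}")
```

In this invented case the unweighted estimate is 0.54 and the weighted one 0.42. Weighting of this kind only corrects for group differences the researcher knows about, and it assumes that the respondents within each group resemble the group's non-respondents. Where that assumption fails, the measure remains biased.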
3.4 Observing the world through data
Data provide a reduced representation of the world and promise to uncover hidden patterns and connections between entities that remain invisible in their full, unreduced manifestation in the world. This allows people and organizations to make sense of the world, and increase their level of control. This includes the state, political parties, or companies. These actors need to understand unfolding developments, outcomes of interest, contributing factors, and have an understanding of how they are able to shape outcomes in their interest. For this, they need data. By reducing the complexity of society and the world, data make the world legible and thereby provide the opportunity for intervention. This includes rationalizing and standardizing society and the world by creating terms, concepts, and categories.
Data structure entities in the world according to standardized categories, allowing actors to make sense of reality, assess problems, intervene as they see fit, and evaluate their success. But this process of making the world legible means also engaging in what Hand called pragmatic measurement.19 Actors have to actively and creatively translate entities in the world into standardized and countable objects. This holds either for entities that are directly of interest to them or those they take as signals of underlying phenomena they want to track or influence. This is a process of translation and abstraction in which meaning and details are lost. This risks overlooking or misreading important elements. Still, some sort of reduction is necessary to read and interact with society and the world. The sociologist James C. Scott illustrates this in the discussion of the work of state officials:
“Officials of the modern state (…) assess the life of their society by a series of typifications that are always some distance from the full reality these abstractions are meant to capture. (…) The functionary of any large organization “sees” the human activity that is of interest to him largely through the simplified approximations of documents and statistics (…). These typifications are indispensable to statecraft. State simplifications such as maps, censuses, cadastral lists, and standard units of measurement represent techniques for grasping a large and complex reality; in order for officials to be able to comprehend aspects of the ensemble, that complex reality must be reduced to schematic categories. The only way to accomplish this is to reduce an infinite array of detail to a set of categories that will facilitate summary descriptions, comparisons, and aggregation.”
Scott (1998), p. 76–77.
But for this to work, the translation of society and the world into categories and numbers, that is, quantification, must validly account for the actual objects of interest and contributing elements. This makes it an effort not in simple representational measurement but in pragmatic measurement, foregrounding the challenges associated with the latter. This goes double for the questions of whether targeted interventions are successful or not. A correct reading depends strongly on the fitness of the pragmatic measurement process underlying the translation of the world into data. If the map of the territory is wrong or unfit, navigating by it will not lead you to your destination.
Many parties in Western democracies support their local chapters and campaigners with digital information systems providing standardized solutions for voter outreach and fundraising. Services like these were first popularized by Barack Obama’s campaign for the 2008 US presidential race. The campaign provided its local chapters with a centralized web-enabled service. This allowed the campaign to coordinate voter outreach by creating call- and walk-lists for volunteers targeting likely voters that the central campaign office had identified as promising. In the subsequent 2012 presidential race, the functionality was extended by mobile apps with which individual campaigners could log voter contacts, making them visible to campaign headquarters.
Obama’s success and the perceived contribution of data-enabled mobilization efforts popularized these services. In the following years, the functionality of the original services was extended and various new services were developed by vendors affiliated with other parties and in different countries. By now, services like these are a common feature in international campaigns in democracies, albeit with different functionalities and varying importance to the campaign depending on local data-privacy laws, resources spent on development, and the reliance of campaigns on local voter outreach.
What these services share is that they provide campaign headquarters with the opportunity to observe and shape local campaign activity more effectively than before. Whereas in earlier times headquarters depended on local chapters reporting back to them on how campaigning was going, these digital services allow headquarters to track local activities directly and to coordinate them. By having volunteers track their activities, headquarters get a view, in aggregate and in detail, of how well the campaign is going and of the energy in the field. By providing local chapters with outreach priorities, either targeted directly at addresses or more broadly at the street level, campaign leadership can coordinate activities according to centrally decided strategy.
Digital campaign technology therefore clearly supports central campaign bureaucracies in collecting and aggregating data that make local campaign activities visible and allow central bureaucracies to shape them in turn. But the successful application and deployment of these services depend on more than mere technology. What are the legal conditions for campaigns to collect and use data about prospective voters and volunteers? How many resources can an organization spend on development and training? How strongly does a campaign depend on local voter outreach? How high is data literacy within the organization, allowing for meaningful quantification, analysis, and subsequent action? These are some of the contextual conditions that need to be considered in trying to understand these services’ contribution to campaign strategy, operations, and ultimately to electoral victory or defeat.20
One important way for organizations to observe the world and shape it is metrics-based management.
Metrics are quantified measures that allow organizations the tracking, status assessment, and evaluation of predefined processes.
Organizations use metrics to increase efficiency and productivity. They break down their processes into small identifiable steps and track the achievement of these steps together with inputs and outputs. If done well, this allows for the detailed monitoring of important processes, identification of inefficiencies, the design of interventions to improve efficiency or productivity, and central control.
The idea of metric-based management is an offspring of the Taylorist approach to “scientific management”.21 In the early 1900s, the engineer Frederick Winslow Taylor advocated the calculation of standard levels of outputs for each job contributing to factory outputs. Workers who hit these metrics or outperformed them were financially rewarded, while those who fell behind were paid less. These ideas became very popular. They were adopted by US defense secretary Robert McNamara to quantify and track progress during the Vietnam War, inspired metric-based management, moved into the public service sector under the term “new public management”, and recently experienced a revival in Silicon Valley tech companies under the term “objectives and key results (OKRs)”.
The idea behind these schemes is usually the same: define a set of key steps important in the pursuit of an organization’s goals and track their achievement over time. By tracking inputs – such as raw material or working hours – and outputs – such as units produced – managers can supervise the process, incentivize the behavior of workers, or look for hidden inefficiencies. At least that’s the promise.
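The logic of tracking inputs and outputs can be sketched in a few lines. The example below is a hypothetical illustration, with invented team names and figures, of how a simple productivity metric (units produced per working hour) might be computed and used to flag a lagging unit.

```python
from dataclasses import dataclass

@dataclass
class WeeklyReport:
    team: str
    hours_worked: float   # tracked input
    units_produced: int   # tracked output

def units_per_hour(report):
    """A simple productivity metric: output divided by input."""
    return report.units_produced / report.hours_worked

# Hypothetical reports for two teams.
reports = [
    WeeklyReport("A", 400.0, 1200),
    WeeklyReport("B", 400.0, 1000),
]

productivity = {r.team: units_per_hour(r) for r in reports}
lagging_team = min(productivity, key=productivity.get)  # team with the lowest metric value

print(productivity, "lagging:", lagging_team)
```

Note what the metric leaves out: it records nothing about the quality of the units or the conditions under which they were produced, so a team can look better on it without actually serving the organization's goals better.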
In practice, management by metrics depends on the suitability of metrics covering the relevant inputs and outputs. This sounds trivial, but often what is easy to measure is not necessarily relevant, while what is relevant is not necessarily easy to measure. By failing in the pragmatic measurement process, by misconstructing measured signals, metrics-based management can lead an organization to focus on achieving metrics instead of pursuing what is necessary to produce its outputs.22 The fundamental problems in the measurement or quantification of society and the world therefore also hold in these cases.
Through the greater availability and pervasiveness of data, enabled by digital technology, the reach of metrics has extended. This is true for established businesses that now start to look to digital metrics to assess their success or failure, such as news media assessing the success of articles or journalists by the number of views or interactions they generate in digital communication environments. It is also true for actors who up until now had no access to data documenting their successes or failures in real time, such as politicians or micro-celebrities.23 These actors also now find many digital metrics available to them, supposedly tracking their fates while providing them opportunities for interventions targeted at improving them.
By a multitude of publicly visible or private metrics, digital technology provides many actors new opportunities for making legible their environment in which they pursue their goals. Unfortunately, through their simplicity and instant availability, these metrics are often not critically interrogated as to whether they actually speak to the goals of actors using them.
For example, while a politician can easily track likes on Facebook, it is much harder to determine whether these likes actually correspond with the preferences of her constituents. So, in the worst case, she might optimize her positions and rhetoric for the audience of her Facebook posts while losing sight of the preferences of her actual voters.
If not used consciously and critically, metrics can easily create a quantified cage for actors relying on them. By making it easy to optimize toward them, metrics can provide the illusion of control to central authorities using them to govern an organization. But while an organization might do well according to the abstract quantification of the world – metrics – its actual fate might be much less benign. As Scott (1998) points out, while an abstract map is necessary for the centralized control of complex systems, the same map can mislead governing units and navigating through it can lead to failures, small and large. This is just as true for metrics in the age of big data, as for those coming before.
3.5 Shaping the future through data
Symbolic representations of the world do not only allow people and organizations to see the world, they also allow them to shape the future. Building mathematical models of the world provides new ways to form expectations about relevant future developments as well as effects of specific actions and interventions. Having access to better models of the world allows people and organizations to position themselves with greater foresight regarding future developments and the effects of their actions. This allows them to outperform the competition and shape the future according to their interests.
In the given context, a model is defined as the formal expression of structure within data. While a model might represent structures between real-world entities, such a correspondence is not guaranteed. The fidelity of this representation depends on the accurate portrayal of these entities within the dataset in question.
Models can be categorized as:
Descriptive: These models articulate a structure found in data without making claims about causality or explaining relationships.
Mechanistic: These models formally denote relationships between variables based on theoretical expectations. In doing so, they translate real-world theories into formal, numerical representations.
Models rely on the numeric representation of the world. This allows for the mathematical identification of structures between sets of symbols. As far as these symbols provide an accurate representation of the world, structures identified through mathematics should also account for structures between the represented entities in the world. This includes the identification of relationships between variables and the entities they represent, such as their systematic co-occurrence or the presence of one entity causing that of another.
For example, large online retailers can analyze buying patterns and build statistical models of which products tend to be bought together or subsequently. Of course, more interesting are models successfully identifying events or actions that cause users to buy an item. With information about exposure to ads, clickstreams, or buying history, the identification of causal models might be possible as well. More generally, the same logic applies to recommender systems that point users to content or products given their own prior browsing or buying history or that of other users who resemble them in features relevant to the model.24 These models allow online services – be they retailers, news sites, or content providers – to shape the information users see in ways that in the past generated outcomes of interest – be they sales, clicks, or interactions. Technology companies shape the future by structuring the digital environments users find themselves in according to insights from models.
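A descriptive model of which products tend to be bought together can be as simple as a table of co-purchase counts. The sketch below uses invented shopping baskets and a hypothetical helper, `co_purchase_counts`, to show the idea; it makes no causal claim about why items co-occur.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each unordered pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical purchase histories.
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["stove", "fuel"],
    ["tent", "sleeping bag", "fuel"],
]

counts = co_purchase_counts(baskets)
most_frequent_pair = max(counts, key=counts.get)

print(most_frequent_pair, counts[most_frequent_pair])
```

A retailer could use such counts to suggest a sleeping bag to someone buying a tent. Whether showing the suggestion actually causes additional purchases is a separate question that counts like these cannot answer on their own.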
Given enough resources, data access, and dedicated personnel, organizations can capitalize on the opportunities of data analysis. This includes governments, parties, or collective action organizations. Still, while the promises of data-driven practices are often proclaimed and discussed, realizing their potential carries considerable demands: resources, analytical capabilities within an organization, quantifiable and quantified aspects of interest, and an organizational culture open to integrating data-driven insights and adjusting tactics and strategy accordingly. These preconditions are found, or even discussed, much less often than broad speculations about data-driven promises. Accordingly, the potential of data and models is often overestimated and needs to be critically interrogated for specific cases and contexts.
An important feature of having models about the workings of the world – or at least that part of the world one is interested in – is that models allow us to predict future developments or the effects of actions and interventions. As long as they are accurate, of course. Having accurate models about the world allows for the prediction of the future. This includes future large-scale developments – as in models of the economy or the climate – as well as future behavior of people – as in buying decisions or consumption patterns. As David J. Hand puts it:
“The mathematical model represents an understanding of how things work, and this understanding can then be used in other situations where data collection is unfeasible. (…) using mathematical models, based on data which have been collected in simpler situations, we can produce predictions (…).”
Hand (2007), p. 68
Model-based predictions formalize knowledge gained on the basis of quantified representations of the world. They allow people, organizations, and societies to plan for the future, adjust their behaviors and actions, and potentially even shape it to their benefit. Of course, this does not mean that these predictions will always be correct. But if used diligently and in knowledge of their limitations, they can help.
Campaign organizations and political parties are also experimenting with the use of data-enabled practices and model-based predictions. Again, the Obama campaigns of 2008 and 2012 are often-cited examples that serve as templates for expectations for data in politics. But – as always – it pays to look closely.
Much has been made out of the Obama campaign’s supposed ability to identify likely voters by predicting people’s vote choices based on available data. But a close study of the precision of these targeting models by Hersh (2015) showed that the predictive power of the campaign depended strongly on the data available to it. The campaign relied on official voter files. In some states these voter files contained information on whether a voter registered as Democrat, Republican, or Independent. Hersh showed that in states where these data were available the Obama campaign was able to target its outreach much more precisely than in states where these data were not available. The quality of predicting vote choice thus clearly depended on the campaign knowing the self-declared party affiliation of prospective voters.
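As a minimal sketch of what model-based prediction means in practice, the following example fits a straight line to a handful of observations by ordinary least squares and then uses the fitted line to predict a case that was never observed. The data points and variable names are invented for illustration; real applications involve noisy data and far more careful model checking.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations collected in "simpler situations".
observed_x = [1, 2, 3, 4]
observed_y = [3, 5, 7, 9]

intercept, slope = fit_line(observed_x, observed_y)

# Prediction for a value outside the observed range.
new_x = 10
predicted_y = intercept + slope * new_x

print(f"y = {intercept:.1f} + {slope:.1f} * x; prediction at x={new_x}: {predicted_y:.1f}")
```

The prediction at `new_x` is only as good as the assumption that the relationship found in the observed range continues to hold outside it, which is one concrete form of the limitations mentioned above.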
Other prediction tasks performed better though. In their account of the Obama campaigns’ data-driven practices Nickerson & Rogers (2014) show that the campaigns used experiments to generate data that allowed them to predict the best e-mail wording to maximize donations from their supporters.
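The general logic of such experiments can be sketched as a comparison of donation rates between randomly assigned e-mail variants. The example below is not the campaigns' actual procedure; the variant names and counts are invented, and the two-proportion z statistic is just one common way to judge whether a difference is larger than chance variation would suggest.

```python
import math

# Hypothetical experiment: recipients were randomly assigned to one of two subject lines.
variants = {
    "A": {"sent": 10_000, "donations": 200},
    "B": {"sent": 10_000, "donations": 260},
}

rates = {name: v["donations"] / v["sent"] for name, v in variants.items()}
best_variant = max(rates, key=rates.get)

def two_proportion_z(successes_1, n_1, successes_2, n_2):
    """z statistic for the difference between two proportions, using the pooled rate."""
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (successes_2 / n_2 - successes_1 / n_1) / standard_error

z = two_proportion_z(
    variants["A"]["donations"], variants["A"]["sent"],
    variants["B"]["donations"], variants["B"]["sent"],
)

print(rates, "best:", best_variant, f"z = {z:.2f}")
```

Because assignment is random, differences in donation rates can be attributed to the wording rather than to differences between recipients, which is what makes experimental data particularly useful for this kind of prediction.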
Arguably, this practice allowed the Democratic Party to get too good at eliciting money from its supporters. Key consultants and commentators have been arguing that this reduction of political supporters to their likelihood to donate has led the party organization to lose sight of the more meaningful aspects of political organizing and over time weakened the ability of the party to mobilize and count on the support of its volunteers.25 Despite considerable but isolated successes, quantification, modeling, and prediction might thus have come to weaken the organization in the mid- to long-term.26
An even more elaborate use of quantified representations and models is the use of simulations. Here, models are used to simulate systems, such as societies, traffic, or human movements in confined spaces. Simulations lie at the core of product development, architecture of public spaces, and design. They are also used in the social sciences in order to test dynamics, interaction patterns, and adaptation within complex social systems.27
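To give a flavor of how simulations are used to test dynamics in social systems, the sketch below implements a small threshold model of collective behavior in the style of Granovetter. This model is not discussed in the text above; it is chosen here only as a well-known toy example. Each person joins an activity once the number of people already participating reaches their personal threshold, and the thresholds used are invented.

```python
def run_threshold_model(thresholds):
    """Return the number of participants once nobody else wants to join.

    Each person joins as soon as the number of current participants
    is at least as large as their personal threshold.
    """
    active = 0
    while True:
        new_active = sum(1 for t in thresholds if t <= active)
        if new_active == active:  # stable state reached
            return active
        active = new_active

# Population 1: thresholds 0, 1, 2, ..., 99. Each joiner triggers the next one.
cascade = run_threshold_model(list(range(100)))

# Population 2: identical, except the person with threshold 1 now has threshold 2.
thresholds = list(range(100))
thresholds[1] = 2
stalled = run_threshold_model(thresholds)

print("participants in population 1:", cascade)
print("participants in population 2:", stalled)
```

Two almost identical populations end up with very different outcomes, which illustrates why simulations can reveal dynamics in complex systems that are hard to anticipate from the distribution of individual traits alone.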
Taking stock, model-based predictions allow for the development, testing, and use of new medicine and vaccines. They allow for the planting, raising, and harvesting of crops. They allow people and organizations to plan ahead and try to shape conditions according to their interests. In science they allow the formalization, communication, and testing of theories about the world and over time contribute to improved understanding and cumulative knowledge by exposing inaccurate or wrong ideas. Models, and quantification more broadly, are thus the basis of modern science, business, and life and contribute to our lives being less solitary, poor, nasty, brutish, and short. They allow the shaping of the future, not just mere exposure to it.
3.6 Privacy
The natural corollary to quantification and data-driven insight and capabilities is privacy. The simple formula is the more data, the greater the potential analytical insights, the greater the capacity of organizations, companies, or states to make profits, shape people’s option spaces, or the future. People’s interests and rights in keeping aspects of their lives, character traits, interests, and behavior private can appear as an annoying speed-bump on the road to greater capacity and profits. This is especially relevant in the collection and use of data documenting people’s uses of digital devices or services.
New capabilities in data collection, retention, and their uses raise hopes and desires within companies and governments for access to ever more data on ever more users on ever more aspects of the world. But here, interests between companies, governments, and users do not necessarily align and might in fact diverge. Greater capacity of companies and governments might run counter to the interests and rights of people who find themselves documented in data. While state regulators might seem a bastion of protection of people’s rights, it is not clear that their interests necessarily align with those of the people.
For example, Hersh (2015) has shown that in the US legislators have for a long time worked continuously toward extending the data government agencies are obliged to collect in voter files and to which they officially provide campaigns with access. Governments also have an interest in having access to more data on people, be it to the benefit of crime prosecution and prevention, service provision, or public health. The flip side of greater government access to data is of course greater control over citizens and the potential for repression, especially in autocratic regimes. But trade-offs between government access to data and people’s privacy should never be treated lightly, even in democracies. In short, government regulators are not necessarily natural allies of people in the protection of their rights to privacy.28
Privacy is an inherently contested concept. But the legal scholar Alan F. Westin provides a definition that focuses on the control of information. This definition provides a helpful basis for the discussion of tensions between privacy and availability of digital data.
“Privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.” (Westin, 1967, p. 7)
Digital technology provides unprecedented opportunities for collecting, aggregating, processing, and disseminating information. But these opportunities mainly hold for companies collecting and disseminating data, not for the people whom they document. In fact, these opportunities often directly and negatively impact the ability of people to control the collection, use, and dissemination of information about them. The legal scholar Daniel J. Solove has proposed a taxonomy that is helpful in mapping these threats.29
Solove points to a set of potential privacy threats connected with the collection, processing, and dissemination of information and with invasion of privacy, all of which feature strongly in the discussion of digital data. Most people are unaware of the data that are collected about them, their behavior, and their contacts when using digital devices or services or by being wittingly or unwittingly documented by digitally-enabled sensors. They also have little idea of, never mind control over, how the collected data are processed by data collectors or owners. Data processing can include aggregation of data on people across different sources, the identification of users or their traits by probabilistic inference and associated labeling, or subsequent secondary uses either by the companies collecting the data or those buying access to them. People also have no control over whom companies provide with access to their data, be it intentionally or unintentionally. Also, collecting data on people and the data-driven inference of traits and interests might allow companies to influence decision making by people using devices or services to further their own interests over those of their users. The collection, processing, and retention of data with digital technology poses significant threats to the control anyone can hope to have over information pertaining to them.
The by now infamous case of Cambridge Analytica provides a perfect example of the threats sketched by Solove. In the run-up to the 2016 US presidential election the English campaign consultancy Cambridge Analytica claimed to be able to identify voters’ psychological traits based on their Facebook activity and to be able to target them with content optimized to get them to act according to a campaign’s goals. After Donald Trump’s win in 2016, the company claimed this as proof of the success of its approach. By now it seems pretty clear that Cambridge Analytica was not central to the Trump campaign30 and that the approach advocated by the company was highly unlikely to bring the claimed effects.31 But the story of Cambridge Analytica is still instructive as a case in which nearly all of Solove’s threats to information privacy come true.32
First, there is the question of data collection. Who collects what data? In the case of Cambridge Analytica, this starts in 2014. A researcher affiliated with a research lab at the University of Cambridge had developed a Facebook app allowing people to test their personality traits. But the researcher did not only collect the responses to the quiz. He used the app to pull additional information through the Facebook application programming interface (API) on the participants and their Facebook contacts. He then linked this information to the responses to the personality quiz and stored it. At that point people had lost control over the information collected about them.
The second threat comes from information processing. Again, this case is instructive. The researcher linked signals from the information collected on Facebook to the responses of participants to his personality quiz. He and the people at Cambridge Analytica were convinced that they now had a model allowing them to predict personality traits based on Facebook data. Again, these claims are dubious and highly contested. But for the violation of privacy, the actual validity of the results of data processing does not matter; it only matters that the data are used in ways out of the control of people documented by them.
The third threat is information dissemination. In this case, this happened when the Cambridge researcher provided the company Cambridge Analytica with access to the data collected by his app and the models based on them.
Finally, by trying to influence people in their voting behavior based on models developed on data collected for another purpose Cambridge Analytica also hits the fourth threat: invasion. By actively shaping information environments of people based on data documenting traits and interests, the company interferes with their decision making.
When using digital information systems like Facebook, Google, TikTok, YouTube, or X, users cannot expect to be in control of the information posted in these systems and produced by their use. Instead, the companies running these systems can and will decide how to use this information and which third parties they provide with access to the data or its processed derivatives. Digital information systems thereby inherently challenge the notion of privacy referenced above.
People, their contributions, behavior and interactions, are tracked in digital information systems and the resulting data are stored and later used. This could be for commercial, political, security, or scientific purposes. Most people will be unaware of these subsequent uses when using digital media. Add to this that many successful digital business models depend on the targeted provision of ads based on users’ traits and interests. While much is unknown about the mechanics behind these models, they rely on broad data collection and processing without necessarily securing informed consent of people documented by that data.33 There is an inherent tension between the interests of companies providing digital services and the rights to privacy of their customers and users.
Accordingly, designing meaningful privacy for these data is difficult. This has inspired much debate and controversy. In response to these issues, the European Union (EU) implemented in 2018 The General Data Protection Regulation (GDPR). The GDPR advances strict limits to the uses of data collected by companies and has influenced other international data regulation. But privacy concerns remain.34
Even if data are used appropriately and according to the wishes of the people they are documenting, privacy remains an issue. For one, there is the question of data ownership. This has come up repeatedly in the context of US campaigns. Imagine you support a candidate during the primaries of your party but ultimately the candidate has to drop out. What happens to the data the candidate’s campaign collected on its supporters, such as e-mail addresses, donation behavior, issue preferences, or contact information? Does the candidate or the contractor delete the data once the run is through? Or do they retain the data? If so, for how long? Or do they decide to throw their lot in with another candidate and transfer the data to them? These are questions that came to matter for the Democratic Party once it had to be decided who could gain access the extensive data the Obama campaigns collected after his last successful race.35
Questions of data privacy do not become easier through an inherent tension between privacy and analytical capacity. While it is easy to prioritize people’s privacy over companies’ increased capabilities to roll out ads, this becomes harder once the trade-offs are between users’ privacy and more important analytical goals. This includes crime prevention or public health. For example during the recent Corona pandemic there were various attempts to use digital media, apps, and trace data to track the spread of the epidemic and infection chains. While promising from the perspective of increasing state-capacity to fight the pandemic, pursuing these options to the full would severely impact people’s privacy by for example tracking their movement and contacts. These concerns needed to be accounted for in the development and deployment of tracking apps and the official use of other available data sources, potentially leading to a loss in functionality and thereby weakened support in the fight against the pandemic.36
Of course states are not only interested in access to data during times of public health crises. Famously, the NSA-spying scandal, following the revelations of US military contractor Edward Snowden in 2013, showed the degree to which the US government was accessing data collected by companies providing digital information systems to users all over the world.37 Even more extensive are the efforts of the Chinese government to collect information about its people and to shape and sanction their behavior. The Chinese Social Credit System is much discussed, although its workings and effects are very difficult to assess confidently. This makes it hard to differentiate between public facing claims and actual workings of social control through quantification and data, leading to a likely overestimation of the capabilities of the system.38
The perceived power of data for surveillance and social control has also given rise to fears of data-enable espionage and foreign influence. This can include new forms of signal intelligence, were states try an capture digital communication of other states or foreign nationals. More contentious are suspicions of espionage and foreign influence enabled by the provision of digital infrastructure through foreign firms. This could recently been seen in concerns voiced in the US of the emerging dependencies of allied states on communication infrastructure provided by the Chinese company Huawei and the growing popularity of the Chinese social media platform TikTok in the West. Reliance on these structures is seen to provide the Chinese state with a backdoor to the information flows and public discussions in other countries. With mounting geopolitical tensions, this is seen to provide data-driven risks.39
The opportunities for quantification and data-enabled insights provided by digital technology are overshadowed by the fact that they make people’s control of information about them increasingly untenable. Accordingly, various scholars such as Spiros Simitis, Helen Nissenbaum, and Julie E. Cohen have argued in favor of extending the discussion of privacy away from the individual and toward the design of the information systems people are embedded in and covered by.40 This perspective will come to matter again in the discussion of platform business models and companies and artificial intelligence. The scientific study and design of data collection and use practices and protocols clearly also has to consider privacy concerns on both the individual and the systems level.
3.8 Further reading
For an instructive popular account of using data to learn about the world see Hand (2007).
For a helpful overview of different sociological approaches to quantification and its consequences see Mennicken & Espeland (2019).
For a sociological account of how the state and other central authorities make the world legible and increase their level of control over it see Scott (1998).
For a broad account of the role of models in the social sciences see Page (2018).
For more on the statistics models are based on see McElreath (2020).
For more on privacy see Solove (2008).
3.9 Review questions
- Provide definitions of the terms:
- Data
- Quantification
- Big Data
- Digital trace data
- Bias in data sets
- Measurement
- Representational measurement
- Pragmatic measurement
- Public opinion
- Metrics
- Model
- Privacy
Please discuss how big data can contribute to the quantification of social and political life. Discuss the potential and limitations of at least two specific sources of big data.
Please select a target concept of interest that calls for pragmatic measurement. Then choose a source of digital trace data and discuss how the available signals allow – or do not allow – for the measurement of the target concept.
Discuss how models allow organizations and people to shape the future.
Using the framework provided by Solove (2008), discuss the different ways digital technology constitutes a threat to people’s privacy.
This definition is based on Hand (2007), p. 7 and Hand et al. (2001), p. 25.↩︎
For a popular account of how quantification allows us to learn about the world see Hand (2007). For a popular account of quantification and related issues from a sociological perspective, see Mau (2017/2019). For an academic review of associated sociological research see Mennicken & Espeland (2019).↩︎
For the classic development of this argument see Foucault (1966/1994). For an account of the history of quantification see Porter (1995/2020). For a sociology of classification and its consequences see Bowker & Star (1999).↩︎
For a measured discussion of big data see Holmes (2017). For the uses of big data in the social sciences see Schroeder (2014).↩︎
For a foundational discussion of how to interpret digital data traces see Howison et al. (2011). For a discussion of the mediation process translating phenomena, behavior, and attitudes into digital data traces see Jungherr et al. (2016).↩︎
For discussions of research with big data see Salganik (2018). For a discussion of how to account for the limits of digital trace data in the social sciences see Jungherr (2019).↩︎
For critical discussions of data sets in artificial intelligence and algorithmic decision making see Barocas & Selbst (2016), Bolukbasi et al. (2016), Buolamwini & Gebru (2018), Mayson (2019).↩︎
For a comprehensive discussion of measurement in various scientific fields see Hand (2004).↩︎
For two informative accounts about the historical origins of measurement see Crosby (1996), Deringer (2018).↩︎
For a history of ideas of libertarianism see Zwolinski & Tomasi (2023).↩︎
For a proposition of how to measure individuals’ alignment with libertarianism through survey responses see Iyer et al. (2012).↩︎
For more on public opinion, its history, and uses see Herbst (1993); S. Igo (2007).↩︎
For a discussion of measurement validity see Hand (2004), p. 131–134.↩︎
For more on the limits of self-reports in media research see Prior (2009).↩︎
For a discussion of measurement reliability see Hand (2004), p. 134–145.↩︎
For more on the practice of digitally assisted door to door campaigning see Nielsen (2012). For more on the development and maintenance of digital campaign technology see Kreiss (2012) and Kreiss (2016).↩︎
For Taylorist “scientific management” see Taylor (1911). For metrics in the Vietnam War see Halberstam (1972). For metrics based management see Wooldridge (2011). For “new public management” see Pollitt & Bouckaert (2017). For “objectives and key results (OKRs)” see Doerr (2018).↩︎
For a spirited critique of metrics-based management see Muller (2018). For reactivity to metrics by those measured see Espeland & Sauder (2007).↩︎
For the use of metrics by news media see Christin (2020). For the use of metrics in policing see Ferguson (2017). For the use of metrics in political activism see Karpf (2016). These studies are not only helpful in learning about these specific cases. They are also helpful by providing templates of how to approach the use of metrics in organizations and associated effects scientifically.↩︎
On the workings and uses of recommender systems see Narayanan (2023).↩︎
See Sifry (2023) for a critical reflection on the long-term effects of quantified campaign tactics by the Democratic Party.↩︎
For more on data driven predictions in the Obama campaigns see Hersh (2015) and Nickerson & Rogers (2014). For a discussion of data-driven campaigning in Germany see Jungherr (2016).↩︎
For introductions to the use of simulation within the social sciences see Miller & Page (2007) and Epstein (2006).↩︎
On the larger history of privacy and increased in the US see S. E. Igo (2018). For a discussion of how politicians actively expand the information collected in official voter files and the access of campaign organizations to them see Hersh (2015). For a critical discussion of how privacy rights have over time been framed as limits to innovation and state capacity see J. E. Cohen (2013).↩︎
For more on the Cambridge Analytica scandal see Kroll (2018).↩︎
For more on the limited effects of the psychometric targeting advocated by Cambridge Analytica see Hersh (2018).↩︎
For more on the digital ad business see Auletta (2018); Crain (2021). For a critical take on the efficiency of these models see Hwang (2020).↩︎
For more on the GDPR and its international influence see Bradford (2020). For more on the mutual influences and dependencies between US and EU in data regulation see Farrell & Newman (2019).↩︎
For more on the subsequent uses of Obama’s data see Meckler (2012) and Timberg & Gardner (2012).↩︎
For a general discussion of privacy and medical data see Price & Cohen (2019). For privacy in Covid-tracking approaches specifically see I. G. Cohen et al. (2020).↩︎
For more on China’s Social Credit System see Brussee (2023); Creemers (2018); Knight & Creemers (2021).↩︎
For a background on classic and current forms of foreign influence see Rid (2020). For more on tensions regarding reliance on Huawei see Segal (2021).↩︎
See J. E. Cohen (2012); J. E. Cohen (2019); Nissenbaum (2009); Simitis (1987); Simitis (1995).↩︎