3.1. Data: Measurement, quantification, and control

3.1.1. Data and measurement

Data is a broad term. It refers to collections of true statements and observations about the world. These can be lists of items and their availability at a given point in time, such as lists of the amount of grain stored in the capital of ancient Sumer. It can also be systematic observations about the world, for example in scientific journals or field notes. Data can also be lists of instances of specific phenomena next to observations of co-occurring events, locations, or phenomena, for example John Snow's famous data on the 1854 London cholera epidemic.

While the term data clearly can refer to qualitative observations, from the Enlightenment onward it has been used with a strong bent toward the numerical. Translating observations into numbers allowed scientists to deploy mathematics to identify systematic structures and patterns within observations about the world that would otherwise have remained hidden.

Again, the example of John Snow is instructive. By collecting observations about where victims of the 1854 London cholera epidemic lived, he was able to statistically identify the water supply through local wells as the transmitter of the disease. Before he systematically collected information about the victims and their characteristics - such as the location of their living quarters and, subsequently, their shared supplies of water - and made it available for statistical analysis, these connections remained invisible.

By translating observations into numbers, making them quantifiable, data provide the opportunity for new and important insights about the world and the causes of events and phenomena. But as with any translation, making phenomena countable also means losing some of their features. Quantification therefore makes some things visible while hiding others. This makes measurement, the translation of phenomena, events, behavior, or concepts into numbers, a crucial step in quantification and science more broadly. While on the face of it measurement seems straightforward, it is in fact a difficult practical and conceptual task, especially in the social sciences.

David J. Hand distinguishes two important aspects of measurement:

"Extreme representational measurement involved establishing a mapping from objects and their relationships to numbers and their relationships. Pragmatic measurement involved devising a measurement procedure which captured the essence of the characteristic of interest, so that pragmatic measurement simultaneously defined and measured the characteristic. We could almost say that representational measurement is based on modelling observed empirical relationships, while pragmatic measurement is based on constructing attributes of interest. Most measurement procedures have both representational and pragmatic aspects."

[Hand, 2016], p. 17.

What Hand calls representational measurement focuses on representing objects and their relationships in numbers. This is the easier task in measurement. An example of representational measurement in politics is the counting of votes for parties in specific voting districts. The objects measured in this example are votes. The measured relationships could be votes by party or votes by party per district. This is relatively straightforward.
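To make the distinction concrete, consider a minimal sketch in Python of representational measurement. All ballots, districts, and parties below are invented for illustration:

```python
# Representational measurement: mapping objects (individual votes) and
# their relationships onto numbers. All data here are hypothetical.
from collections import Counter

ballots = [
    {"district": "North", "party": "A"},
    {"district": "North", "party": "B"},
    {"district": "North", "party": "A"},
    {"district": "South", "party": "B"},
    {"district": "South", "party": "B"},
]

# Votes by party: one numerical relationship over the measured objects.
votes_by_party = Counter(ballot["party"] for ballot in ballots)

# Votes by party per district: a finer-grained relationship.
votes_by_district_party = Counter(
    (ballot["district"], ballot["party"]) for ballot in ballots
)

print(votes_by_party)
# Counter({'B': 3, 'A': 2})
print(votes_by_district_party)
# Counter({('North', 'A'): 2, ('South', 'B'): 2, ('North', 'B'): 1})
```

The mapping is direct: each vote is an observable object, and counting preserves the relationships between the objects.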

Things become more difficult in the measurement of latent concepts, such as attitudes or personality traits. This requires what Hand calls pragmatic measurement. In cases like this, scientists first have to decide how the concept of interest manifests in measurable signals. In political psychology, this usually leads to the construction of question sets that indirectly measure the concept of interest. This makes measurement indirect and introduces potential errors when the signals used do not reliably or validly connect to the concept of interest. As we have already seen in the chapter on computational social science, the challenges of pragmatic measurement are especially grave once we try to infer concepts of interest from data we just happen to collect through digital media.
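As an illustration, here is a minimal sketch in Python of pragmatic measurement through a question set. The items, responses, and the concept of "political interest" are hypothetical; a real scale would require careful validation of reliability and validity:

```python
# Pragmatic measurement: a latent concept (here, a hypothetical measure
# of "political interest") is simultaneously defined and measured by the
# question set we construct. All items and responses are invented.

# Responses on a 1-5 Likert scale; the third item is worded in reverse.
responses = {
    "follows_news": 4,
    "discusses_politics": 5,
    "finds_politics_boring": 2,  # reverse-coded item
}

def scale_score(items: dict, reverse: set, scale_max: int = 5) -> float:
    """Average the items into one score, flipping reverse-coded items."""
    values = []
    for item, value in items.items():
        if item in reverse:
            value = scale_max + 1 - value  # a 2 on a 1-5 scale becomes 4
        values.append(value)
    return sum(values) / len(values)

score = scale_score(responses, reverse={"finds_politics_boring"})
print(round(score, 2))  # 4.33
```

Note that the construction of the scale itself defines what "political interest" means here: a different question set would define, and measure, a different concept.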

3.1.2. Big data

In the past, collecting data was difficult. People had to consciously go out, count, and record their objects of interest. In the social sciences, they had to run expensive surveys in which interviewers visited or called respondents and had them answer questionnaires. This made data expensive to collect. Digital technology changed this. As we have already discussed in the chapter on computational social science, one important promise of digital technology for social science has been the massive increase in available data sources, be it through the digitization of existing data sets, the passive collection of data through a multitude of digital sensors, or the automated logging of user behavior on digital services. Digital technology brought a new data abundance to society, business, and academia. Data suddenly became cheap. This created a lot of enthusiasm among business consultants, journalists, and some scientists about the supposed potentials of unknown data riches. The term big data became the focal point of this enthusiasm.

Originally, the term big data referred to data sets so large that they technically could not be held or processed in one database or on one machine. But soon it gained popularity as a term covering the new data riches provided through digital technology and the associated economic benefits. One of the earlier characterizations of big data was proposed by the business consultancy Meta Group and - after its subsequent acquisition by the consultancy Gartner - has become known as the Gartner definition:

""Big data" is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

[Laney, 2001].

Since the original formulation of this definition in 2001, many additional Vs have been suggested to cover new or unobserved characteristics of big data. But the original three Vs - volume, velocity, and variety - should suffice for our purposes. They illustrate the supposed promises of big data very clearly, while also pointing to one of the crucial shortcomings of subsequent efforts and debates. The definition betrays its origin in an interest in the technical issues arising from handling data sets made available through digital technology. Data come in great volume, surpassing the capabilities of standard computational set-ups and statistical methods. Data come in great velocity, on the one hand allowing for the real-time analysis of unfolding phenomena, but on the other posing challenges through shifting features within the data and the data-generating processes. Finally, data come in great variety, such as text, image, video, audio, or metadata. As a consequence, data need to be structured in order to allow for subsequent analysis.

The definition puts its focus clearly on the technical features of big data, and not on their characteristics as measurements of objects in the world. This can be forgiven, given the definition's origin with a technical consultancy trying to prepare its clients for future opportunities and challenges. Unfortunately, the same focus on technical features and disregard for questions related to measurement, especially what Hand calls pragmatic measurement, has dominated the subsequent debate and use of the term. This is much less forgivable.

Nearly all discussions of big data treat the associated data as true representations of whatever happens to be of interest to the speaker, be it buying intentions, psychological traits, or political affiliations. If we followed the big data boosters, we could expect to find anything and everything in big data, no matter what we are looking for. This is of course not the case. What goes for other types of data also holds for big data. In translating objects into measurements, there is a construction step involved, creatively translating objects of interest into signals found in data. This interpretative step is just as important in the age of big data as before. Arguably, it is even more important, since people now have access to data not primarily collected with their research goal in mind. Mobilizing the potential of these found data for research demands very active and creative steps in pragmatic measurement.

It is somewhat ironic, then, to find the people with the greatest access to data in history to be predominantly uninterested in the necessary steps of measurement pragmatics, in other words, in reflecting on the translation necessary to map these data onto phenomena of interest and thereby make them fit for analysis. That said, this is beginning to change. With the growing awareness of biases in the data sets at the heart of artificial intelligence and algorithmic decision making, the data-generating processes, compositions, and inherent inequalities of underlying training data sets have come into the focus of researchers. Accordingly, questions of who gets counted by big data and what gets counted are receiving increasing attention. Still, there remains much to do to overcome the current naive positivism in work with big data.

3.1.3. Data and control

Data reduce the complexity of society and the world by structuring information in standardized categories of objects. Going further, through quantification, numerical data allow those collecting them to identify underlying but hidden patterns and connections between objects. Through this, data help societal institutions - such as the state, political parties, firms, or other organizations - to make sense of the world and increase their level of control.

The sociologist James R. Beniger has characterized control in societies through central institutions as follows:

"Here the word control represents its most general definition, purposive influence toward a predetermined goal. (...) influence of one agent over another, meaning that the former causes changes in the behavior of the latter; and purpose, in the sense that influence is directed toward some prior goal of the controlling agent. (...) control encompasses the entire range from absolute control to the weakest and most probabilistic form, that is, any purposive influence on behavior, however slight."

[Beniger, 1989], p. 7f.

In order to exercise control, institutions in society need to be able to read the world. They need to understand developments, outcomes of interest, and contributing factors, and they need to know how they can shape outcomes in their interest. For this, institutions need data. By reducing the complexity of society and the world, data make the world legible to institutions and thereby provide them with the opportunity for intervention.

The political scientist James C. Scott identifies "legibility as a central problem in statecraft" [Scott, 1998], p. 2:

"(...) much of early modern European statecraft seemed similarly devoted to rationalizing and standardizing what was a social hieroglyph into a legible and administratively more convenient format. The social simplifications thus introduced not only permitted a more finely tuned system of taxation and conscription but also greatly enhanced state capacity. They made possible quite discriminating interventions of every kind, such as public-health measures, political surveillance, and relief for the poor."

[Scott, 1998], p. 3.

Data provide standardized representations of reality, allowing institutions to make sense of it, assess problems, intervene as they see fit, and evaluate their success. But this process of making the world legible also means engaging in what Hand called pragmatic measurement. Institutions have to actively and creatively translate the aspects of reality that appear important to them, or that signal underlying phenomena they want to track or influence, into standardized and countable objects. This is a process of translation and abstraction in which meaning and details are lost. In this translation and the associated loss of detail lies the risk of losing important insights. Still, some sort of reduction is necessary for institutions to read and interact with society and the world. Scott illustrates this in his discussion of the work of state officials:

"Officials of the modern state (...) assess the life of their society by a series of typifications that are always some distance from the full reality these abstractions are meant to capture. (...) The functionary of any large organization "sees" the human activity that is of interest to him largely through the simplified approximations of documents and statistics (...). These typifications are indispensable to statecraft. State simplifications such as maps, censuses, cadastral lists, and standard units of measurement represent techniques for grasping a large and complex reality; in order for officials to be able to comprehend aspects of the ensemble, that complex reality must be reduced to schematic categories. The only way to accomplish this is to reduce an infinite array of detail to a set of categories that will facilitate summary descriptions, comparisons, and aggregation."

[Scott, 1998], pp. 76-77.

But for this to work, the translation of society and the world into categories and numbers - quantification - must cover the actual objects of interest and contributing elements. This makes it not an exercise in simple representational measurement but instead in pragmatic measurement, foregrounding the challenges associated with the latter. This goes double for the question of whether targeted interventions are successful or not. The answer depends strongly on the appropriateness of the pragmatic measurement process underlying the quantification of the world in data. If the map of the territory is wrong or inappropriate, navigating by it will not lead you to your destination.

The expectation of increased opportunities for control by institutions explains the widespread excitement about big data and its supposed powers. An increase in data means an increase in quantification, which in turn supposedly means an increase in control for central institutions, in other words, an increase in their influence over other agents or processes in order to shape them according to their purpose. We find these expectations of control most pronounced in accounts of the role of data in business and management and of their supposed power to allow the manipulation of people through information, especially ads.

3.1.4. Metrics

Business and management try to increase efficiency and productivity through metrics. They break down their production process into small, identifiable steps and track the achievement of these steps together with inputs and outputs. If done well, this allows for the detailed monitoring of the production process, the identification of inefficiencies, the design of interventions to improve efficiency or productivity, and central control over the production process. Metrics, the result of pragmatic measurement, are a crucial element in this process.

The idea of metric-based management goes back at least to the Taylorist approach of "scientific management". In the early nineteen hundreds, the engineer Frederick Winslow Taylor advocated the calculation of standard levels of output for each job contributing to factory outputs. Workers who hit these metrics or outperformed them were financially rewarded, while those who fell behind were paid less. These ideas became very popular. They were ported by US defense secretary Robert McNamara to quantify and track progress during the Vietnam War, they inspired metrics-based management, they moved into the public service sector under the term "new public management", and they recently experienced a revival in Silicon Valley tech companies under the term "objectives and key results" (OKRs).

The idea behind these schemes is usually the same: define a set of key steps important in the pursuit of an organization's goals and track their achievement over time. By tracking inputs - such as raw material or working hours - and outputs - such as units produced - managers can supervise the process, incentivize the behavior of workers, or look for hidden inefficiencies. At least that's the promise.
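What such metric tracking might look like is sketched below in Python; the production steps and numbers are invented for illustration:

```python
# Metric-based tracking: inputs (working hours) and outputs (units
# produced) per production step, and a derived efficiency metric.
# All steps and numbers are hypothetical.

steps = [
    {"step": "assembly",  "hours": 120.0, "units": 300},
    {"step": "packaging", "hours": 40.0,  "units": 300},
    {"step": "shipping",  "hours": 60.0,  "units": 290},
]

for step in steps:
    # Units produced per working hour: the metric management optimizes.
    step["units_per_hour"] = step["units"] / step["hours"]
    print(f"{step['step']}: {step['units_per_hour']:.2f} units/hour")

# Blind spot: the metric covers only what was chosen for measurement.
# Quality, rework, or worker strain remain invisible unless they are
# quantified separately.
```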

In practice, management by metrics depends on the metrics suitably covering the relevant inputs and outputs. This sounds trivial, but often what is easy to measure is not necessarily relevant, while what is relevant is not necessarily easy to measure. By failing in the pragmatic measurement process, by misconstructing the measured signals, metrics-based management can lead an organization to focus on achieving metrics instead of pursuing what is necessary to produce its outputs. The fundamental problems in the measurement and quantification of society and the world therefore also hold in these cases.

Through the greater availability and pervasiveness of data enabled by digital technology, the reach of metrics has extended. This is true for established businesses that now look to digital metrics to assess their success or failure, such as news media assessing the success of articles or journalists by the number of views or interactions they generate in digital communication environments. It is also true for actors who until now had no access to data documenting their successes or failures in real time, such as politicians or micro-celebrities. These actors now find many digital metrics available to them, supposedly tracking their fates while providing them with opportunities for interventions targeted at improving them.

Through a multitude of publicly visible or private metrics, digital technology provides many actors with new opportunities for making legible the environments in which they pursue their goals. Unfortunately, because of their simplicity and instant availability, these metrics are often not critically interrogated as to whether they actually speak to the goals of the actors using them.

For example, while a politician can easily track likes on Facebook, it is much harder to determine whether these likes actually correspond with the preferences of her constituents. So, in the worst case, she might optimize her positions and rhetoric for the audience of her Facebook posts while losing sight of the preferences of the people voting for her.
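One way to interrogate such a metric critically, sketched here with invented numbers, is to compare it against an independent measurement of the actual goal, for example survey-based constituent support for the same policy positions:

```python
# A sketch of interrogating a metric: do Facebook likes per policy
# position track survey-based constituent support? All numbers are
# invented for illustration. Requires Python 3.10+ for correlation().
from statistics import correlation

positions = ["tax reform", "transit", "housing", "schools"]
facebook_likes = [850, 120, 400, 90]           # likes per position post
constituent_support = [0.35, 0.60, 0.55, 0.70] # survey share in favor

# A low or negative correlation warns that the easily available metric
# (likes) does not speak to the actual goal (voter preferences).
r = correlation(facebook_likes, constituent_support)
print(f"correlation between likes and support: {r:.2f}")
```

In this invented example the correlation is negative: optimizing positions for likes would move the politician away from her constituents.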

If not used consciously and critically, metrics can easily create a quantified cage for actors relying on them. By making it easy to optimize toward them, metrics can provide the illusion of control to central authorities using them to steer an organization. But while an organization might do well according to the abstract quantification of the world - metrics - its actual fate might be much less benign. As Scott [1998] points out, while an abstract map is necessary for the centralized control of complex systems, the same map can mislead controlling units, and navigating by it can lead to failures small and large. This is just as true for metrics in the age of big data as for those that came before.

But sometimes, it is not just about people adjusting their behavior to hit metrics. Sometimes, the system automatically adjusts for them. This is where algorithms come into play.