2.3. Text analysis: Typical approaches in computational social science 1
2.3.1. Text analysis in political science
For general accounts of computationally-assisted text analysis in the political sciences see Benoit; Grimmer, Roberts, and Stewart. For the collection and analysis of parliamentary debates see Schwalbach and Rauh. For text analysis and event detection see Beieler, Brandt, Halterman, Schrodt, and Simpson. For text analysis of news coverage see Barberá, Boydstun, Linn, McMahon, and Nagler. For text analysis in literature or philosophy corpora see Piper; Underwood.
One approach typically associated with computational social science that is used very prominently in political science is text analysis. One reason for this is the long tradition of using text corpora in various subfields of political science, going back thirty years or more. Examples include the measurement of the positions and preferences of parties based on manifestos, the analysis of the positions and tactics of parties and politicians based on parliamentary debates, the automated identification of events and actors in text corpora for the development of event and actor databases in international relations and conflict studies, and the analysis of news coverage in agenda research and discourse studies. These subfields identified early on the potential of computational approaches for pursuing their research goals with large text corpora. Accordingly, they have developed rich traditions and practices in working with text. These include standardized approaches to the collection, preparation, analysis, reporting, and provision of large text corpora, starting with human-centered approaches and today extending to deep reflection on the uses of and standards for computational methods.
Text is a rich medium reflecting cultural and political concepts, ideas, agendas, tactics, events, and power structures of the time. Large text corpora open windows and allow comparisons across countries, cultures, and time. Different types of text contain representations of different slices of culture, politics, and social life. They are therefore of interest to various subfields in the social sciences and humanities.
For example, political texts (such as parliamentary speeches or party manifestos) express the political ideas of the time and the chosen tactics of politicians or parties. At the same time, they are public performances of these ideas and tactics. They are therefore the product of choices by political actors about how they want to be seen by journalists, their constituents, party members, or the public at large. This makes them not direct expressions of actors' true nature or positions but expressions mediated through the conventions associated with the type of text under examination. These and other potentially relevant mediating factors need to be accounted for in subsequent analyses and in the interpretation of any identified patterns. As long as researchers remain aware of these mediating factors, the analysis of corpora of political texts across countries and over time allows for the identification of various interesting political phenomena. These include shifts in the positions of parties, the introduction of new ideas into political discourse, agenda shifts, and tactical choices in language or rhetoric. This, together with their relatively codified and structured format, makes political texts popular in political science.
Since news content is protected under copyright law, establishing big publicly available text corpora of news coverage is difficult. Luckily, various news organizations have started providing standardized access to their archives, which enables the reliable collection and analysis of their respective coverage. See, for example, the New York Times (https://developer.nytimes.com). Content of other news organizations can be accessed through third-party providers or specific licensing deals. For event data see https://www.gdeltproject.org. For an overview of agenda setting research see McCombs and Valenzuela. For examples of discourse analyses based on news coverage see Baumgartner, Boef, and Boydstun; Benson; Entman; Ferree, Gamson, Gerhards, and Rucht.
Another important text source in political science is news text. News contains a variety of signals that are of great interest to social scientists. This includes the reporting of key political events and figures in politics and society. News covers the what, when, where, who, and sometimes even why of important political or societal events. Extracting these features from journalistic accounts allows the establishment of standardized, large-scale databases of international events and actors. Approaches like these have been used successfully in conflict studies. News texts are also a prominent basis for the analysis of political agenda setting and agenda shifts. By identifying the frequency and timing of the coverage of selected topics, researchers can identify the relative importance of events in press coverage and compare it with their importance in political speech, public opinion surveys, or digital communication environments. Finally, news coverage also allows for the analysis of discourse dynamics over time. How are currently important topics discussed in the media, which aspects do different sides emphasize, what are the arguments, and which prominent speakers are given a voice in the media? These questions can be answered through the analysis of news coverage and provide important insights into the way societies negotiate contested topics, such as foreign policy, immigration, or reproductive rights.
By collecting large text corpora and preparing them for analysis, scholars can therefore access and make available vast troves of knowledge on various questions and in different subfields. The tremendous collective efforts in digitizing text corpora and making them available are a massive accelerating factor in this effort.
For an overview of computer-assisted text analysis in political science see Grimmer, Roberts, and Stewart. For practical advice on how to do computational text analysis in R see Hvitfeldt; Silge and Robinson. For quanteda (https://quanteda.io), an R package highly popular with political scientists, see Benoit, Watanabe, Wang, Nulty, Obeng, Müller, and Matsuo. For Python see Bengfort, Bilbro, and Ojeda; Lane and Dyshel.
To get a better sense of computer-assisted text analysis in action, let's have a look at three recent studies, using different text types and different methods in answering their respective questions.
2.3.2. Making sense of party competition during the 2015 refugee crisis with a bag of words
See Gessler and Hunger.
In their 2021 article "How the refugee crisis and radical right parties shape party competition on immigration" Theresa Gessler and Sophia Hunger study a corpus of 120,000 press releases by parties from Austria, Germany, and Switzerland. The authors are interested in whether parties changed the emphasis they placed on the topic of immigration and their position on immigration in their press releases between 2013 and 2017. They ask whether the attention parties paid to immigration in their press releases followed a long-term trend of politicization driven by the emergence of far-right parties, or whether attention shifts were instead driven by the heightened public attention to immigration during the events of 2015. With their study, they contribute to the scientific debate about party competition and agenda setting and position their findings with regard to theories in both areas. At the same time, they provide an instructive example of how to anchor an empirical study within theory, creatively establish and justify new data sources, and use an intuitive and comparatively accessible computational method in the analysis of text.
In order to answer their questions, the authors introduce a new data source: monthly press releases by parties. In their research design section, they justify this choice. The predominant data source for work in comparable fields is party manifestos. These have proved valuable in the study of political competition and agenda shifts, but their characteristics limit their applicability for the authors' purposes. Because they are published infrequently, following the electoral calendar, they lend themselves to the analysis of long-term trends but not to the identification of short-term shifts driven by current events and sudden changes in public opinion. Press releases, due to their higher frequency and their connection to current events, are more promising in this regard. This reasoning opens the intriguing possibility that the focus on long-term trends in theorizing about party positioning reflects not so much that the subject is primarily shaped by long-term trends but rather that the available data sources only allowed for the analysis of this type of question. Mobilizing new data sources thereby potentially reveals aspects of the phenomenon that remained invisible before.
To get a better sense of how to evaluate the quality of competing text analysis methods see the online appendix of Gessler and Hunger.
The authors collected 120,000 press releases from major parties in Austria, Germany, and Switzerland published between 2013 and 2018. To identify press releases referring to immigration, they developed a dictionary containing words referring to immigration and integration. To evaluate the performance of their dictionary, they hand-coded 750 randomly selected press releases and compared the quality of different dictionary approaches and a specifically trained support vector machine (SVM) classifier. Their dictionary outperformed other dictionaries in the identification of immigration-related press releases and performed about as well as the SVM. Accordingly, they chose their dictionary, which is computationally less demanding and easier to interpret than the SVM, to classify the remaining press releases. The share of press releases referring to immigration among all press releases in a given month allows the authors to measure the relative salience of the topic and its shifts over time at comparatively high temporal resolution.
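To make the logic of this step more tangible, here is a minimal sketch in Python. It is not the authors' code. The word list, the toy press releases, and the hand-coded labels are all invented for illustration, and the function names are hypothetical. The sketch flags a text as immigration-related if it contains at least one dictionary term, compares the result with hand-coded labels, and computes the monthly share of flagged texts as a salience measure.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary (an assumption for illustration only);
# the authors' actual dictionary is documented in their online appendix.
IMMIGRATION_TERMS = {"immigration", "asylum", "refugee", "refugees",
                     "migrant", "migrants", "integration"}


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def is_immigration_related(text, terms=IMMIGRATION_TERMS):
    """Flag a text if it contains at least one dictionary term."""
    return any(token in terms for token in tokenize(text))


def precision_recall(predicted, gold):
    """Compare dictionary predictions with hand-coded labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def monthly_salience(releases):
    """Share of flagged press releases per month ('YYYY-MM')."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for month, text in releases:
        total[month] += 1
        flagged[month] += is_immigration_related(text)
    return {month: flagged[month] / total[month] for month in total}


# Invented toy data: (month, text) pairs plus hand-coded labels.
releases = [
    ("2015-08", "New asylum rules debated in parliament"),
    ("2015-08", "Budget proposal for rural broadband"),
    ("2015-09", "Refugees arrive at the main station"),
    ("2015-09", "Party calls for faster integration courses"),
]
hand_coded = [True, False, True, True]

predicted = [is_immigration_related(text) for _, text in releases]
precision, recall = precision_recall(predicted, hand_coded)
salience = monthly_salience(releases)
print(precision, recall, salience)
```

A real application would of course need a carefully constructed dictionary and a much larger hand-coded validation sample, but the basic steps stay the same: classify, validate against human judgment, then aggregate over time.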
In order to identify the relative position of parties on immigration, the authors use Wordscores. First developed by Michael Laver, Kenneth Benoit, and John Garry, Wordscores tries to identify the relative positions of parties on a topic based on the similarity and distinctiveness of the words they use in text. The more similar the words, the more similar the positions. The more distinctive the words, the further apart the positions. Simplifying their approach somewhat, Wordscores allows Gessler and Hunger to identify whether parties meaningfully diverge from their original word use regarding immigration and whether, over time, their word use converges with or diverges from that of the radical right parties in their sample. They take this as a proxy for position shifts of parties on immigration, either accommodating or confronting the positions of the radical right.
Using these approaches, Gessler and Hunger find that, controlling for other factors, mainstream parties during the refugee crisis appear to have reacted to the greater attention paid to immigration by radical right parties by increasing their own attention to the topic in their press releases. But after the crisis subsided, their attention to the topic decreased back to its original level. In contrast, with regard to positions, almost none of the parties converged toward the positions taken by the far right.
With their study, Gessler and Hunger not only provide compelling evidence on political competition between European mainstream and radical right parties during the 2015 refugee crisis. They also show how mobilizing a new data source can provide new evidence, allowing new perspectives in the analysis of long-established subfields. By capitalizing on the greater temporal resolution provided by press releases, the authors open a window into short-term patterns in political competition, otherwise invisible to researchers depending on established data sources only available at much lower frequency. The study is also an interesting case of attempting to identify shifts in two latent concepts (i.e. relative topic salience and position of parties) based on the analysis of text.
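The basic idea behind Wordscores can be sketched in a few lines of Python. The sketch below follows my reading of the basic algorithm proposed by Laver, Benoit, and Garry: words are scored by how strongly they are associated with reference texts whose positions are assumed to be known, and a new text is scored by the frequency-weighted average of its word scores. It leaves out the rescaling and uncertainty estimates of the original approach and is not the pipeline used by Gessler and Hunger. The reference texts, their positions (-1 and 1), and the function names are invented for illustration.

```python
from collections import Counter


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Score each word by the positions of the reference texts using it.

    reference_texts: {name: list of tokens}
    reference_positions: {name: assumed known position as float}
    """
    freqs = {name: relative_freqs(tokens)
             for name, tokens in reference_texts.items()}
    vocabulary = set().union(*freqs.values())
    scores = {}
    for word in vocabulary:
        total = sum(f.get(word, 0.0) for f in freqs.values())
        # Probability of reading reference text r given the word,
        # weighted by the known position of r.
        scores[word] = sum(f.get(word, 0.0) / total * reference_positions[name]
                           for name, f in freqs.items())
    return scores


def text_score(tokens, scores):
    """Frequency-weighted mean score of the scored words in a new text."""
    scored = [token for token in tokens if token in scores]
    if not scored:
        return None  # no overlap with the reference vocabulary
    return sum(freq * scores[word]
               for word, freq in relative_freqs(scored).items())


# Invented toy reference texts with assumed positions.
reference_texts = {
    "left": "open borders welcome solidarity welcome".split(),
    "right": "closed borders control security control".split(),
}
reference_positions = {"left": -1.0, "right": 1.0}

scores = word_scores(reference_texts, reference_positions)
new_text = "welcome control control borders".split()
print(text_score(new_text, scores))
```

In this toy example, the word "borders" is used equally by both reference texts and therefore carries no positional information, while "welcome" and "control" pull a new text toward one side or the other.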
2.3.3. Who lives in the past, the present, or the future? A supervised learning approach
See Müller.
In his 2022 article "The Temporal Focus of Campaign Communication" Stefan Müller analyses the degree to which parties refer to the past, the present, and the future in their manifestos before upcoming elections. He anchors this question in voting behavior theory, which considers retrospective and prospective considerations of voters. Temporal considerations thus matter to voters in elections, but do they also matter for campaign communication by parties? To answer this question, Müller analyses 621 party manifestos published between 1949 and 2017 in nine countries. Unlike in the case of Gessler and Hunger, this time the established data source is clearly up to the task. Party manifestos are directly connected with elections and should reflect tactical considerations by parties regarding their self-presentation toward partisans, constituents, journalists, and the public at large.
For the Manifesto Corpus see Merz, Regel, and Lewandowski. For detailed information on the validation and comparison of the different classification approaches see the online appendix to Müller.
To answer his question, Müller collected all machine-readable manifestos in English or German from the Manifesto Corpus, leaving him with 621 manifestos from nine countries. He then had human coders label sentences as referring to the past, present, or future. Either through direct labelling or through the use of pre-labelled data sets, Müller ended up with an annotated sample of 5,858 English and 12,084 German sentences. This allowed him to train and validate several different computational approaches for the classification of the remaining sentences in the data set. He trained and validated a Support Vector Machine (SVM), a Multilayer Perceptron Network, and a Naive Bayes classifier. Since all classifiers performed comparably well, the author chose the SVM because it provided the best trade-off between performance and computational efficiency.
Through this approach, Müller finds that 54% of sentences refer to the future, 37% to the present, and 9% to the past. But there is some variation between countries. In general, though, it appears that incumbent parties focus somewhat more on the past than opposition parties. This makes sense considering the different roles of incumbent and opposition parties in political competition. Incumbents run at least in part on their supposedly positive record of the past, and opposition parties naturally challenge said record.
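To illustrate what training a supervised classifier on labelled sentences looks like, here is a minimal sketch in Python. Müller ultimately used an SVM. The sketch instead implements a simple multinomial Naive Bayes classifier with add-one smoothing, one of the classifier types he compared, because it can be written without external libraries. The training sentences, labels, and the class name are invented for illustration and have nothing to do with the actual annotated sample.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase a sentence and split it on whitespace."""
    return sentence.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {word for counts in self.word_counts.values()
                      for word in counts}
        return self

    def predict(self, sentence):
        n_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(count / n_docs)  # class prior
            for word in tokenize(sentence):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                logp += math.log((self.word_counts[label][word] + 1)
                                 / (total + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Invented toy training data standing in for human-labelled sentences.
train = [
    ("we will build new schools", "future"),
    ("we will lower taxes next year", "future"),
    ("we have built new schools", "past"),
    ("the government failed last year", "past"),
    ("families are struggling today", "present"),
    ("our economy is strong today", "present"),
]
sentences, labels = zip(*train)
model = NaiveBayes().fit(sentences, labels)

print(model.predict("we will invest next year"))
print(model.predict("the economy is strong today"))
```

In practice, you would hold back part of the labelled sentences, compare the classifier's predictions on them with the human labels, and only then apply the classifier to the unlabelled remainder of the corpus.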
For more on the Linguistic Inquiry and Word Count (LIWC) sentiment dictionary see Tausczik and Pennebaker.
To get a better sense of how parties refer to the past, present, or future, the author uses German and English versions of the Linguistic Inquiry and Word Count (LIWC) sentiment dictionary. The dictionary lists terms that for test-subjects carried positive or negative emotional associations. By calculating the emotional loading of words used in sentences referring to the past, present, or future, Müller infers whether parties spoke positively or negatively about different temporal targets.
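The combination of temporal labels and a sentiment dictionary can be sketched as follows. This is a simplified illustration and not Müller's implementation. LIWC is a licensed dictionary, so the two word lists below are invented stand-ins, and the scoring rule (positive minus negative words, divided by sentence length) is an assumption chosen for simplicity.

```python
from collections import defaultdict

# Invented stand-in word lists; the real LIWC dictionary is far larger.
POSITIVE = {"success", "strong", "improved", "hope"}
NEGATIVE = {"failed", "crisis", "broken", "worse"}


def sentiment(sentence):
    """(positive - negative word count) / number of tokens."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


def mean_sentiment_by_tense(labelled_sentences):
    """Average sentence sentiment per temporal label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sentence, tense in labelled_sentences:
        sums[tense] += sentiment(sentence)
        counts[tense] += 1
    return {tense: sums[tense] / counts[tense] for tense in sums}


# Invented toy sentences with temporal labels.
labelled = [
    ("the government failed", "past"),
    ("the crisis made things worse", "past"),
    ("we hope for strong growth", "future"),
]
by_tense = mean_sentiment_by_tense(labelled)
print(by_tense)
```

Grouping the same scores additionally by incumbent and opposition parties would yield the kind of comparison the study reports.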
Using this approach, Müller finds that opposition parties tend to speak more negatively about the past than incumbents. Again, this finding is in line with the different roles of incumbent and opposition parties in political competition.
By using a pre-existing data set, the Manifesto Corpus, and further annotating it, Müller can show that parties indeed use temporal references differently according to their structural roles in political competition. The paper offers an interesting example of the use and evaluation of various supervised computational classifiers in the analysis of large data sets, enabling classification efforts for data sets whose size would make manual classification infeasible.
2.3.4. Political innovation in the French Revolution
And now, let's have a little fun.
In their 2018 article "Individuals, institutions, and innovation in the debates of the French Revolution" Alexander T. J. Barron, Jenny Huang, Rebecca L. Spang, and Simon DeDeo present an analysis of speeches held during the first parliament of the French Revolution, the National Constituent Assembly (NCA), sitting from July 1789 to September 1791. They have access to a corpus provided by the French Revolution Digital Archive containing 40,000 speeches by roughly a thousand speakers. The speeches held during this time frame are of great interest not only to historians but also to parliamentary and democracy scholars, since they open a window into processes of epistemic and political sense-making and innovation within one of the first modern parliamentary bodies, which provided the template for many subsequent parliamentary institutions and for democratic discourse in general.
For latent Dirichlet allocation (LDA) see Blei, Ng, and Jordan.
Barron and colleagues approach the text corpus through the lens of information theory. They are interested in the determinants of the emergence of new ideas in parliamentary discourse and of their persistence. For this, they identify distinct word combinations through latent Dirichlet allocation (LDA). LDA is a popular automated approach for reducing the dimensionality of text to a set of automatically identified topics that are characterized by the frequent clustered occurrence of words. Barron and colleagues use LDA to identify clusters of co-occurring words, or topics, and assign them two metrics that they calculate based on ideas from information theory: novelty and transience.
For details on the operationalization of novelty and transience see the online appendix to Barron, Huang, Spang, and DeDeo. For details on the Kullback-Leibler Divergence (KLD) see Kullback and Leibler.
With novelty, the authors refer to the degree to which a distinct word pattern in a speech, as identified through LDA, differs from the patterns in prior speeches. The higher the novelty, the greater that difference. With transience, the authors refer to how the same word pattern differs from those in future speeches. The higher the transience, the greater that difference. To compute both metrics for each distinct word pattern, they calculate a measure called Kullback-Leibler Divergence (KLD). Doing so allows the authors to quantify two important but interpretatively demanding concepts: How frequently are new ideas introduced in parliamentary discourse and how long do they persist? Once you have quantified these features of distinct word packages, you can start looking for determinants of either outcome. Who is responsible for the introduction of new ideas? And what are the contextual conditions for these ideas to survive and thrive?
Barron and colleagues show that in general the National Constituent Assembly (NCA) was a parliamentary environment clearly open to the introduction of novel ideas, but many of these new ideas did not persist long. Still, many speeches were at once highly novel and only weakly transient, a condition the authors call resonance. Looking closely, the authors show that individuals differ with regard to their tendency to introduce new and resonant ideas. In fact, among the top 40 orators, high-novelty speakers are usually associated with the political left and the bourgeoisie. In contrast, low-novelty speakers are on the political right and belong to the nobility. Going into even further detail, the authors show that high-profile individuals can deviate from these patterns, such as the left-wing radicals Maximilien Robespierre and Jérôme Pétion de Villeneuve, whose speeches showed exceptionally high values of novelty and resonance. They consistently introduced new ideas in their speeches that were picked up by others and persisted over time. On the other side of the political spectrum, speakers like Jean-Siffrein Maury and Jacques de Cazalès exhibited low novelty and high resonance. The authors take this as evidence for their role in keeping the conversation in parliament coherent (low novelty) while at the same time being able to influence its future course (high resonance).
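A small Python sketch may help to see how these metrics work. It reflects my reading of the approach and is not the authors' code: each speech is represented by its topic mixture (a probability distribution over topics, as LDA would provide), novelty is the average KLD between a speech and each of the w speeches before it, and transience is the average KLD between a speech and each of the w speeches after it. Treating resonance as novelty minus transience is an assumption of this sketch, as are the window size, the toy topic mixtures, and the function names. For the exact operationalization, check the appendix of the paper.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0, which topic mixtures from
    a smoothed model such as LDA normally satisfy.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty(topics, j, w):
    """Mean divergence of speech j from the w speeches before it."""
    return sum(kld(topics[j], topics[j - d]) for d in range(1, w + 1)) / w


def transience(topics, j, w):
    """Mean divergence of speech j from the w speeches after it."""
    return sum(kld(topics[j], topics[j + d]) for d in range(1, w + 1)) / w


def resonance(topics, j, w):
    """Novelty minus transience (an assumption of this sketch)."""
    return novelty(topics, j, w) - transience(topics, j, w)


# Invented toy topic mixtures for five consecutive speeches over two topics.
# Speech 2 breaks with what came before, and later speeches follow its lead.
topics = [
    [0.9, 0.1],
    [0.9, 0.1],
    [0.5, 0.5],
    [0.5, 0.5],
    [0.5, 0.5],
]

# Valid indices are w <= j < len(topics) - w, so here only j = 2 for w = 2.
j, w = 2, 2
print(novelty(topics, j, w), transience(topics, j, w), resonance(topics, j, w))
```

In the toy data, the third speech differs from its predecessors (high novelty) but is identical to its successors (zero transience), which is exactly the pattern of a new idea that sticks.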
The authors point out that their results correspond with previous findings by historians taking more traditional analytical routes. But by translating meaningful and interpretatively demanding concepts into a small set of elegant quantitative metrics, Barron and colleagues provide a systems-level view of innovation and persistence in parliamentary debate. This further allows them to quantitatively identify the impact of features on individual, structural, and institutional levels for the introduction and subsequent fate of new ideas.
The study by Barron and colleagues shows powerfully how the use of computational methods combined with innovative theoretical concepts can open up new insights not only into our present digital life but also into the past. Studies like these are bound to grow in both frequency and resonance with the continued digitization of ever more historical data sets and archives, and they offer promising perspectives for interdisciplinary research.
Why not start by checking which other historical parliamentary records are available to you?
As with any set of examples, I could have chosen different studies that would have been just as interesting or would have provided insights into different approaches to text analysis. But already these three brief examples illustrate the breadth of available data sets, methods, and questions that computational text analysis can be employed to answer. It is no wonder, then, that this approach is a highly prominent pursuit in computational social science and beyond.
In reading studies like these, or in CSS in general, always make sure to check for online appendices. Often, that is where you find the actual information about the ins and outs of the analysis, sometimes instructions for replicating the reported findings, and in general a much better sense of how a specific method was implemented.