Social scientists are now in an era of data abundance, and machine learning tools are increasingly used to extract meaning from data sets both massive and small. We explain how the inclusion of machine learning in the social sciences requires us to rethink not only applications of machine learning methods but also best practices in the social sciences. In contrast to the traditional tasks for machine learning in computer science and statistics, when machine learning is applied to social scientific data, it is used to discover new concepts, measure the prevalence of those concepts, assess causal effects, and make predictions. The abundance of data and resources facilitates the move away from a deductive social science to a more sequential, interactive, and ultimately inductive approach to inference. We explain how an agnostic approach to machine learning methods focused on the social science tasks facilitates progress across a wide range of questions.
We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.
As with any high fashion, the beauty and horror of 'big data' is in the eye of the beholder. The question that prompted the present symposium ('Are formal theory, causal inference, and big data contradictory trends in political science?') is representative of the concerns that big data has raised in political science. Indeed, this is representative of discussions underway in every area of social science about how big data interacts with existing modes of inquiry as well as its potential benefits (Lazer et al. 2009; Varian 2014) and potential pitfalls (Boyd and Crawford 2012; Lazer et al. 2014). A review of these discussions does not yield any consensus on even what is meant by the term 'big data.' For us, the concept is broad and simultaneously captures several ideas. Taken together, these new types of and approaches to data are enabling new forms of data-intensive political science, some of which in isolation appear to challenge established models of inquiry in political science and science more generally. We argue, however, that none of this means that big data is fundamentally incompatible with formal theory, causal inference, or social science research methods in general. To the contrary, big data already is interacting with formal theoretic and causal inference approaches in ways that are not only consistent with these approaches but that also enhance them by enabling us to answer new questions. Perhaps more important, social science is beginning to shape the world of big data. Much of big data is social data, that is, data about the interactions of people: how they communicate, how they form relationships, how they come into conflict, and how they shape their future interactions through political and economic institutions. It is the responsibility of social scientists to assume their central place in the world of big data, to shape the questions we ask of big data, and to characterize what does and does not make for a convincing answer.
In the discussion that follows, we describe examples in political science in which big data helps us to (1) design better experiments, (2) make better comparisons between more precise populations of interest, and (3) observe theoretically relevant social and political behavior that previously was difficult to detect.
Crisis motivates people to track news closely, and this increased engagement can expose individuals to politically sensitive information unrelated to the initial crisis. We use the case of the COVID-19 outbreak in China to examine how crisis affects information seeking in countries that normally exert significant control over access to media. The crisis spurred censorship circumvention and access to international news and political content on websites blocked in China. Once individuals circumvented censorship, they not only received more information about the crisis itself but also accessed unrelated information that the regime has long censored. Using comparisons to democratic and other authoritarian countries also affected by early outbreaks, the findings suggest that people blocked from accessing information most of the time might disproportionately and collectively access that long-hidden information during a crisis. Evaluations resulting from this access, negative or positive for a government, might draw on both current events and censored history.
In: Political Analysis: PA; the official journal of the Society for Political Methodology and the Political Methodology Section of the American Political Science Association, Volume 23, Issue 2, pp. 254-277
Recent advances in research tools for the systematic analysis of textual data are enabling exciting new research throughout the social sciences. For comparative politics scholars, who are often interested in non-English and possibly multilingual textual datasets, these advances may be difficult to access. This article discusses practical issues that arise in the processing, management, translation, and analysis of textual data with a particular focus on how procedures differ across languages. These procedures are combined in two applied examples of automated text analysis using the recently introduced Structural Topic Model. We also show how the model can be used to analyze data that have been translated into a single language via machine translation tools. All the methods we describe here are implemented in open-source software packages available from the authors.
In this article, we study the political use of denial-of-service (DoS) attacks, a particular form of cyberattack that disables web services by flooding them with high levels of data traffic. We argue that websites in nondemocratic regimes should be especially prone to this type of attack, particularly around political focal points such as elections. This is due to two mechanisms: governments employ DoS attacks to censor regime-threatening information, while at the same time, activists use DoS attacks as a tool to publicly undermine the government's authority. We analyze these mechanisms by relying on measurements of DoS attacks based on large-scale Internet traffic data. Our results show that in authoritarian countries, elections indeed increase the number of DoS attacks. However, these attacks do not seem to be directed primarily against the country itself but rather against other states that serve as hosts for news websites from this country.