Audit experiments are used to measure discrimination in a large number of domains (Employment: Bertrand et al. (2004); Legislator responsiveness: Butler et al. (2011); Housing: Fang et al. (2018)). Audit studies share a common structure: they estimate the average difference in response rates depending on randomly varied characteristics (such as the race or gender) of a requester. Scholars conducting audit experiments often seek to extend their analyses beyond the effect on response to the effects on the quality of the response. Because response is itself a consequence of treatment, answering these important questions well is complicated by post-treatment bias (Montgomery et al., 2018). In this note, I consider a common form of post-treatment bias that occurs in audit experiments.
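To make the post-treatment problem concrete, the following minimal simulation sketch (hypothetical data and variable names, not drawn from any of the cited studies) shows how the difference in response rates is identified by randomization, while a naive comparison of response quality among responders only can be biased even when the true effect on quality is zero.

```python
# A minimal sketch, assuming hypothetical data: conditioning on response
# (a post-treatment variable) distorts comparisons of response quality.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                    # latent willingness to respond helpfully
z = rng.integers(0, 2, size=n)            # randomized requester characteristic
responds = (u - 0.5 * z + rng.normal(size=n)) > 0   # treatment lowers the response rate
quality = u + rng.normal(size=n)          # true treatment effect on quality is zero

# Difference in response rates: identified by randomization.
ate_response = responds[z == 1].mean() - responds[z == 0].mean()

# Naive "effect on quality": compares only responders, a post-treatment subset.
naive_quality = quality[(z == 1) & responds].mean() - quality[(z == 0) & responds].mean()

print(f"difference in response rates: {ate_response:+.3f}")   # negative, as designed
print(f"naive quality difference:     {naive_quality:+.3f}")  # nonzero despite a true effect of zero
```

Because treatment lowers the response rate, treated responders are positively selected on the latent type, so the naive quality comparison is biased away from zero even with no true effect on quality.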
To what extent do survey experimental treatment effect estimates generalize to other populations and contexts? Survey experiments conducted on convenience samples have often been criticized on the grounds that subjects are sufficiently different from the public at large to render the results of such experiments uninformative more broadly. In the presence of only limited treatment effect heterogeneity, however, such concerns may be allayed. I provide evidence from a series of 15 replication experiments that results derived from convenience samples like Amazon's Mechanical Turk are similar to those obtained from national samples. Either the treatments deployed in these experiments cause similar responses for many subject types, or convenience and national samples do not differ much with respect to treatment effect moderators. Using evidence of limited within-experiment heterogeneity, I show that the former is likely to be the case. Despite a wide diversity of background characteristics across samples, the effects uncovered in these experiments appear to be relatively homogeneous.
Explanations for the failure to predict Donald Trump's win in the 2016 Presidential election sometimes include the "Shy Trump Supporter" hypothesis, according to which some Trump supporters succumb to social desirability bias and hide their vote preference from pollsters. I evaluate this hypothesis by comparing direct question and list experimental estimates of Trump support in a nationally representative survey of 5,290 American adults fielded from September 2 to September 13, 2016. Of these, 32.5% report supporting Trump's candidacy. A list experiment conducted on the same respondents yields an estimate of 29.6%, suggesting that Trump's poll numbers were not artificially deflated by social desirability bias, as the list experiment estimate is actually lower than the direct question estimate. I further investigate differences across measurement modes for relevant demographic and political subgroups and find no evidence in support of the "Shy Trump Supporter" hypothesis.
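As a rough illustration of the two measurement modes compared above, here is a minimal sketch using simulated responses and the standard difference-in-means list-experiment estimator; the sample size and support rates are taken from the abstract, but the data, item counts, and variable names are hypothetical.

```python
# A minimal sketch, assuming hypothetical responses: direct question vs.
# difference-in-means list-experiment estimate of a sensitive item.
import numpy as np

rng = np.random.default_rng(1)
n = 5290

# Direct question: 1 if the respondent reports supporting Trump.
direct = rng.binomial(1, 0.325, size=n)

# List experiment: control sees 3 baseline items, treatment sees 3 + the sensitive item.
treat = rng.integers(0, 2, size=n)
baseline_count = rng.binomial(3, 0.5, size=n)     # count of baseline items endorsed
sensitive = rng.binomial(1, 0.296, size=n)        # latent support measured by the list
item_count = baseline_count + treat * sensitive   # respondents report only the total count

direct_estimate = direct.mean()
list_estimate = item_count[treat == 1].mean() - item_count[treat == 0].mean()

print(f"direct question estimate: {direct_estimate:.3f}")
print(f"list experiment estimate: {list_estimate:.3f}")
```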
In Coppock (2014), I presented a reanalysis of Butler and Nickerson (2011), a field experiment that tested the effects of providing state legislators with district-level public opinion data on their roll call votes for a bill. The reanalysis employed a method introduced by Bowers et al. (2013) and concluded that the Butler and Nickerson estimate of the total effect of treatment was biased downward; when spillovers were accounted for, the total effect of treatment was estimated to be nearly twice as large.
A field experiment carried out by Butler and Nickerson (Butler, D. M., and Nickerson, D. W. (2011). Can learning constituency opinion affect how legislators vote? Results from a field experiment. Quarterly Journal of Political Science 6, 55–83) shows that New Mexico legislators changed their voting decisions upon receiving reports of their constituents' preferences. The analysis of the experiment did not account for the possibility that legislators may share information, potentially resulting in spillover effects. Working within the analytic framework proposed by Bowers et al. (2013), I find evidence of spillovers, and present estimates of direct and indirect treatment effects. The total causal effect of the experimental intervention appears to be twice as large as originally reported.
We propose a framework for meta‐analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard, and when it fails, we can fill them in under best‐ and worst‐case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, eight of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
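The bounding logic can be sketched as follows for a binary outcome, using Manski-style extreme value bounds; the data, function name, and imputation inputs are hypothetical, and the sketch is not the paper's exact procedure. Missing potential outcomes are filled with the extreme values 0 and 1 unless an imputation is supplied, in which case the bounds shrink.

```python
# A minimal sketch, assuming a binary outcome and hypothetical data:
# extreme value bounds on the ATE, with optional imputed counterfactuals.
import numpy as np

def extreme_value_bounds(y, d, y0_imputed=None, y1_imputed=None):
    """Bounds on the ATE of binary d on binary y. y0_imputed / y1_imputed hold
    imputed missing potential outcomes (np.nan where no imputation is offered)."""
    y0_imputed = np.full_like(y, np.nan, dtype=float) if y0_imputed is None else y0_imputed
    y1_imputed = np.full_like(y, np.nan, dtype=float) if y1_imputed is None else y1_imputed

    # Observed potential outcomes stay as observed; missing ones use the imputed
    # value when available, otherwise the extreme values 0 (lower) and 1 (upper).
    y1_lo = np.where(d == 1, y, np.where(np.isnan(y1_imputed), 0.0, y1_imputed))
    y1_hi = np.where(d == 1, y, np.where(np.isnan(y1_imputed), 1.0, y1_imputed))
    y0_lo = np.where(d == 0, y, np.where(np.isnan(y0_imputed), 0.0, y0_imputed))
    y0_hi = np.where(d == 0, y, np.where(np.isnan(y0_imputed), 1.0, y0_imputed))

    return (y1_lo - y0_hi).mean(), (y1_hi - y0_lo).mean()

# 63 hypothetical cases, 8 treated (held a truth commission); y = authoritarian resumption.
rng = np.random.default_rng(2)
d = np.zeros(63, dtype=int)
d[:8] = 1
y = rng.binomial(1, 0.3, size=63)

lo, hi = extreme_value_bounds(y, d)
print(f"bounds with no imputation: [{lo:+.2f}, {hi:+.2f}]  (width {hi - lo:.2f})")  # width = 1.00
```

With no imputation the width is always 100 percentage points, matching the pre-analysis bounds described above; each successfully imputed counterfactual narrows the interval.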
Researchers have increasingly turned to online convenience samples as sources of survey responses that are easy and inexpensive to collect. As reliance on these sources has grown, so too have concerns about the use of convenience samples in general and Amazon's Mechanical Turk in particular. We distinguish between "external validity" and theoretical relevance, with the latter being the more important justification for any data collection strategy. We explore an alternative source of online convenience samples, the Lucid Fulcrum Exchange, and assess its suitability for online survey experimental research. Our point of departure is the 2012 study by Berinsky, Huber, and Lenz that compares Amazon's Mechanical Turk to US national probability samples in terms of respondent characteristics and treatment effect estimates. We replicate these same analyses using a large sample of survey responses on the Lucid platform. Our results indicate that demographic and experimental findings on Lucid track well with US national benchmarks, with the exception of experimental treatments that aim to dispel the "death panel" rumor regarding the Affordable Care Act. We conclude that subjects recruited from the Lucid platform constitute a sample that is suitable for evaluating many social scientific theories, and that for many scholars Lucid can serve as a drop-in replacement for Mechanical Turk or other similar platforms.
Several theoretical perspectives suggest that when individuals are exposed to counter-attitudinal evidence or arguments, their pre-existing opinions and beliefs are reinforced, resulting in a phenomenon sometimes known as 'backlash'. This article formalizes the concept of backlash and specifies how it can be measured. It then presents the results from three survey experiments – two on Mechanical Turk and one on a nationally representative sample – that find no evidence of backlash, even under theoretically favorable conditions. While a casual reading of the literature on information processing suggests that backlash is rampant, these results indicate that it is much rarer than commonly supposed.