Einstein, as everyone knows, famously defined insanity as doing the same thing repeatedly and expecting different results. Science is the mirror image of insanity (which is not to say there are no mad scientists). It expects — indeed, requires — the same results when scientists do the same experiments or calculations over and over. Thus, according to an important and widely noticed study just published in Science, “Estimating The Reproducibility Of Psychological Science,” there is a real question whether much of the allegedly scientific research published in learned journals of psychology actually qualifies as science.
The Reproducibility Project, coordinated by University of Virginia psychology professor Brian Nosek, executive director of the Center for Open Science, involved a team of 270 psychologists from around the world who attempted to replicate the findings of 100 articles published in 2008 selected from three leading psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition.
A substantial majority of the studies studied, it turned out, were not reproducible, leading to “a clear conclusion” (as stated in the Science report): “A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.”
“Weaker evidence for the original findings” is a polite, statistically precise but obfuscatory way of saying that the conclusions of those studies could not be confirmed. Reviewing these results, the New York Times declared in an article with an almost humorously understated title that “Many Psychology Findings Not As Strong As Claimed, Study Says.” The actual Times article was considerably more dramatic than its title suggests, noting for example that “Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall ‘effect size,’ a measure of the strength of a finding, dropped by about half across all of the studies.”
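As a rough check on the figures the Times quotes, the counts can be turned into proportions. The short Python sketch below uses only the three numbers in the passage above. The variable names are my own, and the choice of denominator (all 100 studies, or only the 97 with a clear significance result) is an assumption the quoted passage does not settle.

```python
# Counts as quoted from the New York Times article.
held_up = 35    # replications that reached statistical significance
did_not = 62    # replications that did not
excluded = 3    # significance "not clear"

total = held_up + did_not + excluded

# Two possible denominators (an assumption; the quote does not say which to use).
share_of_all = did_not / total                  # out of all 100 studies
share_of_clear = did_not / (held_up + did_not)  # out of the 97 with a clear result

print(f"Failed, out of all {total}: {share_of_all:.1%}")
print(f"Failed, out of {held_up + did_not} clear cases: {share_of_clear:.1%}")
```

On either reading, the share of studies that failed to hold up comes out between 62 and 64 percent.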
The Reproducibility Project’s finding that nearly two-thirds of the studies its researchers examined could not be reproduced is proving to be a substantial embarrassment to the field of psychology, and those associated with the project are making a great effort to soften the impact of their striking results. “The eye-opening results don’t necessarily mean that those original findings were incorrect or that the scientific process is flawed,” the Smithsonian Magazine insisted.
When one study finds an effect that a second study can’t replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A’s result may be false, or Study B’s results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.
“This project is not evidence that anything is broken. Rather, it’s an example of science doing what science does,” says Christopherson. “It’s impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”
Well, sure, but how reassuring is it to be told that about two-thirds of the presumably peer-reviewed psychological research published in leading journals is wrong … but only “temporarily”?
The original studies were virtually (probably literally) all based on experiments that the reproducers tried to reproduce, and thus a substantial amount of both the originals and the reproducers’ studies was devoted to statistical analysis of significance, reliability, etc. That is no doubt as it should be, and to their credit the Reproducibility Project and the Center for Open Science have made all of their own research available online. Perhaps in the future another reproducibility project with even more resources and researchers will check the work of this one.
Odd Research Design
If there is such an effort in the future, I think it would be in order to consider a dimension that so far as I can tell was not attempted here — moving beyond an analysis of the statistical fit between research methodology and conclusions to a more qualitative consideration of the research design, significance, and even good sense. Several of the studies I looked at would have fallen short on those grounds even if their conclusions had been found to be statistically valid.
Consider, for example, K.R. Morrison and D.T. Miller, “Distinguishing between silent and vocal minorities,” Journal of Personality and Social Psychology 94 (2008): 871-882, whose results were confirmed for the Reproducibility Project by Prof. Matt Motyl of the University of Illinois at Chicago. Morrison and Miller set out to test the entirely reasonable hypothesis that people will be more willing to express their opinions to an audience they think supportive than to one they think would be critical. To test this hypothesis they
compared the proportions of bumper stickers [counted in the parking lots of 3 Target department stores] expressing liberal or conservative opinions in a county that voted for a more liberal candidate or a more conservative candidate in the 2004 US Presidential Election. Specifically, they hypothesized that liberals in the liberal county would be more likely to express their opinions than conservatives in the liberal county, and conservatives in the conservative county would be more likely to express their opinions than liberals in the conservative county.
Surprise! There were more Democratic bumper stickers in the Democratic county and more Republican bumper stickers in the Republican county. But do these findings really confirm the hypothesis? Can’t they be as readily explained by the fact that Democratic counties have more Democrats and Republican counties more Republicans? Or perhaps the political parties were more organized and had more to spend on bumper stickers in counties where they were strong. And do we know the demographic/political breakdown of Target shoppers? Thus the fact that these findings were replicated by this method hardly makes them more significant.
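The base-rate objection can be made concrete with a small, purely hypothetical calculation. The sketch below uses made-up numbers (a county that is 60 percent Democratic, and an identical 10 percent sticker-display rate for both parties). None of these figures come from Morrison and Miller or from the replication, and the function name `sticker_share` is invented for illustration.

```python
def sticker_share(dem_share, dem_display_rate, rep_display_rate):
    """Fraction of all partisan bumper stickers that are Democratic.

    dem_share: fraction of the county's partisans who are Democrats (hypothetical)
    dem_display_rate / rep_display_rate: fraction of each party's members
        who put a sticker on their car (hypothetical)
    """
    dem_stickers = dem_share * dem_display_rate
    rep_stickers = (1 - dem_share) * rep_display_rate
    return dem_stickers / (dem_stickers + rep_stickers)


# Hypothetical county: 60% Democrats, both parties equally willing to display.
equal_willingness = sticker_share(0.60, 0.10, 0.10)

# Same county, but Democrats more willing to display (the study's hypothesis).
majority_bolder = sticker_share(0.60, 0.15, 0.10)

print(f"Equal willingness: {equal_willingness:.1%} of stickers are Democratic")
print(f"Majority bolder:   {majority_bolder:.1%} of stickers are Democratic")
```

Under these assumptions, equal willingness alone yields a sticker share that mirrors the county's partisan makeup, so a Democratic majority of stickers in a Democratic county is expected even with no audience effect. Only a sticker share that exceeds the population share would point toward the hypothesis, and that comparison requires knowing the partisan makeup of the people parking in those lots.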
I also looked at two studies by Stanford’s Claude Steele and co-authors purporting to test his ubiquitous “stereotype threat” theory. In “The Space Between Us: Stereotype Threat and Distance in Interracial Contexts,” Journal of Personality and Social Psychology 94 (2008): 91-107, the authors “use stereotype threat theory as a model” to test a prediction that whites would physically distance themselves from blacks in a conversation where the whites feared being stereotyped as racists. In a sense the theory, assumed to have been established by Steele’s earlier work, was used to test itself. Elaborate scenarios were established, and the authors found to their relief and satisfaction that the target white males sat closer to the black confederates when the conversation was about “love and relationships” than when the subject was “racial profiling,” unless the latter were described as a “learning experience.”
The attempt to replicate this study “was unable to attain statistical significance.” It did confirm that when the subject was racial profiling whites sat farther from blacks but was unable to attribute that to any perceived “stereotype threat” fear of being regarded as racist because the distance was largely unaffected by the “learning experience” variable. “Perhaps the prominence of racial profiling in the media, such as Ferguson, Missouri, and New York, has made people, regardless of ethnicity, more apprehensive to discuss the topic and subsequently distance themselves more during conversation,” the replication author suggested. The replication, however, did not even attempt to evaluate the authors’ conclusion that the “social distance” they found confirmed their view that “one’s concern with appearing prejudiced might have the ironic and unintended consequence of causing racial harms,” that “there may be ‘racism without racists.’” Thus there is reason to doubt whether those conclusions would be warranted even if the replication had been able “to attain statistical significance.”
In another study, “Social Identity Contingencies: How Diversity Cues Signal Threat or Safety for African Americans in Mainstream Institutions,” Journal of Personality and Social Psychology 94 (2008): 615-630, Steele et al. claim to have demonstrated that “people at risk of devaluation based on group membership are attuned to cues that signal social identity contingencies — judgments, stereotypes, opportunities, restrictions, and treatments that are tied to one’s social identity.”
In English: blacks are attuned to cues that they might be devalued because they are black. One of the most prominent threatening cues identified by Steele and his co-authors was “colorblindness,” which can be seen as “a means to ignore or invalidate the challenges that come with stigmatized group identities. Interpreted in this way, a colorblind diversity philosophy is diagnostic of marginalization, and we expect this cue to activate threatening social identity contingencies.”
The analysis of this study “did not replicate the original finding that fairness cues create more trust for Black but not White participants in an environment with low-minority representation.” It did not, however, attempt to evaluate the accuracy or reasonableness of the “cue” that a company’s colorblind policy can be seen as a threat to marginalize its black employees. But even if that and the study’s other findings were confirmed, the original study would probably provide more convincing evidence of the pervasive political correctness in the Bay Area, where the participants were selected, than of the persuasiveness of Steele’s “stereotype threat” theory.
Methodological replication, in short, is important … but it is not all-important. Studies like these three, for example, would be unconvincing even if their findings were confirmed.