Almost Two-thirds of Psychological Studies Are Wrong

psychology and the brain

Einstein, as everyone knows, famously defined insanity as doing the same thing repeatedly and expecting different results. Science is the mirror image of insanity (which is not to say there are no mad scientists). It expects — indeed, requires — the same results when scientists do the same experiments or calculations over and over. Thus, according to an important and widely noticed study just published in Science, “Estimating the Reproducibility of Psychological Science,” there is a real question whether much of the allegedly scientific research published in learned journals of psychology actually qualifies as science.

The Reproducibility Project, coordinated by University of Virginia psychology professor Brian Nosek, executive director of the Center for Open Science, involved a team of 270 psychologists from around the world who attempted to replicate the findings of 100 articles published in 2008 selected from three leading psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition.

A substantial majority of the studies examined, it turned out, were not reproducible, leading to “a clear conclusion” (as stated in the Science report): “A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.”

Humorously Understated

“Weaker evidence for the original findings” is a polite, statistically precise but obfuscatory way of saying that the conclusions of those studies could not be confirmed. Reviewing these results, the New York Times declared in an article with an almost humorously understated title that “Many Psychology Findings Not As Strong As Claimed, Study Says.” The actual Times article was considerably more dramatic than its title suggests, noting for example that “Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall ‘effect size,’ a measure of the strength of a finding, dropped by about half across all of the studies.”
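The two statistics the Times invokes — significance and effect size — can be made concrete with a small sketch. This is my own illustration with invented numbers, not data from the Science paper; Cohen’s d is one common measure of effect size (the standardized difference between two group means), and the “dropped by about half” pattern is simulated here, not derived.

```python
# Illustrative sketch only: how an effect size (Cohen's d) is computed
# for two groups. All numbers below are made up for demonstration.
import math
import random

random.seed(0)

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Hypothetical "original" study: a true effect of 0.8 SD.
original_a = [random.gauss(0.8, 1.0) for _ in range(30)]
original_b = [random.gauss(0.0, 1.0) for _ in range(30)]

# Hypothetical "replication": a true effect half as large, mimicking the
# roughly 50% shrinkage the Science paper reported on average.
rep_a = [random.gauss(0.4, 1.0) for _ in range(30)]
rep_b = [random.gauss(0.0, 1.0) for _ in range(30)]

print(f"original d    = {cohens_d(original_a, original_b):.2f}")
print(f"replication d = {cohens_d(rep_a, rep_b):.2f}")
```

With samples of 30 per group, the shrunken replication effect will often fail to reach significance even when the original did — which is one mundane way “weaker evidence” arises.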

The fact that the Reproducibility Project found that the findings of nearly two-thirds of the studies its researchers examined could not be reproduced is proving to be a substantial embarrassment in the field of psychology, and those associated with the review project are making a great effort to soften the impact of their striking results. “The eye-opening results don’t necessarily mean that those original findings were incorrect or that the scientific process is flawed,” the Smithsonian Magazine insisted.

When one study finds an effect that a second study can’t replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A’s result may be false, or Study B’s results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.

“This project is not evidence that anything is broken. Rather, it’s an example of science doing what science does,” says Christopherson. “It’s impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”

Well, sure, but how reassuring is it to be told that about two-thirds of the presumably peer-reviewed psychological research published in leading journals is wrong … but only “temporarily”?

The original studies were virtually (probably literally) all based on experiments that the reproducers tried to reproduce, and thus a substantial amount of both the originals and the reproducers’ studies was devoted to statistical analysis of significance, reliability, etc. That is no doubt as it should be, and to its credit the Reproducibility Project and the Center for Open Science have made all of their own research available online. Perhaps in the future another reproducibility project with even more resources and researchers will check the work of this one.

Odd Research Design

If there is such an effort in the future, I think it would be in order to consider a dimension that so far as I can tell was not attempted here — moving beyond an analysis of the statistical fit between research methodology and conclusions to a more qualitative consideration of the research design, significance, and even good sense. Several of the studies I looked at would have fallen short on those grounds even if their conclusions had been found to be statistically valid.

Consider, for example, K.R. Morrison and D.T. Miller, “Distinguishing between silent and vocal minorities,” Journal of Personality and Social Psychology 94 (2008): 871-882, whose results were confirmed for the Reproducibility Project by Prof. Matt Motyl of the University of Illinois at Chicago. Morrison and Miller set out to test the entirely reasonable hypothesis that people will be more willing to express their opinions to an audience they think supportive than one they think would be critical. To test this hypothesis they

compared the proportions of bumper stickers [counted in the parking lots of 3 Target department stores] expressing liberal or conservative opinions in a county that voted for a more liberal candidate or a more conservative candidate in the 2004 US Presidential Election. Specifically, they hypothesized that liberals in the liberal county would be more likely to express their opinions than conservatives in the liberal county, and conservatives in the conservative county would be more likely to express their opinions than liberals in the conservative county.

Surprise! There were more Democratic bumper stickers in the Democratic county and more Republican bumper stickers in the Republican county. But do these findings really confirm the hypothesis? Can’t they be as readily explained by the fact that Democratic counties have more Democrats and Republican counties more Republicans? Or perhaps the political parties were more organized and had more to spend on bumper stickers in counties where they were strong. And do we know the demographic/political breakdown of Target shoppers? Thus the fact that these findings were replicated by this method hardly makes them more significant.

I also looked at two studies by Stanford’s Claude Steele and co-authors purporting to test his ubiquitous “stereotype threat” theory. In “The Space Between Us: Stereotype Threat and Distance in Interracial Contexts,” Journal of Personality and Social Psychology 94 (2008): 91-107, the authors “use stereotype threat theory as a model” to test a prediction that whites would physically distance themselves from blacks in a conversation where the whites feared being stereotyped as racists. In a sense the theory, assumed to have been established by Steele’s earlier work, was used to test itself. Elaborate scenarios were established, and the authors found to their relief and satisfaction that the target white males sat closer to the black confederates when the conversation was about “love and relationships” than when the subject was “racial profiling,” unless the latter were described as a “learning experience.”

The attempt to replicate this study “was unable to attain statistical significance.” It did confirm that when the subject was racial profiling whites sat farther from blacks but was unable to attribute that to any perceived “stereotype threat” fear of being regarded as racist because the distance was largely unaffected by the “learning experience” variable. “Perhaps the prominence of racial profiling in the media, such as Ferguson, Missouri, and New York, has made people, regardless of ethnicity, more apprehensive to discuss the topic and subsequently distance themselves more during conversation,” the replication author suggested. The replication, however, did not even attempt to evaluate the authors’ conclusion that the “social distance” they found confirmed their view that “one’s concern with appearing prejudiced might have the ironic and unintended consequence of causing racial harms,” that “there may be ‘racism without racists.’” Thus there is reason to doubt whether those conclusions would be warranted even if the replication had been able “to attain statistical significance.”

In another study, “Social Identity Contingencies: How Diversity Cues Signal Threat or Safety for African Americans in Mainstream Institutions,” Journal of Personality and Social Psychology 94 (2008): 615-630, Steele et al. claim to have demonstrated that “people at risk of devaluation based on group membership are attuned to cues that signal social identity contingencies — judgments, stereotypes, opportunities, restrictions, and treatments that are tied to one’s social identity.”

In English: blacks are attuned to cues that they might be devalued because they are black. One of the most prominent threatening cues identified by Steele and his co-authors was “colorblindness,” which can be seen as “a means to ignore or invalidate the challenges that come with stigmatized group identities. Interpreted in this way, a colorblind diversity philosophy is diagnostic of marginalization, and we expect this cue to activate threatening social identity contingencies.”

The analysis of this study “did not replicate the original finding that fairness cues create more trust for Black but not White participants in an environment with low-minority representation.” It did not, however, attempt to evaluate the accuracy or reasonableness of the “cue” that a company’s colorblind policy can be seen as a threat to marginalize its black employees. Even if that and the study’s other findings were confirmed, however, the original study would probably provide more convincing evidence of pervasive political correctness in the Bay Area, where the participants were selected, than of the persuasiveness of Steele’s “stereotype threat” theory.

Methodological replication, in short, is important … but it is not all-important. Studies like these three, for example, would be unconvincing even if their findings were confirmed.


5 thoughts on “Almost Two-thirds of Psychological Studies Are Wrong”

  1. As an undergraduate I wrote a paper for an organizational behavior class about a theory of leadership. I read every book and published paper on the topic. As a graduate student several years later, I updated the paper after reading everything published after my first paper. For that leadership theory, the authors of studies were claiming correlations ranging from about .05 to .15. With such weak correlations it is no wonder that replicating the studies would yield indeterminate results, particularly when the numbers of subjects in the studies were perhaps 300 or far fewer.
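    The commenter’s arithmetic can be checked with a rough power calculation (my own back-of-envelope sketch, using the standard Fisher z approximation, not figures from any of the studies): detecting correlations in the .05–.15 range with 80% power at a two-tailed alpha of .05 requires far more than 300 subjects.

    ```python
    # Approximate sample size needed to detect a correlation r with
    # 80% power at alpha = .05 (two-tailed), via the Fisher z method.
    # Illustrative sketch; z_alpha and z_beta are the usual normal quantiles.
    import math

    def required_n(r, z_alpha=1.96, z_beta=0.8416):
        """Approximate n to detect correlation r (Fisher z approximation)."""
        fisher_z = 0.5 * math.log((1 + r) / (1 - r))
        return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

    for r in (0.05, 0.10, 0.15):
        print(f"r = {r:.2f}: need n of roughly {required_n(r)}")
    ```

    For r = .10 the approximation calls for roughly 800 subjects, and for r = .05 several thousand — so studies run on 300 or fewer were badly underpowered from the start.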

  2. Per the comments in the Alexander piece: can “priming” be understood as the same mechanism as “grooming” in other psychological profiling studies?

  3. Thank you for pointing out that so many of these psychology articles are ridiculous, even if “correct”. This is not said often enough. For example, the main problem with that “litter causes racism” article was not that it was fake (although it was); the problem was that the “experiment” was idiotic in the first place.

    Another point: It was considered important that the reproducers were “using materials provided by the original authors, reviewed in advance for methodological fidelity”. Yes, this is important if the only goal is to see if the original experimenters were honest and competent. In the larger picture, however, if the results of the original paper are to be considered relevant to anything, they should be reproducible under very broad circumstances.

  4. 1. Start with (i.e.) Aesop’s Fables.
    2. Invent a new language (Klingon will do) and translate. (AKA plagiarize)
    3. Demand high fees for “interpretation” that only a “recognized credentialed expert” (in Klingon) can comprehend. (Do NOT cite credentials of issuing “authority”)
    4. Insert into EVERY facet of “mandatory” law/gub’mint as lobbying will allow.
    5. Rinse, repeat.
    For best results, SEE: “Social” justice, and Aesop / Klingon “light”. Add hyphens… liberally
