How reliable are customer behavior experiments? A 2015 study published in Science by the Reproducibility Project produced grim results. Only 36 of the 100 studies that researchers tried to replicate produced statistically significant results.

A statistically significant study typically has a p-value below 0.05, meaning that if there were no real effect, the chance of seeing a result at least this strong by accident would be less than 1 in 20. But is the 0.05 benchmark really a meaningful measure of anything? It creates a sharp distinction at an arbitrary (some say meaningless) point. Is an experiment that barely meets the criterion somehow magically more relevant than one that barely misses it?
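One way to see what the 0.05 cutoff actually controls is to simulate many experiments in which there is no real effect at all and count how often they still come out "significant". The sketch below is illustrative only: the function names, the sample sizes, and the choice of a simple two-sided z-test with a known standard deviation are all assumptions made for the example, not anything taken from the Reproducibility Project.

```python
import random
from statistics import NormalDist, mean


def two_sample_p_value(a, b, sd=1.0):
    """Two-sided z-test p-value for a difference in means.

    Assumes (for illustration) that both groups share a known standard deviation.
    """
    standard_error = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=2000, n_per_group=50, alpha=0.05, seed=0):
    """Share of simulated no-effect experiments that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups are drawn from the same distribution: there is no true effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_p_value(a, b) < alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate()
print(f"Share of no-effect experiments with p < 0.05: {rate:.3f}")
```

Under these assumptions, the printed share should land close to 0.05. The threshold caps how often pure noise gets labeled a finding. It says nothing about how large or important a real effect is.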

And then there’s the question of effect size, the strength of a phenomenon. If an experiment shows that customers prefer text messages to phone contacts for certain interactions, effect size tells us how much more they prefer them. But again, the reproducibility study’s results were worrisome. On average, the reproduced experiments found effect sizes that were half of what the originals reported.
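To make "effect size" concrete, here is a minimal sketch of one common measure, Cohen's d: the difference between two group means divided by their pooled standard deviation. Using Cohen's d is an assumption for illustration (the reproducibility study may have used a different measure), and the satisfaction scores below are made-up numbers, not data from any experiment.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


# Hypothetical satisfaction scores for two contact channels.
text_message_scores = [2, 4, 6]
phone_call_scores = [1, 3, 5]

d = cohens_d(text_message_scores, phone_call_scores)
print(f"Cohen's d: {d:.2f}")
```

Because d is expressed in standard deviations, it can be compared across studies that used different scales. That is what makes a statement like "the replication found half the original effect" meaningful.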

It’s hard to interpret failed replications, though. Cognitive psychology studies (memory, learning, attention), for example, were twice as likely to be successfully replicated as social psychology studies (how people influence each other). Effect sizes in both fields declined on replication, but the gap may exist simply because cognitive experiments find larger effects than social ones to begin with, since social psychology is more sensitive to context. Maybe people vary so much that it’s just harder to find a signal in the noise.

But is irreproducibility reproducible? There are a lot of reasons why two experiments might produce different results. The original could be flawed, but so could the replication. There’s always random chance, of course. There could also be differences in the people who volunteered or in the way each experiment was carried out. For that matter, those trying to reproduce the original experiment might simply not have the technical skill or knowledge to do it the same way, or even to do it correctly.

But, of course, that’s a very good argument for making sure that experiments are repeated to ensure they’re replicable.



