Replication of experimental results has become a hot issue in the behavioral sciences and medicine. While some of this reflects outright fraud, it's mostly about "sloppy" science. Some results may be flukes, but even valid results can present problems. I tend to think replication is needed because experimental observations are often obtained through procedures so complex that it is not clear what is central to the procedure and what is not. Mark Liberman at Language Log has a post on "Reliability" that speaks to these issues. Here's a passage (emphasis mine):
The general idea is that meaning is always negotiated and that experimental replication is an aspect of the negotiations.

Some of the reasons for the problems are well known. There's the "file drawer effect", where you try many experiments and only publish the ones that produce the results you want. There's p-hacking, data-dredging, model-shopping, etc., where you torture the data until it yields a "statistically significant" result of an agreeable kind. There are mistakes in data analysis, often simple ones like using the wrong set of column labels. (And there are less innocent problems in data analysis, like those described in this article about cancer research, where some practices amount essentially to fraud, such as performing cross-validation while removing examples that don't fit the prediction.) There are uncontrolled covariates — at the workshop, we heard anecdotes about effects that depend on humidity, on the gender of experimenters, and on whether animal cages are lined with cedar or pine shavings. There's a famous case in psycholinguistics where the difference between egocentric and geocentric coordinate choice depends on whether the experimental environment has salient asymmetries in visual landmarks (Peggy Li and Lila Gleitman, "Turning the tables: language and spatial reasoning", Cognition 2002).
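The file-drawer effect Liberman mentions has a simple arithmetic core: if you run many tests of true null hypotheses and publish only the "significant" one, the chance of finding something publishable grows quickly with the number of attempts. Here is a minimal Monte Carlo sketch of that point (my own illustration, not from the post; the function name and parameters are hypothetical):

```python
import random

def chance_of_a_publishable_result(k_tests, alpha=0.05,
                                   n_trials=100_000, seed=0):
    """Estimate the probability that at least one of k independent
    tests of a true null hypothesis comes out 'significant' at level
    alpha -- a toy model of the file-drawer effect, where only that
    one test gets written up."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Under the null, each test is significant with probability alpha.
        if any(rng.random() < alpha for _ in range(k_tests)):
            hits += 1
    return hits / n_trials

if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_a_publishable_result(k), 3))
```

Analytically the probability is 1 − (1 − α)^k, so with twenty quiet attempts at α = 0.05 you have roughly a 64% chance of a spurious "finding" — which is why the unpublished attempts matter as much as the published one.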