Are Machine Learning and AI solving the problem of p-hacking?
Does using machine learning solve our problem of p-hacking and HARKing, or do we have the same problems as with statistical tests and models?
As a statistician, I was trained early on to understand statistical tests and, with that, the meaning, history, and problems of the p-value. Every researcher who works with data has seen a p-value, and most have even printed one in their own papers. Yet most researchers, including statisticians, struggle to understand it.
Not only do we struggle to understand the p-value, but we also misuse it. Fishing for significance, p-hacking, cherry-picking, and HARKing (hypothesizing after results are known) are the words we use to describe that misuse. This post is not about these words, though; it is about the deeper issue: our misuse of the p-value is driven by a wish to generate exciting results. And at some point we started equating “exciting” with small p-values.
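As a minimal illustration (my own sketch, not taken from any particular study), the simulation below shows what “fishing for significance” does: it runs 20 t-tests on pure noise, where every null hypothesis is true by construction, and typically one or more of them still comes out “significant” at the 5% level. The group sizes and the number of tests are arbitrary choices for the example.

```python
# Illustrative simulation: "fishing for significance" on pure noise.
# All null hypotheses are true here, yet some tests will look "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_tests = 20

significant = 0
for _ in range(n_tests):
    # Two groups drawn from the SAME distribution: any "effect" is pure chance.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        significant += 1

print(f"{significant} of {n_tests} null comparisons were 'significant' at alpha = {alpha}")
```

Reporting only the comparisons that cross the threshold, and framing them as if they had been planned all along, is exactly the misuse described above.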
Researchers in many disciplines have argued that p-values should be avoided. In 2015 one journal banned p-values (see https://doi.org/10.1080/01973533.2015.1012991).

But it’s not the p-value’s fault. The p-value is just a number computed from a statistical test. It’s the humans who are the problem.
- We humans like simple binary decisions: Is this result interesting or not interesting?
- We humans invented rules under which only “interesting” research leads to promotions and prizes for the researcher.
- We humans value our careers over being 100% accurate with our statistics.
- We humans …
And I am not here to blame any single person. I write “we” because I’ve made the same mistakes. It’s just how humans work.
So, finally, I want to get to the question I ask in the title of this post: Are Machine Learning and AI solving the problem of p-hacking?
And you might have guessed it already: of course the answer is NO! We might not be “fishing for significance” anymore, but we’re still the same flawed humans. We choose different metrics to show our results are exciting: performance metrics of models, for example. Different numbers, same problems. Maybe even bigger problems, because performance metrics can be tampered with even more easily than p-values (see the sketch after this list):
- Choose a different metric (there are so many!)
- Create a different training, test, validation split
- Set a different seed for cross-validation
- Evaluate the new machine learning method on a different (simulated) data set
- …
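To make these points concrete, here is a minimal sketch of “seed hacking” (an illustration I made up, using scikit-learn on synthetic data; the dataset, model, and number of seeds are all arbitrary assumptions): try many random train/test splits of the same data and report only the best score.

```python
# Illustrative sketch of "seed hacking": try many random train/test splits
# of the SAME data and be tempted to report only the most flattering score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
for seed in range(50):  # 50 different splits of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"mean accuracy over splits: {np.mean(scores):.2f}")
print(f"best single split:         {max(scores):.2f}")  # the tempting number to report
```

The honest summary is the average (or a properly cross-validated estimate); the “best” split is just the machine learning version of a hand-picked p-value.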
And it happens without us even realizing what we’re doing: that we are engaging in questionable research practices. Whether we are talking about supervised machine learning or generative AI, the problem is the same.
The remedies for these issues are practically the same for p-values and machine learning:
- Teach people so they understand the problem.
- Preregister research.
- Give people credit for the methods, not the outcome.
I hope this post was helpful to you.
Please note: this is my opinion, and I probably forgot some things in this post. So please be kind to me, point out any faults, and I’ll happily improve it.
All the best, Heidi