Over-fitting – The Bane of all Data Mining (Motivation Decay Part 2)


In the age of AI and automated data mining, many readers might think that all it takes is a good statistical fit to ‘prove’ you have found the right correlation. I stumbled upon a very interesting article on the BBC website just as I was contemplating starting this blog: AAAS: Machine learning ‘causing science crisis’.

Below is a snapshot of me working on the motivation decay algorithm:

Dr Thod indulging in correlations

All of the graphs show different attempts to find a good correlation. The circled number is an R^2 value of 0.9995, and many of the others are above 0.99. So why did I discard all of these models, never publish them, and instead carry on searching for the ‘true’ algorithm?

  1. I can’t explain them.
  2. Raising a model’s complexity purely to improve its statistical fit is bad practice (see the sketch after this list).
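To make the second point concrete, here is a minimal sketch in Python. The data is made up (a simple linear relation plus noise, not my actual motivation decay measurements), but the effect is generic: training R^2 climbs towards 1 with every extra polynomial degree, while the score on held-out points collapses.

```python
# Minimal over-fitting demo with invented data (not the game data).
# Higher-degree polynomials always fit the training points better,
# but generalise worse to points they have never seen.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, x.size)  # true relation is linear

train = np.ones(x.size, dtype=bool)
train[::4] = False  # hold out every fourth point for validation

def r2(y_true, y_pred):
    """Coefficient of determination, computed by hand."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for degree in (1, 3, 6, 12):
    coeffs = P.polyfit(x[train], y[train], degree)
    print(f"degree {degree:2d}: "
          f"train R^2 = {r2(y[train], P.polyval(x[train], coeffs)):.4f}  "
          f"held-out R^2 = {r2(y[~train], P.polyval(x[~train], coeffs)):.4f}")
```

This is exactly the trap the AAAS article warns about: the fit statistic says the complex model is ‘better’, but it has only memorised the noise.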

With hindsight, some of the models were actually pretty close to the real solution. Only a few ‘minor’ issues in my training data prevented me from finding it:

  1. A single data point, CP 103, is off by 10% (1.01% instead of 1.11%). I will deal with this in Part 3.
  2. The actual algorithm is discontinuous, meaning there is no single function that covers the whole range (the sketch below illustrates why that matters). I will deal with this in Part 4.
  3. I am missing data around the discontinuity (no data between CP 250 and 350). I will deal with this in Part 5.
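Issues 2 and 3 compound each other. The following sketch is purely illustrative (the jump location, slopes, and percentages are invented; they are not the real algorithm): a function with a jump inside an unsampled region defeats any single smooth fit, while two piecewise fits recover both branches exactly.

```python
# Illustrative only -- invented numbers, not the game's real formula.
# A jump at x = 300 (loosely mirroring the missing CP 250-350 region)
# is smeared out by one global polynomial but captured perfectly by
# fitting each side of the discontinuity separately.
import numpy as np
from numpy.polynomial import polynomial as P

x = np.concatenate([np.arange(100, 251, 10), np.arange(350, 501, 10)])
y = np.where(x < 300, 0.01 * x, 0.005 * x + 4.0)  # discontinuous at 300

global_fit = P.polyfit(x, y, 3)                    # one cubic over everything
left = P.polyfit(x[x < 300], y[x < 300], 1)        # piecewise: left branch
right = P.polyfit(x[x >= 300], y[x >= 300], 1)     # piecewise: right branch

max_err_global = np.max(np.abs(P.polyval(x, global_fit) - y))
max_err_piece = max(
    np.max(np.abs(P.polyval(x[x < 300], left) - y[x < 300])),
    np.max(np.abs(P.polyval(x[x >= 300], right) - y[x >= 300])),
)
print(f"single cubic, max error:     {max_err_global:.4f}")
print(f"piecewise linear, max error: {max_err_piece:.6f}")
```

And with no samples between CP 250 and 350, nothing in the data even tells you where the break sits, which is why that gap has to be filled before a piecewise approach can work.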