Prime Day Part 4 – avoiding a catch 22 – Dr Thod – Research and Analysis of Games and Algorithms

So how would we analyze the data? Let us start with the observed rates for Cubone and Ponyta and work backwards. Unfortunately, this creates a catch-22. If we lump all data together, we get an average shiny rate, no matter what actually happened.

Or the other way round: we can shift the number of neutral/ENL/RES portals and generate any shiny rate we want. This creates a problem for proving our hypothesis. If we use the results to infer in which areas people checked the Pokémon, we create a self-fulfilling prophecy. If players checked randomly across areas, then all the data we gather will show two flat rates of 1 in 120 and 1 in 160 which follow a perfect binomial distribution.

Did I just come up with a hypothesis that is impossible to disprove with the data gathered, but also impossible to prove? I wouldn’t write this article if that were the case. If you are not interested in stats and maths, just skip the next few parts. I don’t think I have a quick way to explain it. But if you want to dive deeper, carry on reading all of it.

Random distributions follow clear rules. A mixed rate leaves hidden artifacts in the data that a flat rate does not, and we can check for them. The key is: if players checked neutral, ENL and RES portals at random, then we are completely out of luck. To a certain degree we can expect players to have done just that. We didn’t know, so we just walked around instead of targeting Enlightened or Resistance hot-spots. But player behaviour is not random. Players stay in a small area of portals. And small areas of portals in Ingress are not randomly 20:20:60 ENL:RES:neutral.

I have put together a table with some player stereotypes. These are simplifications. But they lead to observable signals in the data.

Player definition	What play pattern would result in this	Data to expect
Players checking Pokémon in all three different areas: Enlightened, Resistance and no faction.	Players moving around across areas.	A single flat rate following a binomial distribution. If we only have data from these players, it will be impossible to prove or disprove the hypothesis.
Players checking Pokémon predominantly in a single area: either Enlightened, Resistance or no faction.	Players staying in a single place, or at least never moving out of one area.	We expect either a 1 in 50 rate for one of Cubone or Ponyta and a zero rate (or close to zero) for the other species, or a zero rate (or close to zero) for both.
Players with a mix of the two extreme behaviours above.	Players playing more in the area of one faction but spending some time in other areas as well.	These players will dampen any signal we have, so we have to be careful when averaging over all data.
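The difference between the first two stereotypes can be made concrete with a small calculation. The sketch below is only an illustration under assumed numbers: 150 checks per player, a 1 in 50 rate inside the matching faction area, a zero rate elsewhere, and a 20% share of portals for that faction. The names `roaming_pmf` and `stationary_pmf` are made up for this example. Both groups have the same average shiny rate, but the spread of shinies per player differs.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k shinies in n checks at per-check rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_CHECKS = 150          # checks per player (illustrative assumption)
BOOSTED_RATE = 1 / 50   # rate inside the matching faction area (from the table above)
BOOSTED_SHARE = 0.20    # assumed share of portals held by that faction


def roaming_pmf(k):
    """Player who lands in a random area for every single check."""
    # Each check is shiny with the averaged rate, so the count is one binomial.
    return binom_pmf(k, N_CHECKS, BOOSTED_SHARE * BOOSTED_RATE)


def stationary_pmf(k):
    """Population of players who each stay in one area for all checks."""
    boosted = binom_pmf(k, N_CHECKS, BOOSTED_RATE)
    unboosted = binom_pmf(k, N_CHECKS, 0.0)  # 1.0 for k == 0, otherwise 0.0
    return BOOSTED_SHARE * boosted + (1 - BOOSTED_SHARE) * unboosted


for k in range(6):
    print(f"{k} shinies: roaming {roaming_pmf(k):.4f}, stationary {stationary_pmf(k):.4f}")
```

With these assumptions the stationary population produces more players with zero shinies and more players with many shinies than the roaming one, even though the pooled rate is identical. That extra spread is the kind of artifact a single binomial cannot produce.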

So which player behaviors lead to data points worth looking at, because they break the binomial distribution we would expect under a flat rate?

Behavior	Amount of data gathered	Likelihood and usefulness for analysis
Staying in one faction area only	50-150 data points	We expect quite a few players with this play pattern. We could look at p-values for the luckiest players in this chunk of data. The values should be reasonable for a 1 in 50 rate but would seem very unlikely for a 1 in 150 rate.
Staying in one faction area only	400+ data points	Players who played that much are less likely to have stayed in a single place. But the few we do get could yield p-values that are very unlikely under a 1 in 150 rate.
Staying in no faction area all the time	50-150 data points	We expect quite a few players with this play pattern. Unfortunately, the data will be useless as.
Staying in no faction area all the time	400+ data points	There might be quite a few players who suffered this fate if they play in an area where no Ingress play happens.

Looking at the maths using binomial distributions

Pokémon checked	Shiny found	Probability under Ingress model (1 in 50)	Probability under flat rate model (1 in 150)	Signal strength (ratio of the two)
50	0	0.364	0.716	2
50	>1	0.264	0.044	6
50	>2	0.078	0.005	17
50	>3	0.018	0.000	50
150	0	0.048	0.367	8
150	>2	0.579	0.080	7
150	>3	0.353	0.019	19
150	>4	0.183	0.004	52
150	>5	0.082	0.001	148
400	10+	0.282112	0.000429	658
600	10+	0.760255	0.007913	96
400	15+	0.016247	0.000000	124745
600	15+	0.226137	0.000018	12482
400	0	0.000309	0.068866	223
600	0	0.000005	0.018072	3322
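The figures in this table can be recomputed with a few lines of Python. This is a minimal sketch, assuming the Ingress model means a 1 in 50 rate and the flat rate model means 1 in 150, and assuming the signal strength is the larger probability divided by the smaller one. That reading matches the rows I checked by hand, but the function names and the ratio definition are my own choices for illustration.

```python
from math import comb

INGRESS_RATE = 1 / 50   # assumed rate inside the matching faction area
FLAT_RATE = 1 / 150     # assumed single flat rate for everyone


def binom_pmf(k, n, p):
    """Probability of exactly k shinies in n checks at per-check rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more shinies in n checks (upper tail)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def signal_strength(p_a, p_b):
    """How many times more likely the outcome is under the better-fitting model."""
    return max(p_a, p_b) / min(p_a, p_b)


# (checks, minimum shinies); a minimum of 0 is treated as "exactly zero".
rows = [(50, 0), (50, 2), (150, 0), (150, 3), (400, 10), (600, 15)]
for n, k in rows:
    if k == 0:
        ingress = binom_pmf(0, n, INGRESS_RATE)
        flat = binom_pmf(0, n, FLAT_RATE)
        label = "exactly 0"
    else:
        ingress = prob_at_least(k, n, INGRESS_RATE)
        flat = prob_at_least(k, n, FLAT_RATE)
        label = f"{k} or more"
    ratio = signal_strength(ingress, flat)
    print(f"{n:>4} checks, {label:>10}: ingress {ingress:.6f}, flat {flat:.6f}, ratio {ratio:.0f}")
```

Note that a table entry such as ">1" corresponds to `prob_at_least(2, ...)` here, and "10+" to `prob_at_least(10, ...)`.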

It is important to state that we can’t just lump all data together and compare the shinies found against a binomial distribution. Between 50 and 150 observations, finding 2-3 shinies is more likely under the Ingress model. Between 250 and 500 data points, finding 2-3 shinies is less likely under it. Unfortunately, we don’t have 100 data points at exactly 100 observations and 100 data points at exactly 500 observations. But where is the fun in statistical analysis if it is easy?

I also show this here to justify that the way I split the data in the following analysis isn’t random but rather targeted to maximise the chance to get a good statistical signal.

There are three signals that we expect with a decent frequency if Ingress portals did influence the shiny rates, and that range from very unlikely to extremely unlikely under a flat rate:

Players with a small number of Pokémon caught and a high number of shinies
Players with a large number of Pokémon caught but zero shinies
Players with a large number of Pokémon caught and a lot of shinies

So finally it is time to look at the data we have.