Reputation: 27
I'm creating a package for Benford's law (for academic purposes), and I'm trying to perform a goodness-of-fit test with chisq.test.
I have this vector of first-digit counts:
prop <- c(1377, 803, 477, 381, 325, 261, 253, 224, 184)
which I want to compare with this vector of theoretical first-digit probabilities from Benford's law:
th <- c(0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046)
So I perform the test:
chisq.test(prop, p = th)
If I understood the purpose of the test correctly, it should return a large p-value (close to 1 rather than 0), because the proportions in the data (prop) are very similar to the theoretical proportions (th). But the output gives me:
Chi-squared test for given probabilities
data:  prop
X-squared = 22.044, df = 8, p-value = 0.004835
Can someone help me understand why it gives such a low p-value?
Thanks a lot
PS :
I performed chisq.benftest (Pearson's chi-squared goodness-of-fit test for Benford's law, from the BenfordTests package) on the same data and it gave me a more coherent p-value (0.7542), so I must have made a mistake somewhere, but I don't know where.
Upvotes: 0
Views: 209
Reputation: 909
I think the low p-value is because you have a large number of observations, and the data simply doesn't fit the theoretical expectation well enough.
With fewer observations there would be more uncertainty and you would get higher p-values; dividing the counts keeps the same proportions but shrinks the sample:
chisq.test(prop/2, p=th) # p-value = 0.1916
chisq.test(prop/3, p=th) # p-value = 0.4884
chisq.test(prop/4, p=th) # p-value = 0.6929
chisq.test(prop/5, p=th) # p-value = 0.8121
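Roughly speaking, the Pearson statistic is X-squared = sum((O - E)^2 / E), with expected counts E = n * th, so for fixed observed proportions it grows linearly with n while the degrees of freedom stay at 8. Here is a sketch in base R (re-declaring the data from the question, and using the exact Benford probabilities log10(1 + 1/d) rather than the rounded ones, which is presumably what produced the reported 22.044):

```r
# First-digit counts from the question and exact Benford
# probabilities log10(1 + 1/d) for d = 1..9:
prop <- c(1377, 803, 477, 381, 325, 261, 253, 224, 184)
th <- log10(1 + 1 / (1:9))

# Pearson statistic computed by hand: sum((O - E)^2 / E)
n <- sum(prop)
expected <- n * th
x2 <- sum((prop - expected)^2 / expected)

# Scaling every count by k scales x2 by k as well, while the
# degrees of freedom (8) stay fixed, so the p-value shrinks
# as the sample grows even when the proportions are identical:
pchisq(x2, df = 8, lower.tail = FALSE)
```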
To see where the test finds the most discrepancy, you can plot the standardized residuals (a "chi-gram") like this (note the division has to happen inside the barplot call):
barplot((prop - sum(prop) * th) / sqrt(sum(prop) * th))
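If I'm not mistaken, chisq.test already returns these standardized residuals in its result object, so you can draw the same chi-gram without computing it by hand:

```r
prop <- c(1377, 803, 477, 381, 325, 261, 253, 224, 184)
th <- c(0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046)

res <- chisq.test(prop, p = th)
# res$residuals is (observed - expected) / sqrt(expected):
barplot(res$residuals, names.arg = 1:9,
        xlab = "first digit", ylab = "standardized residual")
```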
Here is a plain R example, checking goodness of fit against a uniform distribution:
a <- c(11, 9)
t <- c(0.5, 0.5)
chisq.test(a,p=t)
That gives a p-value of 0.6547, because it's a rather small number of observations and there is only 1 degree of freedom.
But if you run the same test, with the same proportions, on a larger and larger number of observations, the p-value keeps falling:
chisq.test(a*3,p=t) # p-value = 0.4386
chisq.test(a*10,p=t) # p-value = 0.1573
chisq.test(a*20,p=t) # p-value = 0.0455
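The statistic itself makes the pattern explicit: multiplying the counts by k multiplies X-squared by k while the degrees of freedom stay at 1, which is why the p-value keeps dropping. A quick check, reusing the a and t vectors from above:

```r
a <- c(11, 9)
t <- c(0.5, 0.5)

# X-squared for each scaled sample; it grows linearly with k
# (expected counts are 10*k each, deviation is k, so the
# statistic is 2 * k^2 / (10 * k) = 0.2 * k):
sapply(c(1, 3, 10, 20),
       function(k) chisq.test(a * k, p = t)$statistic)
# 0.2  0.6  2.0  4.0
```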
Your original data really does look very close to the theory when you plot it. But with 8 degrees of freedom and a lot of observations, even small deviations become statistically significant.
The same principle applies to other inferential statistics: more observations make a test more certain about how well the sample represents the population.
Upvotes: 1