Reputation: 691
I have a vector of 30 samples I want to test the hypothesis of the sample being from a population which is normally distributed.
> N.concentration
[1] 0.164 0.045 0.069 0.100 0.050 0.080 0.043 0.036 0.057 0.154 0.133 0.193
[13] 0.129 0.121 0.081 0.178 0.041 0.040 0.116 0.078 0.104 0.095 0.116 0.038
[25] 0.141 0.100 0.104 0.078 0.121 0.104
I made a frequency vector using hist
> N.hist <- hist(N.concentration, breaks=10)
> N.freq <- N.hist$count
[1] 3 5 4 4 5 4 2 2 1
I'm using chisq.test
to check the fitness of N.freq
to a normal distribution, however, the function requires an argument p = a vector of probabilities of the same length of x, as defined in chisq.test documentation. I'm trying to generate a vector to it but, honestly, I don't know exactly what to generate. I'm trying
> d <- length(N.freq$count)%/%2
> p <- dnorm(c(-d:d))
> p
[1] 0.0001338302 0.0044318484 0.0539909665 0.2419707245 0.3989422804
[6] 0.2419707245 0.0539909665 0.0044318484 0.0001338302
> chisq.test(N.freq, p = p)
Error in chisq.test(p1$count, p = p) :
probabilities must sum to 1.
I thought about using rescale.p=TRUE
but I'm not sure if this will produce a valid test.
EDIT: If I use rescale.p, I got a warning message
> chisq.test(N.freq, p=p, rescale.p=TRUE)
Chi-squared test for given probabilities
data: N.freq
X-squared = 2697.7, df = 8, p-value < 2.2e-16
Warning message:
In chisq.test(N.freq, p = p, rescale.p = TRUE) :
Chi-squared approximation may be incorrect
Upvotes: 0
Views: 4020
Reputation: 226547
There are many statistical tests designed for testing departure from Normality of a specified set of data (e.g., see the nortest package). However, you should be aware that many statisticians feel that normality testing is "essentially useless": in particular (from an answer from the linked CrossValidated question):
The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.
However, going ahead and using the Shapiro-Wilk test from base R (according to the Wikipedia page, Shapiro-Wilk has good power - but note from discussion above that high power is not necessarily what we really want in this case ...)
d <- c(0.164,0.045,0.069,0.100,0.050,0.080,0.043,0.036,0.057,0.154,
0.133,0.193,0.129,0.121,0.081,0.178,0.041,0.040,0.116,0.078,
0.104,0.095,0.116,0.038,0.141,0.100,0.104,0.078,0.121,0.104)
shapiro.test(d)
## Shapiro-Wilk normality test
##
## data: d
## W = 0.9547, p-value = 0.2255
Graphical approach:
par(las=1,bty="l")
qqnorm(d)
qqline(d)
The points follow the line reasonably well, and the largest deviations (the three smallest points in the data set) are actually larger than expected, which means the data set is slightly thin-tailed at the lower end, meaning that tests based on the assumption of Normality will typically be slightly conservative.
Upvotes: 2
Reputation: 73385
As I said, to test normality we have to know the mean and standard error of the normal distribution in Null Hypothesis. Since there are no given values, we have to estimate them from your 30 data.
x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057,
0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04,
0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104,
0.078, 0.121, 0.104)
mu <- mean(x)
sig <- sd(x)
Now, as what you have done, we need to bin the data:
h <- hist(x, breaks = 10)
#List of 6
# $ breaks : num [1:10] 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
# $ counts : int [1:9] 3 5 4 4 5 4 2 2 1
# $ density : num [1:9] 5 8.33 6.67 6.67 8.33 ...
# $ mids : num [1:9] 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19
# $ xname : chr "x"
# $ equidist: logi TRUE
# - attr(*, "class")= chr "histogram"
To get the true probability under Null Hypothesis, we need probability for each bin cell, i.e., between breaks.
p <- diff(pnorm(h$breaks, mu, sig))
#[1] 0.05675523 0.10254734 0.15053351 0.17953337 0.17396679 0.13696059 0.08760419
#[8] 0.04552387 0.01921839
I tend not to trust chi-square test with only 30 data. But here is how we can use chisq.test
:
chisq.test(h$counts, p = p, rescale.p = TRUE)
#
# Chi-squared test for given probabilities
#
#data: h$counts
#X-squared = 3.1476, df = 8, p-value = 0.9248
#
#Warning message:
#In chisq.test(h$counts, p, rescale.p = TRUE) :
# Chi-squared approximation may be incorrect
Often you need not bother the warning message. If you want to get rid of it, set simulate.p.value = TRUE
:
chisq.test(h$counts, p = p, rescale.p = TRUE, simulate.p.value = TRUE)
#
# Chi-squared test for given probabilities with simulated p-value (based
# on 2000 replicates)
#
#data: h$counts
#X-squared = 3.1476, df = NA, p-value = 0.942
Upvotes: 3