Reputation: 7
Hi, I have a question about R.
I have a population of 200 employees, and I know the mean and standard deviation of working hours for the whole population.
The following must be repeated 400 times:
1) Draw a small random sample of 6 people from the population.
2) Construct a 90% confidence interval for the mean (μ), assuming the population size is infinite.
3) Among the 400 confidence intervals constructed in 2), count how many do not contain the mean (μ) of the whole population.
I have drawn the samples, but I am unable to build the confidence intervals.
Here is what I have done so far:
population <- data$hours01
n <- 6
Vect <- rep(0, 400)
for (i in 1:400) {
  ech <- sample(population, n)
  right[i] <- mean(ech) + 1.645 * (sd(ech) / sqrt(n))
  left[i] <- mean(ech) - 1.645 * (sd(ech) / sqrt(n))
}
Here is the data:
heur01
1411
1734
1048
2060
1983
1810
1387
1637
1419
1637
1185
1766
1484
1983
1217
1915
1846
1887
1742
988
1375
1193
2056
1919
1850
2076
1463
1113
1887
1919
1734
1157
1766
1951
1923
2173
1609
1895
1109
1028
1701
1875
1677
1653
1883
1677
1850
1738
1520
1415
1992
1919
1653
1625
1705
1742
1891
2108
1919
1911
1770
1834
1911
2060
1717
1943
1859
1738
1222
1709
2052
1141
1931
2068
2044
1725
1818
1798
1943
1939
1919
1790
2116
1750
2052
1605
1798
2169
1665
1673
1185
1717
1717
1657
1915
1778
2121
1786
1774
2056
1738
1883
1754
1790
1770
1947
1867
1794
1867
1790
1762
2080
1778
1903
1734
1838
1560
1592
1637
1467
1750
1653
1222
1709
1806
1334
1584
2052
1802
1774
1770
1258
1334
1322
1826
1600
2189
1907
1548
1617
1693
1020
992
1435
1613
1738
1419
1121
1629
1605
1455
1157
1717
1294
1359
1282
1758
1395
1129
1189
1790
1217
1133
1516
1516
1278
1072
911
1286
968
1076
1315
1221
1268
939
1879
986
1221
1456
1315
1785
1080
1362
1503
1127
1691
1174
1644
1691
939
1503
1080
1503
1832
1362
1691
1456
1879
1644
1033
Upvotes: 0
Views: 260
Reputation: 43354
You can build a function to calculate the confidence interval, and then apply it to samples with replicate
to generate a matrix of confidence intervals, which you can check against the population mean.
There is a possible complication: when the population standard deviation is unknown, confidence intervals are calculated with the t distribution, but when it is known, the normal distribution is used. If the degrees of freedom are relatively large, the choice makes very little difference, but each sample here has only 5 degrees of freedom (n - 1 with n = 6), so the difference matters.
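To make the size of that difference concrete, here is a small sketch comparing the two critical values for a two-sided 90% interval with samples of size 6. The names p, z_crit and t_crit are illustrative only and are not used elsewhere in this answer.

```r
# Two-sided 90% interval: 5% in each tail, so use the 0.95 quantile
p <- 1 - (1 - 0.90) / 2

z_crit <- qnorm(p)            # normal critical value (sd known)
t_crit <- qt(p, df = 6 - 1)   # t critical value (sd estimated from n = 6)

round(c(normal = z_crit, t = t_crit), 3)
#> normal      t 
#>  1.645  2.015 

# ratio of the two critical values
t_crit / z_crit
```

With 5 degrees of freedom the t critical value is roughly 1.2 times the normal one, so using 1.645 with an estimated standard deviation produces intervals that are too narrow.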
Thus, to build a robust function for the confidence interval, you would need something like
ci <- function(x, conf.level, sd = NULL){
    # convert the two-sided level to an upper quantile, e.g. 0.9 -> 0.95
    conf.level <- mean(c(conf.level, 1))
    mean.x <- mean(x)
    if (is.null(sd)) {                      # when standard deviation unknown,
        sd <- sd(x)                         # use sample standard deviation
        z <- qt(conf.level, length(x) - 1)  # and t distribution
    } else {
        z <- qnorm(conf.level)              # when known, use normal
    }
    int <- z * sd / sqrt(length(x))         # half-width of the interval
    c(low = mean.x - int,
      high = mean.x + int)
}
To try it out,
set.seed(47) # make sampling reproducible
# make a matrix of confidence intervals
ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))
ints[, 1:5]
#>          [,1]     [,2]     [,3]     [,4]     [,5]
#> low  1443.959 1441.625 1376.459 1486.625 1436.959
#> high 1865.041 1862.708 1797.541 1907.708 1858.041
# calculate number of intervals that don't contain mean
mean.x <- mean(heur01)
sum(mean.x < ints[1,] | mean.x > ints[2,])
#> [1] 37
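The same count can also be obtained with an explicit loop, closer to the attempt in the question. The sketch below is self-contained, so it uses a simulated stand-in population (pop, drawn with rnorm) instead of heur01; that population and the names left, right, half and misses are assumptions for illustration. It treats the standard deviation as unknown and therefore uses the t distribution. The key point is that the result vectors must exist before they are indexed inside the loop.

```r
set.seed(1)                                # make the example reproducible
pop <- rnorm(200, mean = 1600, sd = 300)   # hypothetical stand-in for the 200 employees
mu  <- mean(pop)                           # population mean to check against

n_rep <- 400
n     <- 6
left  <- numeric(n_rep)                    # preallocate the lower bounds
right <- numeric(n_rep)                    # preallocate the upper bounds

t_crit <- qt(0.95, df = n - 1)             # two-sided 90% interval, sd estimated

for (i in seq_len(n_rep)) {
  ech      <- sample(pop, n)               # sample of 6 without replacement
  half     <- t_crit * sd(ech) / sqrt(n)   # half-width of the interval
  left[i]  <- mean(ech) - half
  right[i] <- mean(ech) + half
}

# number of intervals that do not contain the population mean
misses <- sum(mu < left | mu > right)
misses
```

Swapping pop for the real data vector should give a count comparable to the replicate version, although the exact number depends on the seed.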
To see that the count does, in fact, differ when the standard deviation isn't supplied (so that ci falls back to the sample standard deviation and the t distribution),
set.seed(47)
with_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(with_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    27.0    34.0    37.0    37.5    41.0    50.0 
set.seed(47)
no_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(no_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   29.00   43.00   46.00   47.07   52.00   66.00 
t.test(with_sd, no_sd)
#> 
#>     Welch Two Sample t-test
#> 
#> data:  with_sd and no_sd
#> t = -11.472, df = 187.14, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.215668  -7.924332
#> sample estimates:
#> mean of x mean of y 
#>     37.50     47.07 
Data
heur01 <- c(1411L, 1734L, 1048L, 2060L, 1983L, 1810L, 1387L, 1637L, 1419L, 1637L, 1185L, 1766L, 1484L, 1983L,
1217L, 1915L, 1846L, 1887L, 1742L, 988L, 1375L, 1193L, 2056L, 1919L, 1850L, 2076L, 1463L, 1113L, 1887L,
1919L, 1734L, 1157L, 1766L, 1951L, 1923L, 2173L, 1609L, 1895L, 1109L, 1028L, 1701L, 1875L, 1677L, 1653L,
1883L, 1677L, 1850L, 1738L, 1520L, 1415L, 1992L, 1919L, 1653L, 1625L, 1705L, 1742L, 1891L, 2108L, 1919L,
1911L, 1770L, 1834L, 1911L, 2060L, 1717L, 1943L, 1859L, 1738L, 1222L, 1709L, 2052L, 1141L, 1931L, 2068L,
2044L, 1725L, 1818L, 1798L, 1943L, 1939L, 1919L, 1790L, 2116L, 1750L, 2052L, 1605L, 1798L, 2169L, 1665L,
1673L, 1185L, 1717L, 1717L, 1657L, 1915L, 1778L, 2121L, 1786L, 1774L, 2056L, 1738L, 1883L, 1754L, 1790L,
1770L, 1947L, 1867L, 1794L, 1867L, 1790L, 1762L, 2080L, 1778L, 1903L, 1734L, 1838L, 1560L, 1592L, 1637L,
1467L, 1750L, 1653L, 1222L, 1709L, 1806L, 1334L, 1584L, 2052L, 1802L, 1774L, 1770L, 1258L, 1334L, 1322L,
1826L, 1600L, 2189L, 1907L, 1548L, 1617L, 1693L, 1020L, 992L, 1435L, 1613L, 1738L, 1419L, 1121L, 1629L,
1605L, 1455L, 1157L, 1717L, 1294L, 1359L, 1282L, 1758L, 1395L, 1129L, 1189L, 1790L, 1217L, 1133L, 1516L,
1516L, 1278L, 1072L, 911L, 1286L, 968L, 1076L, 1315L, 1221L, 1268L, 939L, 1879L, 986L, 1221L, 1456L,
1315L, 1785L, 1080L, 1362L, 1503L, 1127L, 1691L, 1174L, 1644L, 1691L, 939L, 1503L, 1080L, 1503L, 1832L,
1362L, 1691L, 1456L, 1879L, 1644L, 1033L)
Upvotes: 1