Reputation: 59
I want to be able to generate some confidence intervals for some test statistics using bootstrapping. What I would like to be able to do is to draw a bootstrapped dataset using sampling with replacement from my original dataset. I'm assuming that this would be a dataset of size n (where n is smaller than the size of the original dataset) that samples observations/rows of data from the full dataset with replacement (so that some rows may be drawn twice).
The code I have now for a single iteration is the following:
samp <- dat[sample(nrow(dat), 100000), ]
This code samples 100k rows from my dataset (dat).
My questions are the following:
Is this code sampling the rows with replacement? And is my assumption correct: is a bootstrapped dataset equivalent to a dataset of size n (smaller than the original dataset) whose rows are drawn at random from the full dataset with replacement?
Upvotes: 0
Views: 4281
Reputation: 2143
This answers the first part of your question.
The code is not sampling with replacement. The default for sample is replace=FALSE, so you need to add replace=TRUE:
samp <- dat[sample(nrow(dat), 100000, replace=TRUE), ]
We can explore this with a test case. Firstly generate some data:
dat <- data.frame(Number = 1:10)
Then run the code you have given, with the sample size reduced to 10:
samp <- dat[sample(nrow(dat), 10), ]
Then check whether any number appears more than once, i.e. whether it is duplicated:
duplicated(samp)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Run it again, this time including the replace=TRUE argument:
samp <- dat[sample(nrow(dat),10,replace=TRUE),]
duplicated(samp)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
The TRUE values mark duplicates, i.e. sampling with replacement has happened. The exact positions will vary from run to run because the sample is random.
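To tie this back to confidence intervals, here is a minimal sketch of how replace=TRUE might be used in a full bootstrap loop. It is only an illustration under stated assumptions: the data are simulated, the statistic is a plain mean, n_boot is an arbitrary choice, and each resample draws nrow(dat) rows (the usual bootstrap convention, which is my assumption and not something taken from your code).

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: 1,000 rows with one numeric column
dat <- data.frame(Number = rnorm(1000, mean = 5, sd = 2))

n_boot <- 2000  # number of bootstrap resamples (arbitrary choice)

# Each replicate resamples row indices with replacement,
# then computes the statistic on the resampled rows
boot_means <- replicate(n_boot, {
  idx <- sample(nrow(dat), nrow(dat), replace = TRUE)
  mean(dat[idx, "Number"])
})

# 95% percentile interval from the bootstrap distribution
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

Swapping mean() for your own test statistic should be enough to adapt it. The percentile interval is just one common way to turn the bootstrap distribution into an interval.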
Upvotes: 1