Kaitlin

Reputation: 59

Sampling random rows of a dataframe in R with replacement

I want to generate confidence intervals for some test statistics using bootstrapping. To do that, I would like to draw a bootstrapped dataset by sampling with replacement from my original dataset. I'm assuming this would be a dataset of size n (where n is smaller than the size of the original dataset) whose observations/rows are sampled from the full dataset with replacement (so that some rows may be drawn twice).

The code I have now for a single iteration is the following:

samp <- dat[sample(nrow(dat), 100000), ]

This code samples 100k rows from my dataset (dat).

My questions are the following:

Is this code sampling the rows with replacement? And is my assumption correct: a bootstrapped dataset using sampling with replacement is equivalent to sampling a dataset of size n (smaller than the original dataset) that randomly draws rows of data from the full dataset with replacement (is this bootstrapping with replacement)?

Upvotes: 0

Views: 4281

Answers (1)

dpel

Reputation: 2143

This answers the first part of your question -

The code is not sampling with replacement. To do this you need to add replace=TRUE, as the default for sample is not to replace, i.e.

samp <- dat[sample(nrow(dat), 100000, replace=TRUE), ]
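One consequence of the default is worth seeing directly: without replacement, sample() cannot return more values than the population holds, so asking for more rows than the data frame has raises an error. The sketch below assumes a hypothetical 5-row data frame called small (not part of the question) and uses tryCatch only to keep the script running past the error.

```r
# Hypothetical 5-row data frame, used only for illustration
small <- data.frame(Number = 1:5)

# Default (replace = FALSE): requesting 10 indices from 5 is an error,
# which tryCatch turns into the string "error" here
res <- tryCatch(
  small[sample(nrow(small), 10), , drop = FALSE],
  error = function(e) "error"
)

# With replace = TRUE the same request succeeds and repeats some rows
boot <- small[sample(nrow(small), 10, replace = TRUE), , drop = FALSE]
nrow(boot)
```

The drop = FALSE argument keeps the result a data frame even though there is only one column.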

We can explore this with a test case. Firstly generate some data:

dat <- data.frame(Number=1:10)

then run the code you have given

samp <- dat[sample(nrow(dat),10),]

then see if any numbers have appeared more than once, i.e. they are duplicated:

duplicated(samp)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Run again including the replace=TRUE argument:

samp <- dat[sample(nrow(dat),10,replace=TRUE),]
duplicated(samp)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

The TRUEs mean there are duplicates, i.e. replacement has happened.
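To connect this back to confidence intervals, here is a minimal sketch of a percentile bootstrap built on the same replace=TRUE call. It is only an illustration under stated assumptions: the data frame, its value column, the statistic (the mean) and the 1000 replicates are all made up for the example, and each resample is drawn with the same number of rows as dat, which is a common choice; a smaller size could be passed to sample() instead.

```r
set.seed(42)  # only so the sketch is reproducible

# Hypothetical data standing in for the real `dat`
dat <- data.frame(value = rnorm(200, mean = 5, sd = 2))

n_boot <- 1000  # number of bootstrap replicates (arbitrary choice)

# Each replicate: resample row indices with replacement,
# then compute the statistic on the resampled rows
boot_means <- replicate(n_boot, {
  idx <- sample(nrow(dat), nrow(dat), replace = TRUE)
  mean(dat[idx, "value"])
})

# Percentile 95% interval from the bootstrap distribution
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

Swapping mean() for another statistic only changes the last line inside replicate().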

Upvotes: 1
