Reputation: 1133
I have some code that allows me to take two randomly drawn samples from a dataset, apply a function and repeat the procedure a certain number of times (see below code from associated question: How to bootstrap a function with replacement and return the output).
Example data:
> dput(a)
structure(list(index = 1:30, val = c(14L, 22L, 1L, 25L, 3L, 34L,
35L, 36L, 24L, 35L, 33L, 31L, 30L, 30L, 29L, 28L, 26L, 12L, 41L,
36L, 32L, 37L, 56L, 34L, 23L, 24L, 28L, 22L, 10L, 19L), id = c(1L,
2L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 16L, 17L, 18L, 19L, 20L, 21L, 21L, 22L, 23L, 24L,
25L)), .Names = c("index", "val", "id"), class = "data.frame", row.names = c(NA,
-30L))
Code:
library(plyr)
extractDiff <- function(P){
subA <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a random sample of 15 rows
subB <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a second random sample of 15 rows
meanA <- mean(subA$val)
meanB <- mean(subB$val)
diff <- abs(meanA-meanB)
outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
return(outdf)
}
set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))
Rather than taking TWO randomly drawn samples of size 15, I would like to take one randomly drawn sample of size 15, then extract the remaining 15 rows in the dataset after the first random draw has been taken (i.e. subA would equal the first randomly drawn sample of 15 obs, subB would equal the remaining 15 obs after subA has been taken). I am really not sure how to go about doing this. Any help would be really appreciated. Thanks!
Upvotes: 0
Views: 888
Reputation: 2094
I believe you can do this by making a small change to your code, as follows.
extractDiff <- function(P){
sampleset <- sample(nrow(P), 15, replace=FALSE) # picks 15 distinct row numbers at random, note replace=FALSE
subA <- P[sampleset, ] # takes the 15 selected rows
subB <- P[-sampleset, ] # takes the remaining rows (15 of them when P has 30 rows)
meanA <- mean(subA$val)
meanB <- mean(subB$val)
diff <- abs(meanA-meanB)
outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
return(outdf)
}
However, please note that this is not compatible with bootstrapping, as bootstrapping requires sampling with replacement. If, on the other hand, you want to sample with replacement from the dataset, and then sample with replacement from the rows that were not selected in the first sampling, you could do the following.
extractDiff <- function(P){
sampleset1 <- sample(nrow(P), 15, replace=TRUE) # draws 15 row numbers at random, note replace=TRUE
sampleset2 <- sample((1:nrow(P))[-unique(sampleset1)], 15, replace=TRUE) # draws only from rows not used in sampleset1
subA <- P[sampleset1, ] # takes the 15 rows drawn first (may contain repeats)
subB <- P[sampleset2, ] # takes the 15 rows drawn from the remaining set
meanA <- mean(subA$val)
meanB <- mean(subB$val)
diff <- abs(meanA-meanB)
outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
return(outdf)
}
However, this still may not be ideal, depending on your application: the second sample is drawn from a smaller pool of rows (only those not used in the first), so it is more likely to contain multiple instances of a value than the first. If you were selecting a smaller proportion of the total set, it would be much less of a problem. You may be better off dividing the set into two by shuffling the rows and sampling with replacement from both halves so that the two sets are more even, but then the first set is, again, no longer a true bootstrap sample.
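To make the point about repeated rows concrete, here is a minimal sketch that counts how many distinct rows end up in each sample under the "sample with replacement, then sample from the unused rows" scheme. The function name count_distinct and the values of n, k and the number of repetitions are assumptions chosen for illustration; they are not part of the code above.

```r
# Sketch: how many distinct rows does each sample contain?
# n = number of rows, k = sample size (illustrative values matching the question).
count_distinct <- function(n = 30, k = 15) {
  first  <- sample(n, k, replace = TRUE)
  unused <- setdiff(seq_len(n), first)   # rows never drawn in `first`
  # Index into `unused` so a very small pool is not misread by sample()
  second <- unused[sample.int(length(unused), k, replace = TRUE)]
  c(distinctA = length(unique(first)), distinctB = length(unique(second)))
}

set.seed(42)
res <- t(replicate(2000, count_distinct()))
colMeans(res)  # average number of distinct rows in each sample
```

On average the first column comes out higher than the second, because the second sample is drawn from a pool that is already missing every row used by the first.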
Upvotes: 1
Reputation: 2950
In that case, I would just shuffle up the row numbers of P (stored in index below) and then choose the first 15 for subA and the second 15 for subB:
library(plyr)
extractDiff <- function(P){
index <- sample(seq_len(nrow(P)), replace = FALSE) # random permutation of the row numbers
subA <- P[index[1:15], ] # the first 15 shuffled rows
subB <- P[index[16:30], ] # the remaining 15 rows, none of which are in subA
meanA <- mean(subA$val)
meanB <- mean(subB$val)
diff <- abs(meanA-meanB)
outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
return(outdf)
}
set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))
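The positions 1:15 and 16:30 assume a data frame with exactly 30 rows. A possible generalisation, sketched here under the assumption that an arbitrary split size is wanted, derives both halves from the shuffled index. The names extractDiffSplit and nA, and the toy data frame, are hypothetical and only serve the illustration.

```r
# Sketch: shuffle-and-split without hard-coded row positions.
# `extractDiffSplit` and `nA` are hypothetical names, not from the answer.
extractDiffSplit <- function(P, nA = floor(nrow(P) / 2)) {
  stopifnot(nA >= 1, nA < nrow(P))
  index <- sample(nrow(P))         # random permutation of the row numbers
  inA   <- index[seq_len(nA)]      # first nA shuffled rows go to subA
  meanA <- mean(P$val[inA])
  meanB <- mean(P$val[-inA])       # every row not in subA goes to subB
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

# Toy data standing in for `a` from the question (values are made up).
toy <- data.frame(index = 1:10, val = c(4, 8, 15, 16, 23, 42, 7, 1, 9, 30))
set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiffSplit(toy), simplify = FALSE))
```

Because subB is defined as everything not in subA, the two halves never overlap and together always cover the whole data frame, whatever its size.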
Upvotes: 1