Reputation: 417
I am randomly sampling participants from an original data frame, and then I would like to create new data frames, each excluding one sample of ids and keeping the rest. (Note that the real data frame is much larger, with more variables and more observations for each id.)
Sample df:
id var1 var2
1 10 15
1 10 15
2 11 4
2 11 4
3 12 4
3 12 4
4 9 10
4 9 10
#randomly sample two sets of id
id <- as.numeric(as.character(df$id))
fold1 <- as.data.frame(sample(id, 2, replace=TRUE))
colnames(fold1) <- "id"
fold2 <- as.data.frame(sample(id, 2, replace=TRUE))
colnames(fold2) <- "id"
Desired output
df.new1:
id var1 var2
2 11 4
2 11 4
3 12 4
3 12 4
df.new2:
id var1 var2
1 10 15
1 10 15
4 9 10
4 9 10
I tried something along these lines, but there seems to be some issue with my syntax I can't quite figure out. If there's a dplyr implementation I would be really happy to see it.
list = c(fold1, fold2)
for(i in length(list)) {
df.new <- as.data.frame(df[!(df$id %in% list[i]$id), ])
assign(paste("df.new", i, sep="."), df.new)
}
Edit: I slightly modified the example to reflect the fact that each draw should sample a proportion of the total number of ids, and that the number of ids sampled across all draws should equal the total number of ids in the df. So if there are 4 ids, each draw should contain 2 ids.
Upvotes: 0
Views: 871
Reputation: 7153
For example, suppose you have sample data with 60 ids, each with one value:
df <- data.frame(id=1:60, val=sample(rep(letters, 3), 60))
To draw the ids for 5 subsets, each with 12 ids:
set.seed(1)
draw <- sample(1:60, 60, replace=FALSE)
id <- split(draw, rep(1:5, each=12))
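As a quick sanity check on this step, the sketch below repeats the draw and confirms that the five groups have 12 ids each, share no id, and together cover all 60. It uses only base R and the same object names as above; nothing here goes beyond what the split already does.

```r
# Repeat the draw from above so this snippet runs on its own.
set.seed(1)
draw <- sample(1:60, 60, replace = FALSE)   # random permutation of the ids
id <- split(draw, rep(1:5, each = 12))      # first 12 -> group 1, next 12 -> group 2, ...

# Each group holds exactly 12 ids.
all(lengths(id) == 12)        # TRUE

# No id appears in more than one group (0 means "no duplicates").
anyDuplicated(unlist(id))     # 0

# Together the groups cover every id exactly once.
setequal(unlist(id), 1:60)    # TRUE
```

Because the permutation is drawn without replacement, these checks hold for any seed, not just `set.seed(1)`.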
Using lapply to subset based on the id:
output <- lapply(id, function(x) df[df$id %in% x, ])
#e.g.
output[[1]]
# id val
# 4 4 y
# 9 9 f
# 11 11 x
# 12 12 e
# 16 16 o
# 22 22 o
# 33 33 d
# 34 34 n
# 36 36 r
# 50 50 s
# 52 52 p
# 57 57 p
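Note that the subsetting above keeps the rows whose id is in each draw, whereas the question asks for the opposite: drop the sampled ids and keep the rest. A minimal sketch of that variant on the question's sample data follows, assuming the ids should be drawn once each (from `unique(df$id)`) and without replacement. The names `n_folds`, `ids`, `folds`, `df_new` and `df_new_dplyr` are made up for this illustration, and the dplyr part only runs if that package is installed.

```r
# The question's sample data: two rows per id.
df <- data.frame(
  id   = rep(1:4, each = 2),
  var1 = rep(c(10, 11, 12, 9), each = 2),
  var2 = rep(c(15, 4, 4, 10), each = 2)
)

set.seed(1)  # only so the illustration is reproducible

n_folds <- 2                              # assumed: number of draws
ids     <- unique(df$id)                  # sample ids, not rows
draw    <- ids[sample.int(length(ids))]   # random permutation, no replacement

# Deal the shuffled ids into n_folds groups of (nearly) equal size.
folds <- split(draw, rep(seq_len(n_folds), length.out = length(ids)))

# For each fold, drop the rows whose id was sampled and keep everything else.
df_new <- lapply(folds, function(x) df[!(df$id %in% x), ])
names(df_new) <- paste0("df.new", seq_along(df_new))

# Optional dplyr equivalent (skipped when dplyr is not installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  df_new_dplyr <- lapply(folds, function(x) dplyr::filter(df, !(id %in% x)))
}
```

Keeping the results in a named list (`df_new$df.new1`, `df_new$df.new2`) avoids `assign()` and makes it easy to loop over the data frames later. Which ids end up in which fold depends on the seed.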
Upvotes: 1