Reputation: 185
I want to create 2 variables per unique identifier (ID) from 1 column. I want to randomly select half of the values to be one variable and the remaining half to be the other variable. Below is a sample dataframe:
Df1 <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
                  var = c(100, 200, 250, 400,425,250,80, 120, 210, 175,50,200,300, 90, 70, 500,400))
Any help will be greatly appreciated.
Thanks
Upvotes: 1
Views: 910
Reputation: 193637
This seems like it should do what you are looking for:
set.seed(1) # So you can reproduce my result
## Create an indicator column that will take the values of 0 and 1
## Initialize it with 0
Df1$ind <- 0
## Use `by` and `sample` to get half of the rows for each ID
## Assign "1" to the "ind" column for those rows
Df1$ind[unlist(by(1:nrow(Df1), Df1$ID,
function(x) sample(x, ceiling(length(x)/2), FALSE)))] <- 1
## Create a "time" variable based on the "ID" and "ind" columns
Df1$time <- with(Df1, ave(ind, ID, ind, FUN = seq_along))
## Reshape the data (if required) into columns based on the indicator column
## The ID and time columns would serve as your unique IDs
library(reshape2)
dcast(Df1, ID + time ~ ind, value.var="var")
# ID time 0 1
# 1 1 1 100 200
# 2 1 2 400 250
# 3 1 3 425 250
# 4 2 1 80 120
# 5 2 2 210 175
# 6 2 3 50 200
# 7 3 1 300 90
# 8 3 2 500 70
# 9 3 3 NA 400
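To confirm that each ID really is split roughly in half, a cross-tabulation of ID against the indicator can help. The following is only a sketch of the same idea: it swaps `by` for `split` plus `lapply`, and indexes with `sample.int` so that a group with a single row would not hit the special case where `sample(x)` treats a lone number as `1:x`. The names `rows_by_id`, `picked` and `counts` are illustrative, not from the answer above.

```r
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

set.seed(1)  # only for reproducibility; any seed works

# Row numbers belonging to each ID
rows_by_id <- split(seq_len(nrow(Df1)), Df1$ID)

# Pick ceiling(n/2) row numbers within each ID
picked <- unlist(lapply(rows_by_id, function(x) {
  x[sample.int(length(x), ceiling(length(x) / 2))]
}))

Df1$ind <- 0
Df1$ind[picked] <- 1

# Rows per ID in each half; the group sizes do not depend on the seed
counts <- table(ID = Df1$ID, ind = Df1$ind)
print(counts)
```

With 6, 6 and 5 rows per ID, the "1" column should hold 3 rows for every ID, and the "0" column 3, 3 and 2.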
Upvotes: 2
Reputation: 206401
If you don't mind if one column is systematically longer than the other, you can use
grp <- with(Df1, ave(ID, ID, FUN=function(x) sample(gl(2,1,length(x)))))
which will create a grouping vector with the values 1 and 2 that you can use to subset the groups.
Df1[grp=="1", ]
Df1[grp=="2", ]
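As a rough check of how the two groups come out per ID, the grouping vector can be cross-tabulated against ID. This is a minimal sketch using the sample data from the question; the seed and the name `counts` are assumptions added for illustration.

```r
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

set.seed(123)  # arbitrary seed, only so the split can be reproduced

# gl(2, 1, n) alternates 1, 2, 1, 2, ... so an odd-sized ID gets one more 1
grp <- with(Df1, ave(ID, ID, FUN = function(x) sample(gl(2, 1, length(x)))))

# Rows per ID (rows) in each group (columns)
counts <- table(ID = Df1$ID, grp = grp)
print(counts)

# The values of var in each half
split(Df1$var, grp)
```

For ID 3, which has 5 rows, group 1 should hold 3 rows and group 2 should hold 2, whatever the seed.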
This will always put the extra sample in group 1. If you want to randomize the placement of the leftover, then a helper function like this could help:
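markhalf <- function(x) {
  n <- floor(length(x)/2)
  z <- rep(c(1,2), each=n)      # n ones followed by n twos
  if (length(x) %% 2==1) {
    z <- c(z, sample(1:2, 1))   # odd group: the leftover goes to a random half
  }
  # Shuffle by index; sample(z) would expand a single number z into 1:z
  z[sample.int(length(z))]
}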
and then use it with ave again:
grp<-with(Df1, ave(ID, ID, FUN=markhalf))
Because both use sample, it should be a random assignment to each group.
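Putting the helper to work on the sample data might look like the sketch below. It restates `markhalf` so the block runs on its own, shuffling by index with `sample.int`. The seed and the names `half1`, `half2` and `counts` are assumptions for illustration only.

```r
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

markhalf <- function(x) {
  n <- floor(length(x) / 2)
  z <- rep(c(1, 2), each = n)
  if (length(x) %% 2 == 1) {
    z <- c(z, sample(1:2, 1))  # leftover row goes to a random half
  }
  z[sample.int(length(z))]     # shuffle the labels within the ID
}

set.seed(42)  # arbitrary seed for reproducibility
grp <- with(Df1, ave(ID, ID, FUN = markhalf))

# Group sizes per ID should differ by at most one
counts <- table(ID = Df1$ID, grp = grp)
print(counts)

# The two randomly selected halves
half1 <- Df1[grp == 1, ]
half2 <- Df1[grp == 2, ]
```

Which half receives the fifth row of ID 3 depends on the seed, so no particular output is shown here.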
Upvotes: 1
Reputation: 3188
There are a lot of sophisticated test/training data splitting functions in various packages. Here is a very simple one based on a random sample of row indices.
i = sample(1:nrow(Df1), size = floor(0.5*nrow(Df1)))
Df.set1 = Df1[i,]
Df.set2 = Df1[-i,]
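Note that sampling from all row indices at once ignores ID, so a given ID may end up with more than half of its rows in one set. If the split should be half within each ID, as the question asks, one possible adaptation is to sample inside each ID. This is a sketch only; the seed and the helper name `in_set1` are assumptions, and a logical index is used instead of `-i` so that an empty selection would not drop every row.

```r
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

set.seed(2024)  # arbitrary seed for reproducibility

# Within each ID, pick floor(n/2) of that ID's row numbers
i <- unlist(lapply(split(seq_len(nrow(Df1)), Df1$ID), function(rows) {
  rows[sample.int(length(rows), floor(0.5 * length(rows)))]
}))

# Logical membership flag: TRUE for rows chosen for the first set
in_set1 <- seq_len(nrow(Df1)) %in% i

Df.set1 <- Df1[in_set1, ]
Df.set2 <- Df1[!in_set1, ]

# Rows per ID in each set
table(Df.set1$ID)
table(Df.set2$ID)
```

With this data the first set should contain 3, 3 and 2 rows for IDs 1, 2 and 3, and the second set 3, 3 and 3.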
Upvotes: 1