Matt

Reputation: 185

For every id, randomly mark or select half the values from a dataframe column, to create two separate variables?

I want to create two variables per unique identifier (ID) from one column: half of each ID's values, chosen at random, should go into one variable and the remaining half into the other. Below is a sample data frame:

    Df1 <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
                      var = c(100, 200, 250, 400, 425, 250, 80, 120, 210, 175, 50, 200, 300, 90, 70, 500, 400))

Any help will be greatly appreciated.

Thanks

Upvotes: 1

Views: 910

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193637

This seems like it should do what you are looking for:

set.seed(1)   # So you can reproduce my result

## Create an indicator column that will take the values of 0 and 1
## Initialize it with 0
Df1$ind <- 0

## Use `by` and `sample` to get half of the rows for each ID
## Assign "1" to the "ind" column for those rows
Df1$ind[unlist(by(1:nrow(Df1), Df1$ID, 
                  function(x) sample(x, ceiling(length(x)/2), FALSE)))] <- 1

## Create a "time" variable based on the "ID" and "ind" columns
Df1$time <- with(Df1, ave(ind, ID, ind, FUN = seq_along))

## Reshape the data (if required) into columns based on the indicator column
## The ID and time columns would serve as your unique IDs
library(reshape2)
dcast(Df1, ID + time ~ ind, value.var="var")
#   ID time   0   1
# 1  1    1 100 200
# 2  1    2 400 250
# 3  1    3 425 250
# 4  2    1  80 120
# 5  2    2 210 175
# 6  2    3  50 200
# 7  3    1 300  90
# 8  3    2 500  70
# 9  3    3  NA 400

Upvotes: 2

MrFlick

Reputation: 206401

If you don't mind if one column is systematically longer than the other, you can use

grp <- with(Df1, ave(ID, ID, FUN=function(x) sample(gl(2,1,length(x)))))

which will create a vector of group labels, 1 and 2, that you can use to subset the groups.

Df1[grp=="1", ]
Df1[grp=="2", ]

This will always put the extra sample in group 1. If you want to randomize the placement of the leftover, then maybe a helper function like this could help

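markhalf <- function(x) {
  n <- floor(length(x)/2)
  z <- rep(c(1,2), each=n)
  if (length(x) %% 2==1) {
    # Odd group size: the leftover element goes to group 1 or 2 at random
    z <- c(z, sample(1:2, 1))
  }
  # Shuffle by index; sample(z) would treat a single number z as 1:z
  z[sample.int(length(z))]
}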

and then use it with ave again

grp<-with(Df1, ave(ID, ID, FUN=markhalf))

Because both use sample, it should be a random assignment to each group.
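To check that the assignment really is balanced within each ID, a quick cross-tabulation helps. This is only a sketch: it rebuilds the sample data from the question, uses a variant of the helper that shuffles by index, and the seed value is arbitrary.

```r
# Sample data from the question
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

# Helper: half 1s, half 2s, any leftover element assigned at random
markhalf <- function(x) {
  n <- floor(length(x) / 2)
  z <- rep(c(1, 2), each = n)
  if (length(x) %% 2 == 1) {
    z <- c(z, sample(1:2, 1))
  }
  # Shuffle by index so a length-1 z is not expanded by sample()
  z[sample.int(length(z))]
}

set.seed(42)  # arbitrary seed, only so the run is reproducible
grp <- with(Df1, ave(ID, ID, FUN = markhalf))

# Rows are IDs, columns are the two groups
counts <- table(Df1$ID, grp)
print(counts)

# The two variables for a single ID, e.g. ID 1
split(Df1$var[Df1$ID == 1], grp[Df1$ID == 1])
```

For every ID the two counts should differ by at most one, and which group gets the extra row for an odd-sized ID varies with the seed.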

Upvotes: 1

Vlo

Reputation: 3188

There are a lot of sophisticated test/training data splitting functions in various libraries. Here is a very simple one based on a random sample.

i = sample(1:nrow(Df1), size = floor(0.5*nrow(Df1)))
Df.set1 = Df1[i,]
Df.set2 = Df1[-i,]
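Note that the snippet above samples rows from the whole data frame, so the halves are not necessarily balanced within each ID. If a per-ID split is needed, one possible adaptation is sketched below. It is my own variation rather than part of the answer, and the name rows_by_id is made up for illustration.

```r
# Sample data from the question
Df1 <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  var = c(100,200,250,400,425,250,80,120,210,175,50,200,300,90,70,500,400)
)

set.seed(1)  # arbitrary seed, only so the run is reproducible

# Row indices grouped by ID (rows_by_id is a hypothetical helper name)
rows_by_id <- split(seq_len(nrow(Df1)), Df1$ID)

# Within each ID, draw floor(n/2) of its row indices without replacement
i <- unlist(
  lapply(rows_by_id, function(r) r[sample.int(length(r), floor(length(r) / 2))]),
  use.names = FALSE
)

Df.set1 <- Df1[i, ]
# setdiff() keeps this correct even if i happens to be empty
Df.set2 <- Df1[setdiff(seq_len(nrow(Df1)), i), ]
```

With this version, an ID with an odd number of rows always sends its extra row to Df.set2.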

Upvotes: 1
