How to randomly sample two columns based on percentages and assign labels?

Question

I have a dataframe that looks like this:

x   y   location
21 10   ny
12 22   ny
32 90   cha
33 14   cha 
...

I want to randomly sample the rows of x and y columns based on percentages. I want 30% of the rows of x and y to be randomly assigned group1 and 70% to be randomly assigned group2. Something like this:

x   y   location  group
21 10   ny        group1
12 22   ny        group2
32 90   cha       group2
33 14   cha       group2
...

I think I can do this with mutate() but I don't know how to write such code. Thank you for your help.

Ronak Shah · Accepted Answer

You can use sample() and set the probability of each group with the prob argument.

library(dplyr)

df <- df %>%
   mutate(group = sample(c('group1', 'group2'), n(), 
                          replace = TRUE, prob = c(0.3, 0.7)))

Because sample() draws each label independently with these probabilities, a df with 100 rows will not necessarily end up with exactly 70 rows in 'group2'. As the number of rows increases, the observed proportion gets closer to 70%.
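To see how far the counts can drift from 30/70, here is a minimal base R sketch. The vector name labels, the seed, and the size of 100 are assumptions chosen only for illustration; they are not part of the original question.

```r
# Hypothetical example: draw 100 labels with the same probabilities as above
set.seed(42)  # arbitrary seed, only so the draw is reproducible

labels <- sample(c('group1', 'group2'), 100,
                 replace = TRUE, prob = c(0.3, 0.7))

# Counts per group: typically near 30 and 70, but not guaranteed to match
counts <- table(labels)
print(counts)

# Observed proportions for comparison with the requested 0.3 / 0.7
print(prop.table(counts))
```

Rerunning with a different seed, or with a larger size, gives a feel for how much the split varies from draw to draw.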

If you want an exact 70%-30% partition, use rep() instead.

# Number of rows for 'group2'; named so it is not confused with dplyr's n()
n_group2 <- round(nrow(df) * 0.7)
df <- df %>%
   mutate(group = sample(rep(c('group1', 'group2'), c(n() - n_group2, n_group2))))
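A quick way to check that the rep() approach gives exact counts is to try it on a small data frame. The sketch below uses base R only; the data frame toy and its 10 rows are hypothetical and stand in for your own df.

```r
# Hypothetical 10-row data frame, used only to check the counts
toy <- data.frame(x = 1:10, y = 11:20)

n_group2 <- round(nrow(toy) * 0.7)   # rows to label 'group2'
n_group1 <- nrow(toy) - n_group2     # remaining rows get 'group1'

# Build the exact number of labels, then shuffle their order with sample()
toy$group <- sample(rep(c('group1', 'group2'), c(n_group1, n_group2)))

# The counts are fixed; only which rows receive which label is random
print(table(toy$group))
```

When the number of rows times 0.7 is not a whole number, round() decides which group gets the extra row, so the split is as close to 70%-30% as the row count allows.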
