AwaitedOne

Reputation: 1012

Sample a data frame based on two columns

I have a data frame such as:

df <- data.frame(matrix(rnorm(40), nrow=20))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=5)
df$score <- rep(c(1,2,3,5), each = 5)

I want to split the rows into two data frames, based on the two columns color and score, such that each data frame gets an almost equal number of rows from every color/score group. For example, there are 5 rows with color blue and score 1; I want 2 of them in one data frame and 3 in the other. If a group has six rows, 3 should go to one data frame and 3 to the other.

Upvotes: 0

Views: 582

Answers (1)

lroha

Reputation: 34291

If I've understood correctly, you can try something like:

set.seed(10)

df <- data.frame(matrix(rnorm(40), nrow=20))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=5)
df$score <- rep(c(1,2,3,5), each = 5)

library(dplyr)

df %>%
  group_by(color, score) %>%
  mutate(grp = sample(seq_along(score) %% 2)) %>%
  group_by(grp) %>%
  group_split()

This returns a list of two data frames. Within each color/score group, seq_along(score) %% 2 produces alternating 1s and 0s, and sample() shuffles them, so each row is randomly assigned to grp 0 or grp 1. With five rows per group that means two 0s and three 1s:


[[1]]
# A tibble: 8 x 5
      X1     X2 color  score   grp
   <dbl>  <dbl> <chr>  <dbl> <dbl>
1  0.675  0.257 blue       1     0
2 -0.548  0.365 blue       1     0
3 -1.89   0.851 red        2     0
4  1.09  -0.173 red        2     0
5  1.65  -0.500 yellow     3     0
6 -0.186  0.564 yellow     3     0
7 -0.208 -1.70  pink       5     0
8  0.661  0.447 pink       5     0

[[2]]
# A tibble: 12 x 5
        X1      X2 color  score   grp
     <dbl>   <dbl> <chr>  <dbl> <dbl>
 1  0.0555  2.12   blue       1     1
 2 -0.738  -0.843  blue       1     1
 3  0.833  -0.939  blue       1     1
 4 -1.57   -0.172  red        2     1
 5  1.43    0.767  red        2     1
 6  1.14    1.32   red        2     1
 7  1.01    0.997  yellow     3     1
 8 -1.20   -0.357  yellow     3     1
 9  0.474  -0.0911 yellow     3     1
10 -2.44    0.765  pink       5     1
11  1.15    0.463  pink       5     1
12 -0.426   1.53   pink       5     1
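
In the output above, every group of five sends its extra row to grp 1, so the two data frames end up with 8 and 12 rows. If you would rather have the extra row land in either data frame at random, one possible variation is sketched below. Note that balanced_labels() is a hypothetical helper written for this example, not a dplyr function, and the sketch assumes the same df as above.

```r
library(dplyr)

set.seed(10)

df <- data.frame(matrix(rnorm(40), nrow = 20))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 5)
df$score <- rep(c(1, 2, 3, 5), each = 5)

# Hypothetical helper (not part of dplyr): returns a shuffled vector of
# 0/1 labels of length n whose counts differ by at most one. The label
# that receives the extra row (when n is odd) is picked at random.
balanced_labels <- function(n) {
  offset <- sample(0:1, 1)             # decides which label gets the extra row
  sample((seq_len(n) + offset) %% 2)   # alternate the labels, then shuffle
}

splits <- df %>%
  group_by(color, score) %>%
  mutate(grp = balanced_labels(n())) %>%
  ungroup() %>%
  group_split(grp)

# Rows per color/score combination in each of the two data frames
lapply(splits, function(d) count(d, color, score))
```

Each color/score group still contributes 2 rows to one data frame and 3 to the other, but which one gets the 3 now varies from group to group, so the overall sizes tend to be closer than 8 and 12.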

Upvotes: 1
