dplyr solution to split dataset, but keep IDs in same splits

Question

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.

For example, imagine "test" below is an ID variable, and the dataset has many other columns.

test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
             val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.

The desired output would be something like

test[which(out==1),]

returning

Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.

Ryan Morton · Accepted Answer

You need to create a data frame, then use group_by and mutate to add columns:

test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
                 value = 1:16)
out <- test %>%
  mutate(new_column = ntile(id,3))
out

# A tibble: 16 x 3
      id value new_column
          
 1     1     1          1
 2     2     2          1
 3     3     3          1
 4     4     4          1
 5     4     5          1
 6     4     6          1
 7     4     7          2
 8     4     8          2
 9     6     9          2
10     7    10          2
11     8    11          2
12     9    12          3
13     9    13          3
14     9    14          3
15     9    15          3
16    10    16          3

Or given Frank's comment you could run the ntile function on distinct/unique values of the id - then join the original table back on id:

test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
                 value = 1:16)
out <- test %>%
  distinct(id) %>%
  mutate(new_column = ntile(id,3)) %>%
  right_join(test, by = "id")
out
# A tibble: 16 x 3
      id new_column value
          
 1     1          1     1
 2     2          1     2
 3     3          1     3
 4     4          2     4
 5     4          2     5
 6     4          2     6
 7     4          2     7
 8     4          2     8
 9     6          2     9
10     7          2    10
11     8          3    11
12     9          3    12
13     9          3    13
14     9          3    14
15     9          3    15
16    10          3    16

dplyr solution to split dataset, but keep IDs in same splits

Answers (1)

Related Questions