Reputation: 1161
I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
Upvotes: 0
Views: 641
Reputation: 2695
You need to create a data frame, then use group_by
and mutate
to add columns:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or given Frank's comment you could run the ntile
function on distinct/unique values of the id - then join the original table back on id:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
Upvotes: 1