unaeem
unaeem

Reputation: 441

R: Creating groups based on running totals that are non-repeating

Extension to my previous question

Creating groups based on running totals against a value

Previous problem: "I have data which is unique at one variable Y. Another variable Z tells me how many people are in each of Y. My problem is that I want to create groups of 45 from these Y and Z. I mean that whenever the running total of Z touches 45, one group is made and the code moves on to create the next group"

Extension to the problem: What if, the variable X which is A only right now is also changing. For example, it can be B for a while then can go to being C. How can I prevent the code from generating groups that are not within two categories of X. For example if Group = 3, then how can I make sure that 3 is not in category A and B?

Previously, I used two answers by @tmfmnk

df %>% 
 mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
        Group = cumsum(Cumsum >= 45),
        Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)

and by @G. Grothendieck

Accum <- function(acc, x) if (acc < 45)  acc + x else x
r <- Reduce(Accum, DF$Z, accumulate = TRUE)
g <- rev(cumsum(rev(r) >= 45))
g <- max(g) - g + 1

transform(DF, Cumsum = r, Group = g)

Both codes can solve this first problem.

My data looks like this


I have data which is unique at one variable Y. Another variable Z tells me how many people are in each of Y. My problem is that I want to create groups of 45 from these Y and Z. I mean that whenever the running total of Z touches 45, one group is made and the code moves on to create the next group.

My data looks something like this

ID  X   Y   Z
1   A   A   1
2   A   B   5
3   A   C   2
4   A   D   42
5   A   E   10
6   A   F   2
7   A   G   0
8   A   H   3
9   A   I   0
10  A   J   8
11  A   K   19
12  B   L   4
13  B   M   1
14  B   N   1
15  B   O   2
16  B   P   0
17  B   Q   1
18  B   R   2

I want something like this

ID  X   Y   Z   CumSum  Group
1   A   A   1   1   1
2   A   B   5   6   1
3   A   C   2   8   1
4   A   D   42  50  1
5   A   E   10  10  2
6   A   F   2   12  2
7   A   G   0   12  2
8   A   H   3   15  2
9   A   I   0   15  2
10  A   J   8   23  2
11  A   K   19  42  2
12  B   L   3   3   3
13  B   M   1   4   3
14  B   N   1   5   3
15  B   O   2   7   3   
16  B   P   0   7   3
17  B   Q   1   8   3
18  B   R   2   9   3

Kindly, let me know what can be done.

Upvotes: 1

Views: 509

Answers (1)

DSGym
DSGym

Reputation: 2867

Maybe not the sexiest solution, but I guess it does what you want.

Use a split apply combine approach with the new group_split function from R. Define a maxval which tracks the number of groups and always adds up in the next dataframe

 df = data.frame(
        ID = c(1:18),
        X = c(rep("A", 11), rep("B", 7)),
        Y = LETTERS[1:18],
        Z = c(1,5,2,42,10,2,0,3,0,8,19,4,1,1,2,0,1,2)
    )

library(dplyr)


listofdfs <- df %>% 
    group_split(X)
listofdfs

maxval = 0

for(i in 1:length(listofdfs)) {
    listofdfs[[i]] <- listofdfs[[i]] %>%
        mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
               Group = cumsum(Cumsum >= 45),
               Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1 + maxval)
    maxval <- max(listofdfs[[i]]$Group)
}

listofdfs

result <- rbindlist(listofdfs)
result


    ID X Y  Z Cumsum Group
 1:  1 A A  1      1     1
 2:  2 A B  5      6     1
 3:  3 A C  2      8     1
 4:  4 A D 42     50     1
 5:  5 A E 10     10     2
 6:  6 A F  2     12     2
 7:  7 A G  0     12     2
 8:  8 A H  3     15     2
 9:  9 A I  0     15     2
10: 10 A J  8     23     2
11: 11 A K 19     42     2
12: 12 B L  4      4     3
13: 13 B M  1      5     3
14: 14 B N  1      6     3
15: 15 B O  2      8     3
16: 16 B P  0      8     3
17: 17 B Q  1      9     3
18: 18 B R  2     11     3

Upvotes: 1

Related Questions