Reputation: 2770
I am trying to collapse categories until a condition is met. I have simulated some data. In cases in which "N" is less than 10 I want to take the sum of "wt" grouped by that level of group2 and the next highest level. On row one - "N" equals 0, so I want to sum "wt" for row one and row two. The sum of "N" in row 4 and row 5 is also less than 10 so I want to sum "wt" for rows 3,4 and 5. I know how to use group by in dplyr but don't know how to do it stipulated under conditions.
a <-expand.grid( group2=c( 1:5 ) , group1=c( "F","M" ) )
a$N <- c( 0 ,12, 15, 2, 5 ,9 , 10 , 11 , 12 , 15)
a$wt =c( 12 ,23 ,45 , 5 , 1 , 11 ,8 , 9 ,12, 27 )
a$row <- 1:10
So I have thought about writing a loop for each observation to look to the next row - but that seems clunky.
without a group by argument I just get the sum of all the places where"N" is than 10
a %>%
filter( N < 10 ) %>%
mutate( Wt2 = sum( wt ) )
Upvotes: 1
Views: 93
Reputation: 39154
Here is a solution using dplyr and data.table.
First, we can design a function, check_fun
to see if there are any two or more consecutive rows with N
below 10. TRUE
indicates aggregation is needed.
library(dplyr)
library(data.table)
check_fun <- function(df){
df2 <- df %>%
mutate(Below10 = rleid(N < 10)) %>%
filter(N < 10) %>%
count(group1, Below10)
return(any(df2$n > 1))
}
check_fun(a)
# [1] TRUE
We can then design a second function, aggregate_fun1
, which conduct the aggregation to the next row.
aggregate_fun1 <- function(df){
df2 <- df %>%
mutate(Below10 = rleid(N < 10)) %>%
group_by(Below10) %>%
mutate(Index1 = ifelse(N >= 10, row_number(), NA)) %>%
mutate(Index2 = ifelse(N < 10, row_number(), NA)) %>%
mutate(Index2 = ifelse(Index2 == 2, 1, Index2)) %>%
group_by(group1, Below10, Index1, Index2) %>%
summarize(N = sum(N), wt = sum(wt)) %>%
ungroup() %>%
select(-Below10, -Index1, -Index2)
return(df2)
}
a2 <- aggregate_fun1(a)
a2
# # A tibble: 9 x 3
# group1 N wt
# <fct> <dbl> <dbl>
# 1 F 0 12.0
# 2 F 12.0 23.0
# 3 F 15.0 45.0
# 4 F 7.00 6.00
# 5 M 9.00 11.0
# 6 M 10.0 8.00
# 7 M 11.0 9.00
# 8 M 12.0 12.0
# 9 M 15.0 27.0
We can apply aggregate_fun1
iteratively until there are no any two or more consecutive rows with N
below 10. We then need a third function, aggregate_fun2
, to aggregate those single rows with N
below 10 to the next or the previous row. Here I designed this function to take the next row as the priority compared to the previous row.
aggregate_fun2 <- function(df){
df2 <- df %>%
mutate(Flag1 = ifelse(N < 10, row_number(), NA)) %>%
mutate(Flag2 = ifelse(is.na(Flag1) & !is.na(lag(Flag1)), lag(Flag1), NA)) %>%
mutate(Flag3 = ifelse(is.na(Flag1) & !is.na(lead(Flag1)), lead(Flag1), NA)) %>%
mutate(Flag4 = coalesce(.$Flag1, .$Flag2, .$Flag3)) %>%
mutate(Flag4 = ifelse(is.na(Flag4), row_number(), Flag4)) %>%
group_by(group1, Flag4) %>%
summarize(N = sum(N), wt = sum(wt)) %>%
ungroup() %>%
select(-Flag4)
return(df2)
}
a3 <- aggregate_fun2(a2)
a3
# # A tibble: 6 x 3
# group1 N wt
# <fct> <dbl> <dbl>
# 1 F 12.0 35.0
# 2 F 22.0 51.0
# 3 M 19.0 19.0
# 4 M 11.0 9.00
# 5 M 12.0 12.0
# 6 M 15.0 27.0
In this example, a3
is the final output.
We can combine all the three function together with a while loop on check_fun
and aggregate_fun1
. If the condition is satisfied, we can then use aggregate_fun2
to calculate the final output. I called this function aggregate_fun
.
aggregate_fun <- function(df){
while(check_fun(df)){
df <- df %>% aggregate_fun1()
}
df2 <- df %>% aggregate_fun2()
return(df2)
}
By applying aggregate_fun
to a
, we can get the output.
aggregate_fun(a)
# # A tibble: 6 x 3
# group1 N wt
# <fct> <dbl> <dbl>
# 1 F 12.0 35.0
# 2 F 22.0 51.0
# 3 M 19.0 19.0
# 4 M 11.0 9.00
# 5 M 12.0 12.0
# 6 M 15.0 27.0
Upvotes: 2