boshek
boshek

Reputation: 4406

Aggregate with adjacent group if value falls below a threshold

I am trying to figure out a way to aggregate levels of a group creating a new level based on a threshold value of what you are aggregating.

Create some data:

library(tidyr)
library(dplyr)

demo_data <- as_tibble(VADeaths) %>% 
  mutate(age_bucket = row.names(VADeaths)) %>% 
  pivot_longer(-age_bucket) %>% 
  arrange(name)

Here are a bunch of values below our threshold (say 15 here)

demo_data %>% 
  filter(value < 15)
#> # A tibble: 5 x 3
#>   age_bucket name         value
#>   <chr>      <chr>        <dbl>
#> 1 50-54      Rural Female   8.7
#> 2 55-59      Rural Female  11.7
#> 3 50-54      Rural Male    11.7
#> 4 50-54      Urban Female   8.4
#> 5 55-59      Urban Female  13.6

Now I can use some logic to do this with case_when but this seems fragile because it is so specific. This does, however, illustrate what I am after:

demo_data %>% 
  mutate(age_bucket_agg = case_when(
    age_bucket %in% c("50-54", "55-59") & name == "Rural Female" ~ "50-59",
    age_bucket %in% c("50-54", "55-59") & name == "Urban Female" ~ "50-59",
    age_bucket %in% c("50-54", "55-59") & name == "Rural Male" ~ "50-59",
    TRUE ~ age_bucket
  )
  ) %>% 
  group_by(age_bucket_agg, name) %>% 
  summarise(value = sum(value))
#> `summarise()` regrouping output by 'age_bucket_agg' (override with `.groups` argument)
#> # A tibble: 17 x 3
#> # Groups:   age_bucket_agg [6]
#>    age_bucket_agg name         value
#>    <chr>          <chr>        <dbl>
#>  1 50-54          Urban Male    15.4
#>  2 50-59          Rural Female  20.4
#>  3 50-59          Rural Male    29.8
#>  4 50-59          Urban Female  22  
#>  5 55-59          Urban Male    24.3
#>  6 60-64          Rural Female  20.3
#>  7 60-64          Rural Male    26.9
#>  8 60-64          Urban Female  19.3
#>  9 60-64          Urban Male    37  
#> 10 65-69          Rural Female  30.9
#> 11 65-69          Rural Male    41  
#> 12 65-69          Urban Female  35.1
#> 13 65-69          Urban Male    54.6
#> 14 70-74          Rural Female  54.3
#> 15 70-74          Rural Male    66  
#> 16 70-74          Urban Female  50  
#> 17 70-74          Urban Male    71.1

My question is can anyone think of an automated way of doing this? How can I tell dplyr (or R in general) to take all values below as threshold and add them to the next age_bucket and then recode that grouping level to take the lowest value and the biggest value and create a new range.

Upvotes: 5

Views: 622

Answers (4)

TimTeaFan
TimTeaFan

Reputation: 18541

I think your example is a bit too minimal for this really challenging question. I added some challenges to your data which I think the approaches of the other answers can't tackle yet. My approach is quite verbose. Essentially, it checks every logical combination / direction in which age buckets could be merged and then recursively merges the age buckets until the threshold is met or until there are no other age buckets left to merge together. With a bit more work we could turn this into a more general function.

library(tidyverse)

demo_data <- as_tibble(VADeaths) %>% 
  mutate(age_bucket = row.names(VADeaths)) %>% 
  pivot_longer(-age_bucket) %>% 
  arrange(name) %>% 
  # lets add more challenges to the data
  mutate(value = case_when(
    age_bucket == "55-59" & name == "Rural Female" ~ 2,
    age_bucket == "70-74" & name == "Rural Male" ~ 13,
    age_bucket == "65-69" & name == "Urban Female" ~ 8,
    age_bucket == "70-74" & name == "Urban Male" ~ 3,
    T ~ value))

# function that implements merging age buckets
merge_impl <- function(x) {
  
  if(any(x$first)) {
    e <- filter(x, first == 1)
    
    if (e$id & !is.na(e$age_max_lead)) {
      out <- mutate(x,
                     age_max = if_else(first,
                                       age_max_lead,
                                       age_max),
                     value = if_else(first,
                                     value + value_lead,
                                     value)) 
      out <- filter(out, !lag(first, default = FALSE))

      
    } else if (e$id & is.na(e$age_max_lead & !is.na(e$age_min_lag))) {
      out <- mutate(x,
                     age_min = if_else(first,
                                       age_min_lag,
                                       age_min),
                     value = if_else(first,
                                     value + value_lag,
                                     value))
      out <- filter(out, !lead(first, default = FALSE))
      
    } else if (e$id & is.na(e$age_max_lead & is.na(e$age_min_lag))) {
      out <- x
    } else if (!e$id & !is.na(e$age_min_lag)) {
      out <- mutate(x,
                     age_min = if_else(first,
                                       age_min_lag,
                                       age_min),
                     value = if_else(first,
                                     value + value_lag,
                                     value)) 
      out <- filter(out, !lead(first, default = FALSE))

    } else if (!e$id & is.na(e$age_min_lag) & !is.na(e$age_max_lead)) {
      out <- mutate(x,
                     age_max = if_else(first,
                                       age_max_lead,
                                       age_max),
                     value = if_else(first,
                                     value + value_lead,
                                     value)) %>% 
        out <- filter(out, !lag(first, default = FALSE))

    } else if (!e$id & is.na(e$age_min_lag) & is.na(e$age_max_lead)) {
      out <- x
    }
  } else { 
    out <- x
  }

  select(out,
         -contains("lead"), -contains("lag"),
         -first, -id)
}

merge_age_buckets <- function(x, threshold) {
  
  # initialize
  data_ls <-
    x %>% 
    separate(age_bucket,
             c("age_min", "age_max"),
             convert = TRUE) %>% 
    group_by(name) %>% 
    mutate(across(c(age_min, age_max, value),
                    list(lead = ~ lead(.x),
                         lag  = ~ lag(.x))
                   )
    ) %>% 
    mutate(id = age_min %% 10 == 0,
           first = value < threshold & cumsum(value < threshold) == 1) %>% 
    group_split 

   # check & proceed
   if(any(map_lgl(data_ls, ~ any(.x$first & nrow(.x) > 1)))) {
     res <- map_dfr(data_ls, merge_impl) %>% 
       mutate(age_bucket = paste0(age_min, "-", age_max)) %>% 
       select(- c(age_min, age_max))
     # if result still needs adjustment repeat
     if(any(res$value < threshold)) {
       merge_age_buckets(res, threshold = threshold)
     } else {
       return(res)
       }
   } else {
     out <- reduce(data_ls, bind_rows) %>% 
       mutate(age_buckets = paste0(age_min, "-", age_max)) %>% 
       select(- c(age_min, age_max))
     return(out)
   }
}
 
merge_age_buckets(demo_data, 15)
#> # A tibble: 13 x 3
#>    name         value age_bucket
#>    <chr>        <dbl> <chr>     
#>  1 Rural Female  31   50-64     
#>  2 Rural Female  30.9 65-69     
#>  3 Rural Female  54.3 70-74     
#>  4 Rural Male    29.8 50-59     
#>  5 Rural Male    26.9 60-64     
#>  6 Rural Male    54   65-74     
#>  7 Urban Female  22   50-59     
#>  8 Urban Female  27.3 60-69     
#>  9 Urban Female  50   70-74     
#> 10 Urban Male    15.4 50-54     
#> 11 Urban Male    24.3 55-59     
#> 12 Urban Male    37   60-64     
#> 13 Urban Male    57.6 65-74

Created on 2020-06-23 by the reprex package (v0.3.0)

Upvotes: 1

tjebo
tjebo

Reputation: 23727

With a mix of cumsum and rle, (here using data.table::rleid, but you can also use base::rle)

library(tidyr)
library(dplyr)

demo_data <- as_tibble(VADeaths) %>% 
  mutate(age_bucket = as.factor(row.names(VADeaths))) %>% #factorise to get the levels right
  pivot_longer(-age_bucket) %>% 
  arrange(name, age_bucket) #added this to sort

thresh <- 15

demo_data %>% 
  group_by(name) %>%
  mutate(rle_val = data.table::rleid(value < thresh),
         min_nonconsec = which.min(c(1, diff(rle_val) != 1)),
         newbuck = cumsum(row_number() > min_nonconsec)) %>%
  group_by(name, newbuck) %>%
  summarise(newname = paste(age_bucket, collapse = "-"),
            newbucket = paste(unlist(strsplit(newname, "-"))[1], tail(unlist(strsplit(newname, "-")),1), sep = "-"),
            newval = sum(value)
  ) %>%
  select(-newname)

#> `summarise()` regrouping output by 'name' (override with `.groups` argument)
#> # A tibble: 15 x 4
#> # Groups:   name [4]
#>    name         newbuck newbucket newval
#>    <chr>          <int> <chr>      <dbl>
#>  1 Rural Female       0 50-64       40.7
#>  2 Rural Female       1 65-69       30.9
#>  3 Rural Female       2 70-74       54.3
#>  4 Rural Male         0 50-59       29.8
#>  5 Rural Male         1 60-64       26.9
#>  6 Rural Male         2 65-69       41  
#>  7 Rural Male         3 70-74       66  
#>  8 Urban Female       0 50-64       41.3
#>  9 Urban Female       1 65-69       35.1
#> 10 Urban Female       2 70-74       50  
#> 11 Urban Male         0 50-54       15.4
#> 12 Urban Male         1 55-59       24.3
#> 13 Urban Male         2 60-64       37  
#> 14 Urban Male         3 65-69       54.6
#> 15 Urban Male         4 70-74       71.1

Created on 2020-06-20 by the reprex package (v0.3.0)

Upvotes: 0

chinsoon12
chinsoon12

Reputation: 25225

Not sure if I understand the requirements correctly after TimTeaFan’s comments, here is approach in data.table:

library(data.table)
DT <- setDT(reshape2::melt(VADeaths, id.vars=NULL)) 
DT[, c("low", "high") := lapply(tstrsplit(Var1, "-"), as.integer)]

DT[value < 15, c("low","high") := .(min(low), max(high)), Var2]

DT[, sum(value), .(low, high, Var2)]

Upvotes: 0

Martin Gal
Martin Gal

Reputation: 16978

Here is a unneccessary complicated way using dplyr and stringr:

demo_data %>% 
  group_by(name) %>% 
  mutate(csum = cumsum(value),
         min_split  = ifelse(value<15, as.numeric(str_split(age_bucket[value<15], "-", simplify = TRUE))[1], NA),
         max_split  = ifelse(value<15, as.numeric(str_split(age_bucket[min(which(csum>15))], "-", simplify = TRUE))[2], NA),
         age_bucket = ifelse(value<15, str_c(min_split, "-", max_split), age_bucket),
         value      = ifelse(value<15, csum[min(which(csum>15))], value)) %>%
  select(-min_split, -max_split, -csum) %>%
  distinct() %>%
  arrange(age_bucket)

which yields

# A tibble: 18 x 3
# Groups:   name [4]
   age_bucket name         value
   <chr>      <chr>        <dbl>
 1 50-54      Urban Male    15.4
 2 50-59      Rural Female  20.4
 3 50-59      Rural Male    29.8
 4 50-59      Urban Female  22  
 5 55-59      Rural Male    18.1
 6 55-59      Urban Male    24.3
 7 60-64      Rural Female  20.3
 8 60-64      Rural Male    26.9
 9 60-64      Urban Female  19.3
10 60-64      Urban Male    37  
11 65-69      Rural Female  30.9
12 65-69      Rural Male    41  
13 65-69      Urban Female  35.1
14 65-69      Urban Male    54.6
15 70-74      Rural Female  54.3
16 70-74      Rural Male    66  
17 70-74      Urban Female  50  
18 70-74      Urban Male    71.1

Upvotes: 0

Related Questions