anonymous1a
anonymous1a

Reputation: 1270

dplyr tidyr – How to generate case_when with dynamic conditons?

Is there a way to dynamically/programmatically generate case_when conditions in dplyr with different column names and/or different numbers of conditions? I have an interactive script that I'm trying to convert into a function. There's a lot of repeated code in the case_when statements and I'm wondering if it can be automated somehow without my needing to write everything from scratch again and again.

Here's a dummy dataset:

test_df = tibble(low_A=c(5, 15, NA),
                 low_TOT=c(NA, 10, NA),
                 low_B=c(20, 25, 30),
                 high_A=c(NA, NA, 10),
                 high_TOT=c(NA, 40, NA),
                 high_B=c(60, 20, NA))

expected_df = tibble(low_A=c(5, 15, NA),
                     low_TOT=c(NA, 10, NA),
                     low_B=c(20, 25, 30),
                     ans_low=c(5, 10, 30),
                     high_A=c(NA, NA, 10),
                     high_TOT=c(NA, 40, NA),
                     high_B=c(60, 20, NA),
                     ans_high=c(60, 40, 10))

> expected_df
# A tibble: 3 x 8
  low_A low_TOT low_B ans_low high_A high_TOT high_B ans_high
  <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
1     5      NA    20       5     NA       NA     60       60
2    15      10    25      10     NA       40     20       40
3    NA      NA    30      30     10       NA     NA       10

The logic I want is that if the ._TOT column has a value, use that. If not, then try column ._A, and if not, then column ._B. Note that I intentionally didn't put ._TOT as the first column for a group. I could just use coalesce() in that case, but I want a general solution irrespective of column order.

Of course, all of this is easy to do with a couple of case_when statements. My issues are that:

  1. I'm trying to make a general function and so don't want interactive/tidy evaluation.
  2. I have a whole bunch of columns like this. All ending with one of _TOT, _A, _B but with different prefixes (e.g., low_TOT, low_A, low_B, high_TOT, high_A, high_B,..... and I don't want to rewrite a bunch of case_when functions again and again.

What I have right now looks like this (where I'm writing a case_when for each prefix):

def my_function = function(df) { 
    df %>% mutate(
          # If a total low doesn't exist, use A (if exists) or B (if exists)
          "ans_low" := case_when(
            !is.na(.data[["low_TOT"]]) ~ .data[["low_TOT"]],
            !is.na(.data[["low_A"]]) ~ .data[["low_A"]],
            !is.na(.data[["low_B"]]) ~ .data[["low_B"]],
          ),

          # If a total high doesn't exist, use A (if exists) or B (if exists)
          "ans_high" := case_when(
            !is.na(.data[["high_TOT"]]) ~ .data[["high_TOT"]],
            !is.na(.data[["high_A"]]) ~ .data[["high_R"]],
            !is.na(.data[["high_B"]]) ~ .data[["high_B"]],
              
         # Plus a whole bunch of similar case_when functions...
}

And what I'd like is to ideally get a way to dynamically generate case_when functions with different conditions so that I'm not writing a new case_when each time by exploiting the fact that:

  1. All the three conditions have the same general form, and the same structure for the variable names, but with a different prefix (high_, low_, etc.).
  2. They have the same formula of the form !is.na( .data[[ . ]]) ~ .data[[ . ]], where the dot(.) is the dynamically generated name of the column.

What I'd like is something like:

def my_function = function(df) { 
    df %>% mutate(
          "ans_low" := some_func(prefix="Low"),
          "ans_high" := some_func(prefix="High")
}

I tried creating my own case_when generator to replace the standard case_when as shown below, but I'm getting an error. I'm guessing that's because .data doesn't really work outside of the tidyverse functions?

some_func = function(prefix) {
  case_when(
    !is.na(.data[[ sprintf("%s_TOT", prefix) ]]) ~ .data[[ sprintf("%s_TOT", prefix) ]],
    !is.na(.data[[ sprintf("%s_A", prefix) ]]) ~ .data[[ sprintf("%s_A", prefix) ]],
    !is.na(.data[[ sprintf("%s_B", prefix) ]]) ~ .data[[ sprintf("%s_B", prefix) ]]
  )
}

Something else I'm curious about is making an even more general case_when generator. In the examples thus far, it's only the names (prefix) of the columns that are changing. What if I wanted to

  1. change the number and names of suffixes (e.g., high_W, high_X, high_Y, high_Z, low_W, low_X, low_Y, low_Z, .......) and so make a character vector of suffixes an argument of some_func
  2. change the form of the formula. Right now, it's of the form !is.na(.data[[ . ]]) ~ .data[[ . ]] for all the conditions, but what if I wanted to make this an argument of some_func? For example, !is.na(.data[[ . ]]) ~ sprintf("%s is missing", .)

I'd be happy with just getting it to work with different prefixes but it'd be very cool to understand how I could achieve something even more general with arbitrary (but common) suffixes and arbitrary formulae such that I can do some_func(prefix, suffixes, formula).

Upvotes: 9

Views: 2213

Answers (6)

Anoushiravan R
Anoushiravan R

Reputation: 21908

Updated Solution I think this solution solely based on base R may help you.

fn <- function(data) {
  
  do.call(cbind, lapply(unique(gsub("([[:alpha:]]+)_.*", "\\1", names(test_df))), function(x) {
    tmp <- test_df[paste0(x, c("_TOT", "_A", "_B"))]
    tmp[[paste(x, "ans", sep = "_")]] <- Reduce(function(a, b) {
      i <- which(is.na(a))
      a[i] <- b[i]
      a
    }, tmp)
    tmp
  }))
}

fn(test_df)

fn(test_df)

   high_TOT high_A high_B high_ans low_TOT low_A low_B low_ans
1       NA     NA     60       60      NA     5    20       5
2       40     NA     20       40      10    15    25      10
3       NA     10     NA       10      NA    NA    30      30

Upvotes: 6

AnilGoyal
AnilGoyal

Reputation: 26218

Though the answer has been accepted, I feel this can be done (even for any number of column sets) in dplyr only without the need of writing a custom function earlier.

test_df %>%
  mutate(across(ends_with('_TOT'), ~ coalesce(., 
                                              get(gsub('_TOT', '_A', cur_column())), 
                                              get(gsub('_TOT', '_B', cur_column()))
                                              ),
                .names = "ans_{gsub('_TOT', '', .col)}"))

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10

A complete base R approach

Reduce(function(.x, .y) {
  xx <- .x[paste0(.y, c('_TOT', '_A', '_B'))]
  .x[[paste0('ans_',.y)]] <- apply(xx, 1, \(.z) head(na.omit(.z), 1))
  .x
}, unique(gsub('([_]*)_.*', '\\1', names(test_df))),
init = test_df)

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10

Upvotes: 1

IceCreamToucan
IceCreamToucan

Reputation: 28675

This does not generate any case_when, but you can create the two new columns as follows. Of course this could also be a function with test_df, ans_order, and and_groups as arguments.

ans_order <- c('TOT', 'A', 'B')
ans_groups <- c('low', 'high')

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) do.call(dplyr::coalesce, test_df[x]))

test_df
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

If you'd rather not use any packages, another option is

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) Reduce(function(x, y) ifelse(is.na(x), y, x), test_df[x]))

Upvotes: 2

anonymous1a
anonymous1a

Reputation: 1270

Thanks for all your answers folks! Calum You's answer specifically made me realise that sticking to the Tidyverse all the time isn't necessarily the best and sometimes base R has a better, simpler, and more elegant solution.

Thans to a ton of searching and this excellent post by noahm on the RStduio community, I was also able to come up with a solution of my own that does what I was looking for:

library(tidyverse)
library(rlang)
library(glue)

make_expr = function(prefix, suffix) {
  rlang::parse_expr(glue::glue('!is.na(.data[[\"{prefix}_{suffix}\"]]) ~ .data[[\"{prefix}_{suffix}\"]]'))
}

make_conds = function(prefixes, suffixes){
  map2(prefixes, suffixes, make_expr)
}

ans_df = test_df %>%  
    mutate(
        "ans_low" := case_when(
            !!! make_conds( prefixes=c("low"), suffixes=c("TOT", "A", "B") ) 
        ),
        "ans_high" := case_when(
            !!! make_conds( prefixes=c("high"), suffixes=c("TOT", "A", "B") ) 
        )
    )

# The ans is the same as the expected solution
> all_equal(ans_df, expected_df)
[1] TRUE

I've also checked that this works inside of a function (which was another important consideration for me).

One benefits of this solution is that the suffixes are not hard-coded and achieve at least the first level of generality I was looking for.

I imagine some string manipulation with replacements could possibly also allow for generality with the structure of the formulae. Ultimately, general formulae would require a string templating solution of some sort because with this structure, you can just keep that into glue.

Upvotes: 3

TimTeaFan
TimTeaFan

Reputation: 18551

Here is a custom case_when function that you can call with purrr::reduce and a vector of strings parts of your variable names (in the example c("low", "high"):

library(dplyr)
library(purrr)

my_case_when <- function(df, x) {
  
  mutate(df,
         "ans_{x}" := case_when(
           !is.na(!! sym(paste0(x, "_TOT"))) ~ !! sym(paste0(x, "_TOT")),
           !is.na(!! sym(paste0(x, "_A"))) ~ !! sym(paste0(x, "_A")),
           !is.na(!! sym(paste0(x, "_B"))) ~ !! sym(paste0(x, "_B"))
           )
  )
}

test_df %>% 
  reduce(c("low", "high"), my_case_when, .init = .)

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

I also have a package on Github {dplyover} which is made for this kind of cases. For your example with more than two variables I would use dplyover::over together with a special syntax to evaluate strings as variable names. We can further use dplyover::cut_names("_TOT") to extract the string parts of the variable names that come before or after "_TOT" (in the example this is "low" and "high").

We can either use case_when:

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ case_when(
                  !is.na(.("{.x}_TOT")) ~ .("{.x}_TOT"),
                  !is.na(.("{.x}_A")) ~ .("{.x}_A"),
                  !is.na(.("{.x}_B")) ~ .("{.x}_B")
                  )),
              .names = "{fn}_{x}")
         )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Or somewhat easier coalesce:

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ coalesce(.("{.x}_TOT"),
                                    .("{.x}_A"),
                                    .("{.x}_B"))),
              .names = "{fn}_{x}")
  )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

Upvotes: 8

Calum You
Calum You

Reputation: 15062

At the risk of not answering the question, I think the easiest way to approach this is to just reshape and use coalesce(). Your data structure requires two pivots either way (I think) but this requires no careful thinking about what prefixes are present.

library(tidyverse)

test_df <- tibble(
  low_A = c(5, 15, NA),
  low_TOT = c(NA, 10, NA),
  low_B = c(20, 25, 30),
  high_A = c(NA, NA, 10),
  high_TOT = c(NA, 40, NA),
  high_B = c(60, 20, NA)
)

test_df %>%
  rowid_to_column() %>%
  pivot_longer(cols = -rowid, names_to = c("prefix", "suffix"), names_sep = "_") %>%
  pivot_wider(names_from = suffix, values_from = value) %>%
  mutate(ans = coalesce(TOT, A, B)) %>%
  pivot_longer(cols = c(-rowid, -prefix), names_to = "suffix") %>%
  pivot_wider(names_from = c(prefix, suffix), names_sep = "_", values_from = value)
#> # A tibble: 3 x 9
#>   rowid low_A low_TOT low_B low_ans high_A high_TOT high_B high_ans
#>   <int> <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
#> 1     1     5      NA    20       5     NA       NA     60       60
#> 2     2    15      10    25      10     NA       40     20       40
#> 3     3    NA      NA    30      30     10       NA     NA       10

Note also that case_when has no tidy evaluation, and so just not using mutate simplifies your some_func a lot. You already got an answer using !!sym inside mutate, so here is a version that illustrates a simpler way. I prefer not to use tidyeval unless necessary because I want to use a mutate chain, and here it's not really needed.

some_func <- function(df, prefix) {
  ans <- str_c(prefix, "_ans")
  TOT <- df[[str_c(prefix, "_TOT")]]
  A <- df[[str_c(prefix, "_A")]]
  B <- df[[str_c(prefix, "_B")]]
  
  df[[ans]] <- case_when(
    !is.na(TOT) ~ TOT,
    !is.na(A) ~ A,
    !is.na(B) ~ B
  )
  df
}

reduce(c("low", "high"), some_func, .init = test_df)
#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B low_ans high_ans
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Upvotes: 6

Related Questions