Thomas Philips
Thomas Philips

Reputation: 1089

R dataframe create mulitple new columns from existing columns using across / all_of / mutate_if

I have a dataframe (example below) that has responses to a questionnaire on multiple days.

        > df %>% 
            mutate (Sigma_Bucket_Q1  = if_else(Sigma_Q1 >= Median_Sigma_Q1, 
                    "Above Median Volatility", "Below Median Volatility"))
    # A tibble: 19 x 12
       UserId Days_From_First_Use    Q1    Q2    Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3 Sigma_Bucket_Q1        
       <fct>                <int> <int> <int> <int>    <dbl>    <dbl>    <dbl>           <dbl>           <dbl>           <dbl> <chr>                  
     1 A                        0     3     2     1     1.10    0.837    0.548            1.45            1.59            1.53 Below Median Volatility
     2 A                        1     1     0     0     1.10    0.837    0.548            1.45            1.59            1.53 Below Median Volatility
     3 A                        2     1     1     0     1.10    0.837    0.548            1.45            1.59            1.53 Below Median Volatility
     4 A                        3     0     2     0     1.10    0.837    0.548            1.45            1.59            1.53 Below Median Volatility
     5 A                        4     1     1     1     1.10    0.837    0.548            1.45            1.59            1.53 Below Median Volatility
     6 B                        0     4     8     2     1.26    2.5      2.06             1.45            1.59            1.53 Below Median Volatility
     7 B                        2     2     2     1     1.26    2.5      2.06             1.45            1.59            1.53 Below Median Volatility
     8 B                        4     5     6     5     1.26    2.5      2.06             1.45            1.59            1.53 Below Median Volatility
     9 B                        5     4     5     5     1.26    2.5      2.06             1.45            1.59            1.53 Below Median Volatility
    10 C                        0     5     7     2     1.64    1.87     1                1.45            1.59            1.53 Above Median Volatility
    11 C                        1     2     2     2     1.64    1.87     1                1.45            1.59            1.53 Above Median Volatility
    12 C                        2     5     5     4     1.64    1.87     1                1.45            1.59            1.53 Above Median Volatility
    13 C                        3     6     5     3     1.64    1.87     1                1.45            1.59            1.53 Above Median Volatility
    14 C                        4     6     6     4     1.64    1.87     1                1.45            1.59            1.53 Above Median Volatility
    15 D                        0     5     3     5     2.35    1.30     2.30             1.45            1.59            1.53 Above Median Volatility
    16 D                        1     5     3     4     2.35    1.30     2.30             1.45            1.59            1.53 Above Median Volatility
    17 D                        2     4     2     6     2.35    1.30     2.30             1.45            1.59            1.53 Above Median Volatility
    18 D                        3     0     0     1     2.35    1.30     2.30             1.45            1.59            1.53 Above Median Volatility
    19 D                        4     1     1     1     2.35    1.30     2.30             1.45            1.59            1.53 Above Median Volatility

Columns Q1, Q2 and Q3 have the responses, while Sigma_Q1, Sigma_Q2 and Sigma Q3 have the time series standard deviations of each subject's responses to each question. Median_Sigma_1, Median_Sigma_2, and Median_Sigma_3 have the median standard deviation across subjects for their responses to Q1, Q2 and Q3. I want to classify each subject as an Above Median or Below Median Volatility subject according to whether or not Sigma_Q1 > Median_Sigma_Q1 and so on. My expression for generating Sigma_Bucket_Q1 works just fine; it is visible right before the tibble.

But when I try to generalize it to generate all the Sigma_Buckets simultaneously (my real problem has 21 such names), I run into a problem. I tried:

        df %>% 
  mutate (across(all_of(paste0("Sigma_Bucket_", c("Q1", "Q2", "Q3")) = if_else(paste0("Sigma_", {.col}) >= paste0("Median_Sigma_",  {.col}), 
          "Above Median Volatility", "Below Median Volatility")))

I get a cryptic error message and can't determine what I need to fix:

> df %>% 
+   mutate (across(all_of(paste0("Sigma_Bucket_", c("Q1", "Q2", "Q3")) = if_else(paste0("Sigma_", {.col}) >= paste0("Median_Sigma_",  {.col}), 
Error: unexpected '=' in:
"df %>% 
  mutate (across(all_of(paste0("Sigma_Bucket_", c("Q1", "Q2", "Q3")) ="
>           "Above Median Volatility", "Below Median Volatility")))
Error: unexpected ',' in "          "Above Median Volatility","

How can I modify my statement to do all 3 columns (all 21 in the real problem) without writing one line for each question?

Looking through various answers on StackOverflow suggests that mutate_if might be the basis of a solution, but I don't see how it can be used in this particular setting.

Many thanks in advance for your assistance

Thomas Philips

Upvotes: 0

Views: 41

Answers (2)

andrew_reece
andrew_reece

Reputation: 21284

Here's a solution using map:

map2_df(
    df %>% select(starts_with("Sigma_Q")), 
    df %>% select(starts_with("Median_Sigma_Q")),
    ~if_else(.x >= .y, "Above Median Volatility", "Below Median Volatility")) %>%
  rename_with(~str_replace(.x, "Sigma", "Sigma_Bucket"))

Output:

# A tibble: 19 x 3
   Sigma_Bucket_Q1         Sigma_Bucket_Q2         Sigma_Bucket_Q3        
   <chr>                   <chr>                   <chr>                  
 1 Below Median Volatility Below Median Volatility Below Median Volatility
 2 Below Median Volatility Below Median Volatility Below Median Volatility
 3 Below Median Volatility Below Median Volatility Below Median Volatility
 4 Below Median Volatility Below Median Volatility Below Median Volatility
 5 Below Median Volatility Below Median Volatility Below Median Volatility
 6 Below Median Volatility Above Median Volatility Above Median Volatility
 7 Below Median Volatility Above Median Volatility Above Median Volatility
 8 Below Median Volatility Above Median Volatility Above Median Volatility
 9 Below Median Volatility Above Median Volatility Above Median Volatility
10 Above Median Volatility Above Median Volatility Below Median Volatility
11 Above Median Volatility Above Median Volatility Below Median Volatility
12 Above Median Volatility Above Median Volatility Below Median Volatility
13 Above Median Volatility Above Median Volatility Below Median Volatility
14 Above Median Volatility Above Median Volatility Below Median Volatility
15 Above Median Volatility Below Median Volatility Above Median Volatility
16 Above Median Volatility Below Median Volatility Above Median Volatility
17 Above Median Volatility Below Median Volatility Above Median Volatility
18 Above Median Volatility Below Median Volatility Above Median Volatility
19 Above Median Volatility Below Median Volatility Above Median Volatility

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389235

across do not have access to column names, they pass only the column values. You can try this vectorized base R approach without any loops.

col1 <- grep('^Sigma_Q\\d$', names(df), value = TRUE)
col2 <- grep('^Median_Sigma_Q\\d$', names(df), value = TRUE)

df[paste0(col1, '_Bucket')] <- c("Below Median Volatility", "Above Median Volatility")[(df[col1] >= df[col2]) + 1]

Upvotes: 1

Related Questions