gofraidh
gofraidh

Reputation: 683

User defined function with mutate and case_when in R

I would like to know if/how can I turn the call bellow into a function that can be used in a task that I do fairly often with my data. Sadly, I can't figure out how to design function from the call that involves mutate, and case_when, both of these functions rely on dplyr package and require number of additional arguments.

Alternatively, the call itself seems redundant to me with so many case_when, perhaps it's possible to reduce how many times its used.

Any help and information about alternative approaches is welcomed.

The call looks like this:

library(dplyr)
library(stringr)

test_data %>%
  mutate(
    multipleoptions_o1 = case_when(
      str_detect(multipleoptions, "option1") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o2 = case_when(
      str_detect(multipleoptions, "option2") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o3 = case_when(
      str_detect(multipleoptions, "option3") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0),
    multipleoptions_o4 = case_when(
      str_detect(multipleoptions, "option4") ~ 1,
      is.na(multipleoptions) ~ NA_real_,
      TRUE ~ 0)
  )

Sample data:

structure(list(multipleoptions = c("option1", "option2", "option3", 
NA, "option2,option3", "option4")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Desired output of the function:

structure(list(multipleoptions = c("option1", "option2", "option3", 
NA, "option2,option3", "option4"), multipleoptions_o1 = c(1, 
0, 0, NA, 0, 0), multipleoptions_o2 = c(0, 1, 0, NA, 1, 0), multipleoptions_o3 = c(0, 
0, 1, NA, 1, 0), multipleoptions_o4 = c(0, 0, 0, NA, 0, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

Arguments of the function should probably be: data (i.e., input dataset), multipleoptions (i.e., the column from data containing answer options, always one column), patterns_to_look_for (i.e., str_detect patterns to look up in the multipleoptions), number_of_options, ideally the number of options can be more or less than 4, (I am not sure if it's achievable), output_columns (i.e., names of new columns, it's always name or original column followed by the option number or option name).

Upvotes: 3

Views: 1900

Answers (1)

eipi10
eipi10

Reputation: 93761

You can avoid the lengthy case_when code by splitting the options into separate elements, taking advantage of nesting/unnesting to get a single column of options, and then spreading to get a separate column for each option.

Updated Answer

library(tidyverse)

# Arguments
# data     A data frame
# patterns Regular expression giving the pattern(s) at which to split the options strings
# ...      Grouping columns, the first of which must be the "options" column.
#           If options has repeated values, then there must be a second grouping 
#           column (an "ID" column) to differentiate these repeated values.
fnc = function(data, patterns, ...) {
  col = quos(...)

  data %>% 
    mutate(option=str_split(!!!col[[1]], patterns)) %>% 
    unnest %>% 
    mutate(value=1) %>% 
    group_by(!!!col) %>% 
    mutate(num_chosen = ifelse(is.na(!!!col[[1]]), 0, sum(value))) %>% 
    spread(option, value, fill=0) %>%
    select_at(vars(-matches("NA")))
}

fnc(test_data, ",", multipleoptions)
  multipleoptions num_chosen option1 option2 option3 option4
1         option1          1       1       0       0       0
2         option2          1       0       1       0       0
3 option2,option3          2       0       1       1       0
4         option3          1       0       0       1       0
5         option4          1       0       0       0       1
6            <NA>          0       0       0       0       0
# Fake data
ops = paste0("option",1:4)

set.seed(2)
d = data_frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=","))) 
# Add missing values
d = bind_rows(d[1:5,], data.frame(var=rep(NA,3)), d[6:nrow(d),])

fnc(d %>% mutate(ID=1:n()), ",", var, ID)
                               var ID num_chosen option1 option2 option3 option4
1                          option1 17          1       1       0       0       0
2                  option1,option2 12          2       1       1       0       0
3          option1,option2,option3  5          3       1       1       1       0
4  option1,option2,option4,option3  9          4       1       1       1       1
5                  option1,option3  2          2       1       0       1       0
6          option1,option3,option4  3          3       1       0       1       1
7          option1,option4,option2 20          3       1       1       0       1
8  option1,option4,option3,option2 13          4       1       1       1       1
9                          option2 11          1       0       1       0       0
10                 option2,option3 23          2       0       1       1       0
11         option2,option3,option4 21          3       0       1       1       1
12                         option3  1          1       0       0       1       0
13                         option3 15          1       0       0       1       0
14                 option3,option1  4          2       1       0       1       0
15         option3,option2,option4 14          3       0       1       1       1
16 option3,option4,option2,option1 22          4       1       1       1       1
17                         option4 10          1       0       0       0       1
18                         option4 16          1       0       0       0       1
19                         option4 18          1       0       0       0       1
20         option4,option2,option3 19          3       0       1       1       1
21                            <NA>  6          0       0       0       0       0
22                            <NA>  7          0       0       0       0       0
23                            <NA>  8          0       0       0       0       0

Original Answer

test_data %>% 
  filter(!is.na(multipleoptions)) %>% 
  mutate(option=str_split(multipleoptions, ",")) %>% 
  unnest %>% 
  mutate(value=1) %>% 
  spread(option, value)
  multipleoptions option1 option2 option3 option4
  <chr>             <dbl>   <dbl>   <dbl>   <dbl>
1 option1               1      NA      NA      NA
2 option2              NA       1      NA      NA
3 option2,option3      NA       1       1      NA
4 option3              NA      NA       1      NA
5 option4              NA      NA      NA       1

Packaging this into a function:

fnc = function(data, col, patterns) {
  col = enquo(col)

  data %>% 
    filter(!is.na(!!col)) %>% 
    mutate(option=str_split(!!col, patterns)) %>% 
    unnest %>% 
    mutate(value=1) %>% 
    spread(option, value)
}


fnc(test_data, multipleoptions, ",")

If your real data has more than one row with the same value of multipleoptons, then this code will work only if there's also an ID column that distinguishes different rows with the same value of multipleoptions. For example:

# Fake data
ops = paste0("option",1:4)

set.seed(2)
d = data.frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))

fnc(d, var, ",")

Error: Duplicate identifiers for rows (1, 27), (16, 28, 30)

# Add unique row identifier
fnc(d %>% mutate(ID = 1:n()), var, ",")
                               var ID option1 option2 option3 option4
1                          option1 14       1      NA      NA      NA
2                  option1,option2  9       1       1      NA      NA
3          option1,option2,option3  5       1       1       1      NA
4  option1,option2,option4,option3  6       1       1       1       1
5                  option1,option3  2       1      NA       1      NA
6          option1,option3,option4  3       1      NA       1       1
7          option1,option4,option2 17       1       1      NA       1
8  option1,option4,option3,option2 10       1       1       1       1
9                          option2  8      NA       1      NA      NA
10                 option2,option3 20      NA       1       1      NA
11         option2,option3,option4 18      NA       1       1       1
12                         option3  1      NA      NA       1      NA
13                         option3 12      NA      NA       1      NA
14                 option3,option1  4       1      NA       1      NA
15         option3,option2,option4 11      NA       1       1       1
16 option3,option4,option2,option1 19       1       1       1       1
17                         option4  7      NA      NA      NA       1
18                         option4 13      NA      NA      NA       1
19                         option4 15      NA      NA      NA       1
20         option4,option2,option3 16      NA       1       1       1

Upvotes: 4

Related Questions