Tanmoy Chatterjee
Tanmoy Chatterjee

Reputation: 21

Passing column names to user defined function inside mutate_at

I am struggling to pass column names inside my custom function while using dplyr - mutate_at. I have a dataset "dt" with thousands of columns and I would like to perform mutate for some of these columns, but in a way which is dependent on the column name

I have this piece of code

Option 1:

relevantcols = c("A", "B", "C")
myfunc <- function(colname, x) {
   #write different logic per column name
}
dt%>%
  mutate_at(relevantcols, funs(myfunc(<what should i give?>,.)))

I tried approaching the problem in another way, i.e by iterating over relevantcols and applying mutate_at for each of the elements of the vector as follows

Option 2:

for (i in 1:length(relevantcols)){
  dt%>%
  mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .))
}

I get the colnames in Option 2, but it is 10 times slower than Option 1. Can I get somehow the column names in Option 1?

Adding an example for more clarity

df = data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
df
 employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     5.234383     6.857227     4.480943     7.233947
2        2     7.441399     3.777524     2.134075     6.310293
3        3     7.686558     8.598688     9.814882     9.192952
4        4     6.033345     5.658716     5.167388     3.018563
5        5     5.595006     7.582548     9.302917     6.071108
relevantcols = c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
#based on the channel and weekday, compare the data from corresponding column with  the same channel but different weekday and return T if higher else F
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     T     F     4.480943     7.233947
2        2     T     F     2.134075     6.310293
3        3     F     F     9.814882     9.192952
4        4     T     T     5.167388     3.018563
5        5     F     T     9.302917     6.071108

Upvotes: 2

Views: 1238

Answers (3)

TimTeaFan
TimTeaFan

Reputation: 18581

This is an old question, but I just stumbled over one possible way to solve it using a custom mutate/case_when function in combination with purrr::reduce.

It's important to use non-standard evaluation (NSE) inside the mutate/case_when statement to match the variable names you need for your custom function.

I do not know a way to do something similar with mutate_at.

Below I provide two examples, the most basic form (using your original data), and a more advanced version (which contains three weekdays and two channels and) which creates more than two variables. The latter requires an initial set-up using, for example, switch.

Basic example

library(tidyverse)

# your data
df <- data.frame(employee=seq(1:5),
                Mon_channelA=runif(5,1,10),
                Mon_channelB=runif(5,1,10),
                Tue_channelA=runif(5,1,10),
                Tue_channelB=runif(5,1,10)
                )

# custom function which takes two arguments, df and a string variable name
myfunc <- function(df, x) {

  mutate(df,
         # overwrites all "Mon_channel" variables ...
         !! paste0("Mon_", x) := case_when(  
           # ... with TRUE, when Mon_channel is smaller than Tue_channel, and FALSE else
            !! sym(paste0("Mon_", x)) < !! sym(paste0("Tue_", x))  ~ T,
            T ~ F
           )
         )
}

# define the variables you want to loop over
var_ls <- c("channelA", "channelB")

# use var_ls and myfunc with reduce on your data
df %>% 
  reduce(var_ls, myfunc, .init = .)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1        FALSE        FALSE     3.437975     2.458389
#> 2        2        FALSE         TRUE     3.686903     4.772390
#> 3        3         TRUE         TRUE     5.158234     5.378021
#> 4        4         TRUE         TRUE     5.338950     3.109760
#> 5        5         TRUE        FALSE     6.365173     3.450495

Created on 2020-02-03 by the reprex package (v0.3.0)

More advanced example

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2

# your data plus one weekday with two channels
df <- data.frame(employee=seq(1:5),
                Mon_channelA=runif(5,1,10),
                Mon_channelB=runif(5,1,10),
                Tue_channelA=runif(5,1,10),
                Tue_channelB=runif(5,1,10),
                Wed_channelA=runif(5,1,10),
                Wed_channelB=runif(5,1,10)
                )

# custom function which takes two argument, df and a string variable name
myfunc <- function(df, x) {

  # an initial set-up is needed

  # id gets the original day
  id <- str_extract(x, "^\\w{3}")

  # based on id the day of comparison is mapped with switch
  y <- switch(id,
              "Mon" = "Tue",
              "Tue" = "Wed")

  # j extracts the channel name including the underscore
  j <- str_extract(x, "_channel[A-Z]{1}")

  # this makes the function definition rather easy:
  mutate(df,
         !! x := case_when(  
            !! sym(x) < !! sym(paste0(y, j))  ~ T,
            T ~ F
           )
         )
}

# define the variables you want to loop over
var_ls <- c("Mon_channelA",
            "Mon_channelB",
            "Tue_channelA",
            "Tue_channelB")

# use var_ls and myfunc with reduce on your data
df %>% 
  reduce(var_ls, myfunc, .init = .)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1         TRUE         TRUE         TRUE        FALSE
#> 2        2        FALSE         TRUE         TRUE        FALSE
#> 3        3        FALSE         TRUE        FALSE         TRUE
#> 4        4        FALSE         TRUE         TRUE        FALSE
#> 5        5         TRUE        FALSE        FALSE        FALSE
#>   Wed_channelA Wed_channelB
#> 1     9.952454     5.634686
#> 2     9.356577     4.514683
#> 3     2.721330     7.107316
#> 4     4.410240     2.740289
#> 5     5.394057     4.772162

Created on 2020-02-03 by the reprex package (v0.3.0)

Upvotes: 0

camille
camille

Reputation: 16871

I left a comment about data types, but assuming that that is what you're looking for, here's the approach I take to these sorts of problems. I do this in a seemingly convoluted process of reshaping a few times, but it lets you set up the variables that you're trying to compare without hard-coding much. I'll break it into pieces.

library(tidyverse)

set.seed(928)
df <- data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))

First, I'd reshape it into a long shape and break the "Mon_channelA", etc apart into a day and a channel. This lets you use the channel designation to match values for comparison.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  head()
#>   employee day  channel    value
#> 1        1 Mon channelA 2.039619
#> 2        2 Mon channelA 8.153684
#> 3        3 Mon channelA 9.027932
#> 4        4 Mon channelA 1.161967
#> 5        5 Mon channelA 3.583353
#> 6        1 Mon channelB 7.102797

Then, bring it back into a wide format based on the days. Now you have a column for each day for each combination of employee and channel.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  head()
#>   employee  channel      Mon      Tue
#> 1        1 channelA 2.039619 9.826677
#> 2        1 channelB 7.102797 7.388568
#> 3        2 channelA 8.153684 5.848375
#> 4        2 channelB 6.299178 9.452274
#> 5        3 channelA 9.027932 5.458906
#> 6        3 channelB 7.029408 7.087011

Then do your comparison, and take the data long again. Note that because the value column has numeric values, everything becomes numeric and the logical values are converted to 1 or 0.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  head()
#>   employee  channel day value
#> 1        1 channelA Mon     0
#> 2        1 channelB Mon     0
#> 3        2 channelA Mon     1
#> 4        2 channelB Mon     0
#> 5        3 channelA Mon     1
#> 6        3 channelB Mon     0

Last few steps are to stick the day and channel back together to make the labels as you had them, spread back to a wide format, and turn all the columns starting with "Mon" back into logicals.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  unite("variable", day, channel) %>%
  spread(key = variable, value = value) %>%
  mutate_at(vars(starts_with("Mon")), as.logical)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1        FALSE        FALSE     9.826677     7.388568
#> 2        2         TRUE        FALSE     5.848375     9.452274
#> 3        3         TRUE        FALSE     5.458906     7.087011
#> 4        4        FALSE        FALSE     8.854263     8.946458
#> 5        5        FALSE        FALSE     6.933054     8.450741

Created on 2018-09-28 by the reprex package (v0.2.1)

Upvotes: 1

Nicolas2
Nicolas2

Reputation: 2210

You can do things like :

L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
#  A B  C
#1 1 1  7
#2 2 2  8
#3 3 3  9
#4 1 4 10
#5 2 5 11
#6 3 6 12

f <- function(x,y) x^y

df %>% mutate_at(L,funs(f(.,2)))
#  A  B  C
#1 1  1  7
#2 4  4  8
#3 9  9  9
#4 1 16 10
#5 4 25 11
#6 9 36 12

Upvotes: 0

Related Questions