Reputation: 21
I am struggling to pass column names inside my custom function while using dplyr - mutate_at. I have a dataset "dt" with thousands of columns and I would like to perform mutate for some of these columns, but in a way which is dependent on the column name
I have this piece of code
Option 1:
relevantcols = c("A", "B", "C")
myfunc <- function(colname, x) {
#write different logic per column name
}
dt%>%
mutate_at(relevantcols, funs(myfunc(<what should i give?>,.)))
I tried approaching the problem in another way, i.e by iterating over relevantcols and applying mutate_at for each of the elements of the vector as follows
Option 2:
for (i in 1:length(relevantcols)){
dt%>%
mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .))
}
I get the colnames in Option 2, but it is 10 times slower than Option 1. Can I get somehow the column names in Option 1?
Adding an example for more clarity
df = data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
df
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 5.234383 6.857227 4.480943 7.233947
2 2 7.441399 3.777524 2.134075 6.310293
3 3 7.686558 8.598688 9.814882 9.192952
4 4 6.033345 5.658716 5.167388 3.018563
5 5 5.595006 7.582548 9.302917 6.071108
relevantcols = c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
#based on the channel and weekday, compare the data from corresponding column with the same channel but different weekday and return T if higher else F
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 T F 4.480943 7.233947
2 2 T F 2.134075 6.310293
3 3 F F 9.814882 9.192952
4 4 T T 5.167388 3.018563
5 5 F T 9.302917 6.071108
Upvotes: 2
Views: 1238
Reputation: 18581
This is an old question, but I just stumbled over one possible way to solve it using a custom mutate/case_when
function in combination with purrr::reduce
.
It's important to use non-standard evaluation (NSE) inside the mutate/case_when
statement to match the variable names you need for your custom function.
I do not know a way to do something similar with mutate_at
.
Below I provide two examples, the most basic form (using your original data), and a more advanced version (which contains three weekdays and two channels and) which creates more than two variables. The latter requires an initial set-up using, for example, switch
.
library(tidyverse)
# your data
df <- data.frame(employee=seq(1:5),
Mon_channelA=runif(5,1,10),
Mon_channelB=runif(5,1,10),
Tue_channelA=runif(5,1,10),
Tue_channelB=runif(5,1,10)
)
# custom function which takes two arguments, df and a string variable name
myfunc <- function(df, x) {
mutate(df,
# overwrites all "Mon_channel" variables ...
!! paste0("Mon_", x) := case_when(
# ... with TRUE, when Mon_channel is smaller than Tue_channel, and FALSE else
!! sym(paste0("Mon_", x)) < !! sym(paste0("Tue_", x)) ~ T,
T ~ F
)
)
}
# define the variables you want to loop over
var_ls <- c("channelA", "channelB")
# use var_ls and myfunc with reduce on your data
df %>%
reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 FALSE FALSE 3.437975 2.458389
#> 2 2 FALSE TRUE 3.686903 4.772390
#> 3 3 TRUE TRUE 5.158234 5.378021
#> 4 4 TRUE TRUE 5.338950 3.109760
#> 5 5 TRUE FALSE 6.365173 3.450495
Created on 2020-02-03 by the reprex package (v0.3.0)
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
# your data plus one weekday with two channels
df <- data.frame(employee=seq(1:5),
Mon_channelA=runif(5,1,10),
Mon_channelB=runif(5,1,10),
Tue_channelA=runif(5,1,10),
Tue_channelB=runif(5,1,10),
Wed_channelA=runif(5,1,10),
Wed_channelB=runif(5,1,10)
)
# custom function which takes two argument, df and a string variable name
myfunc <- function(df, x) {
# an initial set-up is needed
# id gets the original day
id <- str_extract(x, "^\\w{3}")
# based on id the day of comparison is mapped with switch
y <- switch(id,
"Mon" = "Tue",
"Tue" = "Wed")
# j extracts the channel name including the underscore
j <- str_extract(x, "_channel[A-Z]{1}")
# this makes the function definition rather easy:
mutate(df,
!! x := case_when(
!! sym(x) < !! sym(paste0(y, j)) ~ T,
T ~ F
)
)
}
# define the variables you want to loop over
var_ls <- c("Mon_channelA",
"Mon_channelB",
"Tue_channelA",
"Tue_channelB")
# use var_ls and myfunc with reduce on your data
df %>%
reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE TRUE FALSE TRUE
#> 4 4 FALSE TRUE TRUE FALSE
#> 5 5 TRUE FALSE FALSE FALSE
#> Wed_channelA Wed_channelB
#> 1 9.952454 5.634686
#> 2 9.356577 4.514683
#> 3 2.721330 7.107316
#> 4 4.410240 2.740289
#> 5 5.394057 4.772162
Created on 2020-02-03 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 16871
I left a comment about data types, but assuming that that is what you're looking for, here's the approach I take to these sorts of problems. I do this in a seemingly convoluted process of reshaping a few times, but it lets you set up the variables that you're trying to compare without hard-coding much. I'll break it into pieces.
library(tidyverse)
set.seed(928)
df <- data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
First, I'd reshape it into a long shape and break the "Mon_channelA
", etc apart into a day and a channel. This lets you use the channel designation to match values for comparison.
df %>%
gather(key, value, -employee) %>%
separate(key, into = c("day", "channel"), sep = "_") %>%
head()
#> employee day channel value
#> 1 1 Mon channelA 2.039619
#> 2 2 Mon channelA 8.153684
#> 3 3 Mon channelA 9.027932
#> 4 4 Mon channelA 1.161967
#> 5 5 Mon channelA 3.583353
#> 6 1 Mon channelB 7.102797
Then, bring it back into a wide format based on the days. Now you have a column for each day for each combination of employee and channel.
df %>%
gather(key, value, -employee) %>%
separate(key, into = c("day", "channel"), sep = "_") %>%
spread(key = day, value = value) %>%
head()
#> employee channel Mon Tue
#> 1 1 channelA 2.039619 9.826677
#> 2 1 channelB 7.102797 7.388568
#> 3 2 channelA 8.153684 5.848375
#> 4 2 channelB 6.299178 9.452274
#> 5 3 channelA 9.027932 5.458906
#> 6 3 channelB 7.029408 7.087011
Then do your comparison, and take the data long again. Note that because the value
column has numeric values, everything becomes numeric and the logical values are converted to 1 or 0.
df %>%
gather(key, value, -employee) %>%
separate(key, into = c("day", "channel"), sep = "_") %>%
spread(key = day, value = value) %>%
mutate(Mon = Mon > Tue) %>%
gather(key = day, value = value, Mon, Tue) %>%
head()
#> employee channel day value
#> 1 1 channelA Mon 0
#> 2 1 channelB Mon 0
#> 3 2 channelA Mon 1
#> 4 2 channelB Mon 0
#> 5 3 channelA Mon 1
#> 6 3 channelB Mon 0
Last few steps are to stick the day and channel back together to make the labels as you had them, spread back to a wide format, and turn all the columns starting with "Mon"
back into logicals.
df %>%
gather(key, value, -employee) %>%
separate(key, into = c("day", "channel"), sep = "_") %>%
spread(key = day, value = value) %>%
mutate(Mon = Mon > Tue) %>%
gather(key = day, value = value, Mon, Tue) %>%
unite("variable", day, channel) %>%
spread(key = variable, value = value) %>%
mutate_at(vars(starts_with("Mon")), as.logical)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 FALSE FALSE 9.826677 7.388568
#> 2 2 TRUE FALSE 5.848375 9.452274
#> 3 3 TRUE FALSE 5.458906 7.087011
#> 4 4 FALSE FALSE 8.854263 8.946458
#> 5 5 FALSE FALSE 6.933054 8.450741
Created on 2018-09-28 by the reprex package (v0.2.1)
Upvotes: 1
Reputation: 2210
You can do things like :
L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
# A B C
#1 1 1 7
#2 2 2 8
#3 3 3 9
#4 1 4 10
#5 2 5 11
#6 3 6 12
f <- function(x,y) x^y
df %>% mutate_at(L,funs(f(.,2)))
# A B C
#1 1 1 7
#2 4 4 8
#3 9 9 9
#4 1 16 10
#5 4 25 11
#6 9 36 12
Upvotes: 0