Reputation: 2833
I am making a large data frame using mutate with lots of ifelse conditions. My approach is to not name the columns within mutate, because I have many hundreds of these conditions and each time I update one I then have to update them all. Rather, I wish to name the columns after the operation, outside of mutate.
Here is some code outlining what I'm trying to do:
library(dplyr)

df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))
df2 <- df %>%
  mutate(# condition 1
         ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0),
         # condition 2
         ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
         # condition 3
         ifelse(a < b, 1, 0),
         .keep = 'none'
  )
# name the columns df1, df2, ... after the fact
c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names
The trouble is that mutate truncates the column names of the long ifelse conditions (# condition 1 and # condition 2) and gives both the same name, ifelse(...), so I end up with only 2 columns instead of 3.
Is there something I can do to prevent this behaviour, or is there a more efficient way of achieving what I'm trying to do? I want to avoid manually typing out hundreds of column names, one per condition, every time I need to update the df. Ideally I would be able to map each new column name back to the condition that produced it, e.g.
df3 = ifelse(a < b, 1, 0)
This is possible when mutate doesn't repair the column name.
Upvotes: 0
Views: 119
Reputation: 66880
From the first two examples it looks like some of your tests have a common structure and would be better expressed using a function instead of a lot of manually generated code, which is brittle and hard to maintain.
For instance, it seems one structure is: "Is prior col1 - prior col2 < all entries in some range of prior col3 - col4?"
The first and second tests are identical except for the range of rows they cover.
I haven't tested this much so I'm not guaranteeing anything, but I suspect something like this could save you a lot of time and exasperation.
A further refinement could be to make the column name be a concatenation of the parameters, presuming they are unique. For the moment I've just copied @margusl's approach with uuid.
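# Requires dplyr, slider and uuid.
# Note: lag_min is the furthest lag back and lag_max the nearest one,
# matching the calls below (e.g. 4, 2 covers lags 4 down to 2).
compare_func <- function(my_df, col1, col2, lag1, col3, col4, lag_min, lag_max) {
  my_df %>%
    mutate("{uuid::UUIDgenerate()}" := 1 * (
      lag({{col1}} - {{col2}}, lag1) <
        # smallest col3 - col4 over rows [i - lag_min, i - lag_max]
        slider::slide_dbl({{col3}} - {{col4}}, ~min(.x, na.rm = TRUE),
                          .before = lag_min, .after = -lag_max)))
}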
Then your code is much simpler, and each test is specified directly by its parameters:
df %>%
  compare_func(a, c, 1, a, b, 4, 2) %>%
  compare_func(a, c, 1, a, b, 6, 2) %>%
  mutate("{uuid::UUIDgenerate()}" := 1*(a < b))
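The refinement of naming each column from the parameters of its test might look like the sketch below. lag_compare is a hypothetical helper, deliberately simpler than compare_func so that it needs only dplyr; the part of interest is the name string, which mixes {{ }} for column arguments with { } for plain values. The same kind of string could presumably replace the UUID in compare_func.

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration, not the answer's code):
# flags rows where lagged col1 is below lagged col2, and names the new
# column after the arguments that define the test.
lag_compare <- function(my_df, col1, col2, lag1) {
  my_df %>%
    mutate("{{col1}}_lt_{{col2}}_lag{lag1}" :=
             1 * (lag({{col1}}, lag1) < lag({{col2}}, lag1)))
}

set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

out <- df %>%
  lag_compare(a, b, 1) %>%
  lag_compare(a, c, 2)

names(out)
```

With this naming the two new columns should come out as a_lt_b_lag1 and a_lt_c_lag2, so each column states which test produced it. Two tests with identical parameters would produce the same name and the second would overwrite the first, which is why the parameters need to be unique.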
Upvotes: 0
Reputation: 146090
You can wrap the columns in data.frame, which does not truncate the names so heavily. (The mutate help page notes that the ... arguments can be "a data frame or tibble, to create multiple columns in the output.")
df2 <- df %>%
  mutate(
    data.frame(
      # condition 1
      ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0),
      # condition 2
      ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
               (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
      # condition 3
      ifelse(a < b, 1, 0)
    ),
    .keep = 'none'
  )
c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names
df2
#    df1 df2 df3
# 1   NA  NA   0
# 2   NA  NA   0
# 3   NA  NA   0
# 4   NA  NA   0
# 5    1  NA   0
# 6    0   0   1
# 7    1   1   1
# 8    1   1   0
# 9    0   0   1
# 10   1   1   0
# 11   0   0   0
# 12   0   0   0
# 13   0   0   1
# 14   1   0   1
# 15   1   0   1
# 16   0   0   1
# 17   0   0   0
# 18   0   0   1
# 19   1   1   1
# 20   1   1   1
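As a possible variation, the renaming step can stay inside the pipe by using dplyr's rename_with, which passes the current column names to a function and uses whatever it returns. This is only a sketch with two shortened, made-up conditions; the seed and the conditions are assumptions for illustration.

```r
library(dplyr)

set.seed(1)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

df2 <- df %>%
  mutate(
    data.frame(
      # condition 1 (shortened for the example)
      ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)), 1, 0),
      # condition 2
      ifelse(a < b, 1, 0)
    ),
    .keep = 'none'
  ) %>%
  # replace whatever names data.frame generated with df1, df2, ...
  rename_with(~ paste0('df', seq_along(.x)))

names(df2)
```

Because the names are generated from the column positions, adding or removing a condition needs no other change.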
Upvotes: 1
Reputation: 17564
You could use unique / random column names, UUID for example:
library(dplyr)
set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))
df2 <- df %>%
  mutate(# condition 1
         "{uuid::UUIDgenerate()}" :=
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0),
         # condition 2
         "{uuid::UUIDgenerate()}" :=
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
                    (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
         # condition 3
         "{uuid::UUIDgenerate()}" :=
           ifelse(a < b, 1, 0),
         .keep = 'none'
  )
str(df2)
#> 'data.frame': 20 obs. of 3 variables:
#> $ 2175b2b7-511f-471a-94d5-d82116b12137: num NA NA NA 0 1 1 0 0 1 1 ...
#> $ 07e353a6-58b9-4c50-9c08-2b7c742cf28b: num NA NA NA 0 NA NA 0 0 1 1 ...
#> $ a4fb004b-f498-4da0-b60b-1fbf872670a5: num 0 1 0 0 0 0 1 1 0 1 ...
c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names
str(df2)
#> 'data.frame': 20 obs. of 3 variables:
#> $ df1: num NA NA NA 0 1 1 0 0 1 1 ...
#> $ df2: num NA NA NA 0 NA NA 0 0 1 1 ...
#> $ df3: num 0 1 0 0 0 0 1 1 0 1 ...
Created on 2024-01-30 with reprex v2.0.2
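If you also need to map df1, df2, ... back to the condition behind each column, one possible approach is to keep the conditions as a list of unevaluated expressions, name the list programmatically, and splice it into mutate with !!!. The sketch below assumes rlang is installed alongside dplyr; the object names conds and key are made up for the example, and the conditions are shortened.

```r
library(dplyr)
library(rlang)

set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

# Capture the conditions without evaluating them; nothing is named by hand
conds <- exprs(
  ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
           (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)), 1, 0),
  ifelse(a < b, 1, 0)
)

# Names are generated from the position in the list
names(conds) <- paste0('df', seq_along(conds))

# Splice the named expressions into mutate as if they had been typed out
df2 <- df %>%
  mutate(!!!conds, .keep = 'none')

# Lookup table from column name back to the text of its condition
key <- data.frame(
  column = names(conds),
  condition = vapply(conds, function(e) paste(deparse(e), collapse = ' '),
                     character(1), USE.NAMES = FALSE)
)

str(df2)
key
```

Since every expression gets an explicit name before it reaches mutate, no automatic naming takes place, so long conditions cannot collide.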
Upvotes: 1