ThallyHo

Reputation: 2833

Stop mutate truncating column names

I am building a large data frame using mutate with many ifelse conditions. I would rather not name the columns inside mutate: there are many hundreds of conditions, and whenever I update one I have to update all the names. Instead, I want to name the columns outside mutate, after the operation.

Here is some code outlining what I'm trying to do:

library(dplyr)

df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

df2 <- df %>%
    mutate(# condition 1
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0), 
           # condition 2
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
           # condition 3
           ifelse(a < b, 1, 0),
           .keep = 'none'
           )

c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names

The trouble is that mutate truncates the column names of the long ifelse conditions (# condition 1 and # condition 2) and lumps them together as ifelse(...), so I end up with only 2 columns instead of 3.

Is there something I can do to prevent this behaviour, or is there a more efficient way of achieving what I'm trying to do? I want to avoid manually typing out hundreds of column names every time I need to update the data frame. Ideally I would also be able to map the identity of each condition back to the new column name, e.g.

df3 = ifelse(a < b, 1, 0)

This is possible when mutate doesn't repair the column name.

Upvotes: 0

Views: 119

Answers (3)

Jon Spring

Reputation: 66880

From the first two examples it looks like some of your tests have a common structure and would be better expressed using a function instead of a lot of manually generated code, which is brittle and hard to maintain.

For instance, one structure seems to be: "Is prior col1 - prior col2 less than every entry in some range of prior col3 - col4?" The first and second tests are identical except for the range of lags they check.

I haven't tested this much so I'm not guaranteeing anything, but I suspect something like this could save you a lot of time and exasperation.

A further refinement could be to make the column name be a concatenation of the parameters, presuming they are unique. For the moment I've just copied @margusl's approach with uuid.

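library(dplyr)

# lag_min is the farthest lag in the window and lag_max the nearest,
# matching calls such as compare_func(a, c, 1, a, b, 4, 2)
compare_func <- function(my_df, col1, col2, lag1, col3, col4, lag_min, lag_max) {
  my_df %>%
    mutate("{uuid::UUIDgenerate()}" := 1*(
      lag({{col1}} - {{col2}}, lag1) < 
        # window runs from lag_min rows back to lag_max rows back
        slider::slide_dbl({{col3}} - {{col4}}, ~min(.x, na.rm = TRUE),
                          .before = lag_min, .after = -lag_max)))
}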

Then your code is much simpler, and each test is specified directly by its parameters:

df %>%
  compare_func(a, c, 1, a, b, 4, 2) %>%
  compare_func(a, c, 1, a, b, 6, 2) %>%
  mutate("{uuid::UUIDgenerate()}" := 1*(a < b))
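The refinement mentioned above, building the column name from the parameters, might look roughly like the sketch below. It is not the function from this answer: compare_named and its lags argument are hypothetical names, and it swaps slider for a plain lapply over the lags so that it needs only dplyr. It assumes each parameter combination is used once, so the generated names are unique.

```r
library(dplyr)

# Hypothetical helper (an assumption, not part of the answer above).
# `lags` is the set of lags to test against, e.g. 2:4.
compare_named <- function(my_df, col1, col2, lag1, col3, col4, lags) {
  lo <- min(lags)
  hi <- max(lags)
  my_df %>%
    mutate(
      # {{ }} inserts the column name as typed, { } inserts a value
      "{{ col1 }}_{{ col2 }}_lag{lag1}_lt_{{ col3 }}_{{ col4 }}_lag{lo}to{hi}" := {
        lhs <- lag({{ col1 }} - {{ col2 }}, lag1)
        rhs <- {{ col3 }} - {{ col4 }}
        # one logical vector per lag, combined with &, then turned into 1/0/NA
        as.numeric(Reduce(`&`, lapply(lags, function(k) lhs < lag(rhs, k))))
      }
    )
}

set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

res <- df %>%
  compare_named(a, c, 1, a, b, 2:4) %>%
  compare_named(a, c, 1, a, b, 2:6)

names(res)
```

Each new column name then records which columns and lags produced it, which should make it easier to trace a column back to its test than a random UUID.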

Upvotes: 0

Gregor Thomas

Reputation: 146090

You can wrap the columns in data.frame, which does not truncate the names so heavily. (The mutate help page notes that the ... arguments can be "a data frame or tibble, to create multiple columns in the output.")

df2 <- df %>%
    mutate(
      data.frame(
           # condition 1
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0), 
           # condition 2
           ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
                  (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
           # condition 3
           ifelse(a < b, 1, 0)
          ),
          .keep = 'none'
        )

c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names
df2
#    df1 df2 df3
# 1   NA  NA   0
# 2   NA  NA   0
# 3   NA  NA   0
# 4   NA  NA   0
# 5    1  NA   0
# 6    0   0   1
# 7    1   1   1
# 8    1   1   0
# 9    0   0   1
# 10   1   1   0
# 11   0   0   0
# 12   0   0   0
# 13   0   0   1
# 14   1   0   1
# 15   1   0   1
# 16   0   0   1
# 17   0   0   0
# 18   0   0   1
# 19   1   1   1
# 20   1   1   1

Upvotes: 1

margusl

Reputation: 17564

You could use unique, random column names, for example UUIDs:

library(dplyr)
set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

df2 <- df %>%
  mutate(# condition 1
    "{uuid::UUIDgenerate()}" := 
      ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)), 1, 0), 
    # condition 2
    "{uuid::UUIDgenerate()}" := 
      ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 4) - lag(b, 4)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 5) - lag(b, 5)) &
             (lag(a, 1) - lag(c, 1)) < (lag(a, 6) - lag(b, 6)), 1, 0),
    # condition 3
    "{uuid::UUIDgenerate()}" := 
      ifelse(a < b, 1, 0),
    .keep = 'none'
  )
str(df2)
#> 'data.frame':    20 obs. of  3 variables:
#>  $ 2175b2b7-511f-471a-94d5-d82116b12137: num  NA NA NA 0 1 1 0 0 1 1 ...
#>  $ 07e353a6-58b9-4c50-9c08-2b7c742cf28b: num  NA NA NA 0 NA NA 0 0 1 1 ...
#>  $ a4fb004b-f498-4da0-b60b-1fbf872670a5: num  0 1 0 0 0 0 1 1 0 1 ...

c_names <- paste('df', rep(1:ncol(df2), 1), sep = '')
colnames(df2) <- c_names

str(df2)
#> 'data.frame':    20 obs. of  3 variables:
#>  $ df1: num  NA NA NA 0 1 1 0 0 1 1 ...
#>  $ df2: num  NA NA NA 0 NA NA 0 0 1 1 ...
#>  $ df3: num  0 1 0 0 0 0 1 1 0 1 ...

Created on 2024-01-30 with reprex v2.0.2
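If the goal is also to map each generated name back to its condition, one possible variation (a sketch, not part of the reprex above) is to keep the conditions as unevaluated expressions, name them by position outside mutate, and splice them in with !!!. The object names conds and key are made up for illustration, and only two shortened conditions are shown.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(20, 100, 1), b = rnorm(20, 100, 1), c = rnorm(20, 100, 1))

# Conditions stored unevaluated; none of them needs a name here
conds <- rlang::exprs(
  ifelse((lag(a, 1) - lag(c, 1)) < (lag(a, 2) - lag(b, 2)) &
         (lag(a, 1) - lag(c, 1)) < (lag(a, 3) - lag(b, 3)), 1, 0),
  ifelse(a < b, 1, 0)
)

# Names are generated from position, outside mutate
names(conds) <- paste0("df", seq_along(conds))

# !!! splices the named expressions into mutate as if typed by hand
df2 <- df %>% mutate(!!!conds, .keep = "none")

# Lookup table from column name back to the text of its condition
key <- data.frame(
  column = names(conds),
  condition = vapply(conds, function(e) paste(deparse(e), collapse = " "), character(1))
)
```

Adding or reordering conditions then only changes the list, and key always reflects the current mapping.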

Upvotes: 1
