lethalSinger

Reputation: 616

Applying a custom function repeatedly to same dataframe using purrr

Suppose I have a dataframe as follows:

df <- data.frame(
  alpha = 0:20,
  beta = 30:50,
  gamma = 100:120
)

I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:

newfun <- function(var = NULL) {
  newname <- paste0(var, "NEW")
  df[[newname]] <- df[[var]]/100
  return(df)
}

I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:

df <- newfun("alpha") 
df <- newfun("beta") 
df <- newfun("gamma")

Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:

df <- data.frame(
  alpha = 0:20,
  beta = 30:50,
  gamma = 100:120
)
library(purrr)  # provides map() and re-exports the %>% pipe
out <- c("alpha", "beta", "gamma") %>%
  map(function(x) newfun(x))

How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?

Upvotes: 2

Views: 414

Answers (3)

www

Reputation: 39174

Based on the way you wrote your function, a for loop that repeatedly assigns the result of newfun back to df works pretty well.

vars <- names(df)

for (i in vars) {
  df <- newfun(i)
}
df
#    alpha beta gamma alphaNEW betaNEW gammaNEW
# 1      0   30   100     0.00    0.30     1.00
# 2      1   31   101     0.01    0.31     1.01
# 3      2   32   102     0.02    0.32     1.02
# 4      3   33   103     0.03    0.33     1.03
# 5      4   34   104     0.04    0.34     1.04
# 6      5   35   105     0.05    0.35     1.05
# 7      6   36   106     0.06    0.36     1.06
# 8      7   37   107     0.07    0.37     1.07
# 9      8   38   108     0.08    0.38     1.08
# 10     9   39   109     0.09    0.39     1.09
# 11    10   40   110     0.10    0.40     1.10
# 12    11   41   111     0.11    0.41     1.11
# 13    12   42   112     0.12    0.42     1.12
# 14    13   43   113     0.13    0.43     1.13
# 15    14   44   114     0.14    0.44     1.14
# 16    15   45   115     0.15    0.45     1.15
# 17    16   46   116     0.16    0.46     1.16
# 18    17   47   117     0.17    0.47     1.17
# 19    18   48   118     0.18    0.48     1.18
# 20    19   49   119     0.19    0.49     1.19
# 21    20   50   120     0.20    0.50     1.20
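
The same build-up can also be written as a fold instead of an explicit loop. The sketch below is an illustration rather than part of the original answer: it assumes a hypothetical variant of newfun(), here called add_scaled(), that takes the data frame as its first argument instead of looking up df globally. Base R's Reduce() then passes the result of each call into the next; purrr::reduce() with .init is the purrr counterpart.

```r
# Hypothetical variant of newfun(): the data frame is an explicit argument
add_scaled <- function(data, var) {
  data[[paste0(var, "NEW")]] <- data[[var]] / 100
  data
}

df <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)
cols <- c("alpha", "beta", "gamma")

# Start from df and feed each intermediate result into the next call
out <- Reduce(add_scaled, cols, df)
names(out)
# [1] "alpha"    "beta"     "gamma"    "alphaNEW" "betaNEW"  "gammaNEW"

# purrr equivalent (assumes purrr is installed):
# out <- purrr::reduce(cols, add_scaled, .init = df)
```

Because the frame is passed in and returned, nothing depends on a global object named df, and the original df is left unchanged unless you assign out back to it.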

Upvotes: 0

Cole

Reputation: 11255

An alternative approach is to change your function to only return a vector:

newfun2 <- function(var = NULL) {
  df[[var]] / 100
}

newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
#[15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20

Then, using base R, you can use lapply() to loop through your vector of column names and assign all the new columns at once:

cols <- c("alpha", "beta", "gamma")

df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
# or, with purrr:
# df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df

   alpha beta gamma alphaNEW betaNEW gammaNEW
1      0   30   100     0.00    0.30     1.00
2      1   31   101     0.01    0.31     1.01
3      2   32   102     0.02    0.32     1.02
4      3   33   103     0.03    0.33     1.03
5      4   34   104     0.04    0.34     1.04
6      5   35   105     0.05    0.35     1.05
7      6   36   106     0.06    0.36     1.06
8      7   37   107     0.07    0.37     1.07
9      8   38   108     0.08    0.38     1.08
10     9   39   109     0.09    0.39     1.09
11    10   40   110     0.10    0.40     1.10
12    11   41   111     0.11    0.41     1.11
13    12   42   112     0.12    0.42     1.12
14    13   43   113     0.13    0.43     1.13
15    14   44   114     0.14    0.44     1.14
16    15   45   115     0.15    0.45     1.15
17    16   46   116     0.16    0.46     1.16
18    17   47   117     0.17    0.47     1.17
19    18   48   118     0.18    0.48     1.18
20    19   49   119     0.19    0.49     1.19
21    20   50   120     0.20    0.50     1.20

Upvotes: 1

r2evans

Reputation: 161085

Writing the function to reach outside of its scope to find some df is risky and will eventually bite you, especially when you see something like:

df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable

You will get this error when it doesn't find your variable named df, and instead finds the base function named df. Two morals from this discovery:

  1. While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially from base); and
  2. Scope-breach is sloppy: it renders a workflow unreproducible and often makes problems or changes difficult to troubleshoot.
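
To make the scoping point concrete, here is a minimal sketch using the question's newfun() on a cut-down three-row frame (the smaller frame is an assumption for brevity). The assignment inside the function only changes a local copy of df, which is why the result has to be assigned back each time and why map() returns a list of separate frames, each with a single new column.

```r
df <- data.frame(alpha = 0:2)

newfun <- function(var = NULL) {
  newname <- paste0(var, "NEW")
  df[[newname]] <- df[[var]] / 100  # reads the global df, writes to a local copy
  return(df)
}

result <- newfun("alpha")

names(df)      # the global frame is untouched
# [1] "alpha"
names(result)  # only the returned copy carries the new column
# [1] "alpha"    "alphaNEW"
```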

To remedy this, and since your function relies on knowing what the old/new variable names are or should be, I think pmap or base R Map may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".

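dat <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)  # the question's data, named dat instead of df
cols <- c("alpha", "beta", "gamma")
myfunc <- function(x) x/100
setNames(lapply(dat[,cols], myfunc), paste0("new", cols))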
# $newalpha
#  [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
#  [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
#  [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20

From here, we just need to column-bind (cbind) it:

cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
#    alpha beta gamma newalpha newbeta newgamma
# 1      0   30   100     0.00    0.30     1.00
# 2      1   31   101     0.01    0.31     1.01
# 3      2   32   102     0.02    0.32     1.02
# 4      3   33   103     0.03    0.33     1.03
# 5      4   34   104     0.04    0.34     1.04
# ...

Special note: if you plan on doing this iteratively (repeatedly), be careful. Adding rows to a frame one at a time is a known bad idea, and I suspect (without proof at the moment) that doing the same with columns is also bad. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and scale poorly.)

additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
#  $ :List of 3
#   ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
#   ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
#   ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
#  $ :List of 3
#   ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
#   ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
#   ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
#  $ :List of 3
#   ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
#   ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
#   ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...

do.call(cbind, c(list(dat), additions))
#    alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1      0   30   100      0.00     0.30      1.00      0.00     0.30      1.00      0.00     0.30      1.00
# 2      1   31   101      0.01     0.31      1.01      0.01     0.31      1.01      0.01     0.31      1.01
# 3      2   32   102      0.02     0.32      1.02      0.02     0.32      1.02      0.02     0.32      1.02
# 4      3   33   103      0.03     0.33      1.03      0.03     0.33      1.03      0.03     0.33      1.03
# 5      4   34   104      0.04     0.34      1.04      0.04     0.34      1.04      0.04     0.34      1.04
# 6      5   35   105      0.05     0.35      1.05      0.05     0.35      1.05      0.05     0.35      1.05
# ...
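
As a small sanity check on the do.call() approach above, the following sketch rebuilds the same additions list and compares a column-by-column loop of cbind() calls against the single do.call(cbind, ...) call. It only confirms that both routes produce the same columns; it does not measure memory use or speed, so the performance remark remains a suspicion rather than a demonstrated result.

```r
dat <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)
cols <- c("alpha", "beta", "gamma")
myfunc <- function(x) x / 100

additions <- lapply(1:3, function(i) {
  setNames(lapply(dat[cols], myfunc), paste0("new", i, cols))
})

# Iterative version: one cbind() call per batch of new columns
iter <- dat
for (a in additions) {
  iter <- cbind(iter, a)
}

# Single-call version: everything bound at once
once <- do.call(cbind, c(list(dat), additions))

identical(names(iter), names(once))
dim(once)
# [1] 21 12
```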

Upvotes: 2
