Marek Fiołka
Marek Fiołka

Reputation: 4949

A function that returns several vectors in the mutate section

This is a problem that puzzles me. What is the easiest (and most elegant) way to program a function to use in a mutation section that returns several vectors.

I'll give an example. Let's say I have such a function.

f1 = function(x, n){
  y1 = rep(NA, length(x))
  y2 = rep(NA, length(x))
  y3 = rep(NA, length(x))
  y4 = rep(NA, length(x))
  for(i in (n+1):(length(x)-n)){
    idx = (i-n):(i+n)
    y1[i] = sin(mean(x[idx])/max(x[idx]))
    y2[i] = cos(mean(x[idx])/max(x[idx]))
    y3[i] = tan(mean(x[idx])/max(x[idx]))
    y4[i] = 1/tan(mean(x[idx])/max(x[idx]))
  }
  data.frame(
    y1 = y1,
    y2 = y2,
    y3 = y3,
    y4 = y4
  )
}

Please do not analyze its mathematical sense, this is just an example. As you can see, this function takes one vector and returns four vectors of the same length. If I would like to use this function in the mutation section, the function will be called four times. Unfortunately, with a long input vector, it will take a long time.

Here is an example.

n = 10000
df = tibble(
  key = rep(c("a", "b", "c", "d"), n),
  val = rep(rnorm(n), 4)
)

f1test = function(df) df %>% 
  group_by(key) %>% 
  mutate(
    y1 = f1(val, 100) %>% pull(y1),
    y2 = f1(val, 100) %>% pull(y2),
    y3 = f1(val, 100) %>% pull(y3),
    y4 = f1(val, 100) %>% pull(y4)
  )
f1test(df)

When looking for a solution, I had a slightly different idea to return all four vectors at once and then separate them somehow. So I created a second example function that does the same computation, it only differs in the way it returns the result.

f2 = function(x, n){
  ret = rep(NA, length(x))
  for(i in (n+1):(length(x)-n)){
    idx = (i-n):(i+n)
    ret[i] = paste(
      sin(mean(x[idx])/max(x[idx])),
      cos(mean(x[idx])/max(x[idx])),
      tan(mean(x[idx])/max(x[idx])),
      1/tan(mean(x[idx])/max(x[idx])), sep = ";")
  }
  ret
}

Using such functions may look like this:

f2test = function(df) df %>% 
  group_by(key) %>% 
  mutate(ret = f2(val, 100)) %>% 
  separate(ret, paste0("y", 1:4), sep=";", convert = TRUE)  
f2test(df)

You can see right away that the latter should be faster. And it really is for n = 1000, the version with f2 is approximate 2 times faster.

For n = 10000, it is four times faster.

Now my question. Does anyone know a better (and more elegant) way to solve this problem?

Upvotes: 3

Views: 84

Answers (2)

akrun
akrun

Reputation: 887711

We could use split and make this more faster`

library(purrr)
map_dfr(split(df$val, df$key), f1, n = 15, .id = 'key')

or use rbindlist

library(data.table)
rbindlist(lapply(split(df$val, df$key), f1, n = 15), idcol = 'key')

Benchmarks

#OP's function
> system.time(f1test(df))
   user  system elapsed 
  3.589   0.170   3.748 
> system.time(map_dfr(split(df$val, df$key), f1, n = 15, .id = 'key'))
   user  system elapsed 
  0.606   0.029   0.635 
> system.time(rbindlist(lapply(split(df$val, df$key), f1, n = 15), idcol = 'key'))
   user  system elapsed 
  0.591   0.024   0.614 

Upvotes: 1

r2evans
r2evans

Reputation: 160687

However you form your function, there's no need to calculate four times with

mutate(
    y1 = f1(val, 100) %>% pull(y1),
    y2 = f1(val, 100) %>% pull(y2),
    y3 = f1(val, 100) %>% pull(y3),
    y4 = f1(val, 100) %>% pull(y4)
  )

You can mutate (without assignment) something that returns a frame:

f1 <- function(x) data.frame(mysin = sin(x), mycos = cos(x), mytan = tan(x))
mtcars %>%
  mutate(f1(cyl)) %>%
  head
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb   mysin   mycos  mytan
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 -0.2794  0.9602 -0.291
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 -0.2794  0.9602 -0.291
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 -0.7568 -0.6536  1.158
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 -0.2794  0.9602 -0.291
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  0.9894 -0.1455 -6.800
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 -0.2794  0.9602 -0.291

If you want to assign names in the process, then you can do

mtcars %>%
  mutate(setNames(f1(cyl), c("A","B","C"))) %>%
  head
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb       A       B      C
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 -0.2794  0.9602 -0.291
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 -0.2794  0.9602 -0.291
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 -0.7568 -0.6536  1.158
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 -0.2794  0.9602 -0.291
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  0.9894 -0.1455 -6.800
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 -0.2794  0.9602 -0.291

This is obviously a simplification compared with your f1, my point is that you can natively use all columns in the returned value without having to figure out how to separate them.

This might make your other function

f1test = function(df) df %>% 
  group_by(key) %>% 
  mutate(f1(val, 100))

(with or without setNames).

Upvotes: 2

Related Questions