Reputation: 4949
This is a problem that puzzles me. What is the easiest (and most elegant) way to program a function to use in a mutation section that returns several vectors.
I'll give an example. Let's say I have such a function.
f1 = function(x, n){
y1 = rep(NA, length(x))
y2 = rep(NA, length(x))
y3 = rep(NA, length(x))
y4 = rep(NA, length(x))
for(i in (n+1):(length(x)-n)){
idx = (i-n):(i+n)
y1[i] = sin(mean(x[idx])/max(x[idx]))
y2[i] = cos(mean(x[idx])/max(x[idx]))
y3[i] = tan(mean(x[idx])/max(x[idx]))
y4[i] = 1/tan(mean(x[idx])/max(x[idx]))
}
data.frame(
y1 = y1,
y2 = y2,
y3 = y3,
y4 = y4
)
}
Please do not analyze its mathematical sense, this is just an example. As you can see, this function takes one vector and returns four vectors of the same length. If I would like to use this function in the mutation section, the function will be called four times. Unfortunately, with a long input vector, it will take a long time.
Here is an example.
n = 10000
df = tibble(
key = rep(c("a", "b", "c", "d"), n),
val = rep(rnorm(n), 4)
)
f1test = function(df) df %>%
group_by(key) %>%
mutate(
y1 = f1(val, 100) %>% pull(y1),
y2 = f1(val, 100) %>% pull(y2),
y3 = f1(val, 100) %>% pull(y3),
y4 = f1(val, 100) %>% pull(y4)
)
f1test(df)
When looking for a solution, I had a slightly different idea to return all four vectors at once and then separate them somehow. So I created a second example function that does the same computation, it only differs in the way it returns the result.
f2 = function(x, n){
ret = rep(NA, length(x))
for(i in (n+1):(length(x)-n)){
idx = (i-n):(i+n)
ret[i] = paste(
sin(mean(x[idx])/max(x[idx])),
cos(mean(x[idx])/max(x[idx])),
tan(mean(x[idx])/max(x[idx])),
1/tan(mean(x[idx])/max(x[idx])), sep = ";")
}
ret
}
Using such functions may look like this:
f2test = function(df) df %>%
group_by(key) %>%
mutate(ret = f2(val, 100)) %>%
separate(ret, paste0("y", 1:4), sep=";", convert = TRUE)
f2test(df)
You can see right away that the latter should be faster. And it really is for n = 1000, the version with f2 is approximate 2 times faster.
For n = 10000, it is four times faster.
Now my question. Does anyone know a better (and more elegant) way to solve this problem?
Upvotes: 3
Views: 84
Reputation: 887711
We could use split
and make this more faster`
library(purrr)
map_dfr(split(df$val, df$key), f1, n = 15, .id = 'key')
or use rbindlist
library(data.table)
rbindlist(lapply(split(df$val, df$key), f1, n = 15), idcol = 'key')
#OP's function
> system.time(f1test(df))
user system elapsed
3.589 0.170 3.748
> system.time(map_dfr(split(df$val, df$key), f1, n = 15, .id = 'key'))
user system elapsed
0.606 0.029 0.635
> system.time(rbindlist(lapply(split(df$val, df$key), f1, n = 15), idcol = 'key'))
user system elapsed
0.591 0.024 0.614
Upvotes: 1
Reputation: 160687
However you form your function, there's no need to calculate four times with
mutate(
y1 = f1(val, 100) %>% pull(y1),
y2 = f1(val, 100) %>% pull(y2),
y3 = f1(val, 100) %>% pull(y3),
y4 = f1(val, 100) %>% pull(y4)
)
You can mutate
(without assignment) something that returns a frame:
f1 <- function(x) data.frame(mysin = sin(x), mycos = cos(x), mytan = tan(x))
mtcars %>%
mutate(f1(cyl)) %>%
head
# mpg cyl disp hp drat wt qsec vs am gear carb mysin mycos mytan
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 -0.2794 0.9602 -0.291
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 -0.2794 0.9602 -0.291
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 -0.7568 -0.6536 1.158
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 -0.2794 0.9602 -0.291
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.9894 -0.1455 -6.800
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 -0.2794 0.9602 -0.291
If you want to assign names in the process, then you can do
mtcars %>%
mutate(setNames(f1(cyl), c("A","B","C"))) %>%
head
# mpg cyl disp hp drat wt qsec vs am gear carb A B C
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 -0.2794 0.9602 -0.291
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 -0.2794 0.9602 -0.291
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 -0.7568 -0.6536 1.158
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 -0.2794 0.9602 -0.291
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.9894 -0.1455 -6.800
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 -0.2794 0.9602 -0.291
This is obviously a simplification compared with your f1
, my point is that you can natively use all columns in the returned value without having to figure out how to separate them.
This might make your other function
f1test = function(df) df %>%
group_by(key) %>%
mutate(f1(val, 100))
(with or without setNames
).
Upvotes: 2