Reputation: 45
A dataset consists of couples and singles. Each row represents an individual, and each unique family is identified by the variable family_nr.
I'd like to create a new variable, result, which is a function of the value of the partner of each individual (if there is one).
This can be done using group_by and sum. When the number of rows is high, however, this seems to be rather slow (probably due to calling sum() on many groups).
library(tidyverse)
family_nr <- c(1,1,2,2,3,3,4)
value_1 <- c(1:7)
df <- data.frame(family_nr, value_1)
df <- df %>% group_by(family_nr) %>% mutate(result = (sum(value_1)-value_1)*5 )
Can anyone suggest a faster alternative?
Upvotes: 1
Views: 30
Reputation: 887213
We could use the data.table method to assign (:=) by reference:
library(data.table)
setDT(df)[, result := 5*(sum(value_1) - value_1), family_nr]
Or use ave from base R:
df$result <- with(df, ave(value_1, family_nr, FUN = function(x) 5 * (sum(x) - x)))
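For reference, here is a minimal base R sketch of the ave approach run on the toy data from the question. The values in the final comment are worked out by hand from the family sums (3, 7, 11 and 7), so treat it as an illustration rather than captured console output.

```r
# Toy data from the question: three couples and one single
family_nr <- c(1, 1, 2, 2, 3, 3, 4)
value_1 <- 1:7
df <- data.frame(family_nr, value_1)

# ave() applies FUN within each family and returns a vector in row order
df$result <- with(df, ave(value_1, family_nr,
                          FUN = function(x) 5 * (sum(x) - x)))

df$result
# 10  5 20 15 30 25  0
```

The single in family 4 gets 0 because the family sum equals that person's own value.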
Benchmarks on 2 million rows (1 million families of two):
set.seed(24)
df1 <- data.frame(family_nr = rep(1:1e6, each = 2),
                  value_1 = rnorm(1e6 * 2))
df2 <- copy(df1)
system.time({
  df1 %>%
    group_by(family_nr) %>%
    mutate(result = 5 * (sum(value_1) - value_1))
})
# user system elapsed
# 33.81 0.09 35.56
system.time({
setDT(df2)[, result := 5*(sum(value_1) - value_1), family_nr][]
})
# user system elapsed
# 1.46 0.00 1.47
system.time({
with(df1, ave(value_1, family_nr, FUN = function(x) 5*(sum(x)- x)))
})
# user system elapsed
# 4.92 0.17 5.15
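Another base R option, not timed here, might be to compute the family totals once with rowsum() and index them back onto the rows, which avoids calling an R function once per group. A minimal sketch on the toy data from the question; grp and totals are just illustrative names, not anything from the code above:

```r
# Toy data from the question
family_nr <- c(1, 1, 2, 2, 3, 3, 4)
value_1 <- 1:7
df <- data.frame(family_nr, value_1)

# Map each family id to a dense integer index 1..G (hypothetical helper name)
grp <- match(df$family_nr, unique(df$family_nr))

# One total per family; with reorder = TRUE, row i holds the total of group i
totals <- rowsum(df$value_1, grp, reorder = TRUE)

# Look up each row's family total, then subtract the row's own value
df$result <- 5 * (unname(totals[grp, 1]) - df$value_1)
```

Whether this beats the data.table version on a given machine would need to be measured with system.time() in the same way as the timings above.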
Upvotes: 1