Geco

Reputation: 45

Paired data: Variable which is a function of the value of the partner

A dataset consists of couples and singles. Each row represents an individual, and each family is identified by the variable family_nr.

I'd like to create a new variable result which is a function of the value of the partner of each individual (if there is one).

This can be done using group_by and sum. With a large number of rows, however, this is rather slow (probably because sum() is called on many small groups).

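library(tidyverse)

# One row per individual; family 4 is a single
family_nr <- c(1, 1, 2, 2, 3, 3, 4)
value_1 <- 1:7
df <- data.frame(family_nr, value_1)

# Partner's value = family total minus own value
df <- df %>%
  group_by(family_nr) %>%
  mutate(result = (sum(value_1) - value_1) * 5)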

Can anyone suggest a faster alternative?

Upvotes: 1

Views: 30

Answers (1)

akrun

Reputation: 887213

We could use data.table, which assigns (:=) by reference:

library(data.table)
setDT(df)[, result := 5*(sum(value_1) - value_1), family_nr]

Or use ave from base R:

df$result <- with(df, ave(value_1, family_nr, FUN = function(x) 5 * (sum(x) - x)))
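Another base R possibility, offered only as a sketch and not part of the timings in this answer: compute the per-family totals once with rowsum() and index them back onto the rows. The names g and group_sum are illustrative, and whether this is faster than ave on your data would need to be measured.

```r
# Toy data from the question
df <- data.frame(family_nr = c(1, 1, 2, 2, 3, 3, 4), value_1 = 1:7)

# Map each family_nr to a dense integer id 1..k (order of first appearance)
g <- match(df$family_nr, unique(df$family_nr))

# rowsum() returns one row per group, sorted by group id, so row i is group i
group_sum <- unname(rowsum(df$value_1, g)[g, 1])

# Partner's value = family total minus own value; singles get 0
df$result <- 5 * (group_sum - df$value_1)
df$result
# 10  5 20 15 30 25  0
```

Note that, as with the other approaches here, a single ends up with 0 rather than NA, because the family total equals the individual's own value.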

Benchmarks

set.seed(24)
df1 <- data.frame(family_nr = rep(1:1e6, each =2),
             value_1 = rnorm(1e6*2))

df2 <- copy(df1)

system.time({
  df1 %>%
    group_by(family_nr) %>%
    mutate(result = 5 * (sum(value_1) - value_1))
})
#  user  system elapsed 
# 33.81    0.09   35.56 

system.time({
  setDT(df2)[, result := 5 * (sum(value_1) - value_1), family_nr][]
})
#  user  system elapsed 
#  1.46    0.00    1.47 

system.time({
  with(df1, ave(value_1, family_nr, FUN = function(x) 5 * (sum(x) - x)))
})
#  user  system elapsed 
#  4.92    0.17    5.15 
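Timings are only meaningful if the approaches return the same values. A small sanity check on the question's toy data might look like the following; it uses base R only, and expected is worked out by hand (family totals 3, 7, 11, 7), so it is an illustration rather than part of the benchmark.

```r
# Toy data from the question
df <- data.frame(family_nr = c(1, 1, 2, 2, 3, 3, 4), value_1 = 1:7)

# Hand-computed: family total minus own value, times 5
expected <- c(10, 5, 20, 15, 30, 25, 0)

# The base R approach from this answer
res_ave <- with(df, ave(value_1, family_nr, FUN = function(x) 5 * (sum(x) - x)))

# Stops with an error if the two disagree
stopifnot(isTRUE(all.equal(as.numeric(res_ave), expected)))
```

The same comparison can be repeated against the result column produced by the dplyr and data.table versions.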

Upvotes: 1
