Speeding up data.frame operations instead of looping

Question

I have the following dataset in R:

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) ) 
library(dplyr)  # provides arrange() and the %>% pipe
dat <- arrange(dat, t)

The dataset is a panel with t as the time variable and id as the subject id. I need to add a new column in which, for each subject, I compute the sum of x times y over the remaining subjects at time t and divide it by the standard deviation of x for those same remaining subjects at time t. This new column should show a zero for the subjects with h == 0.

For example, for subject A at time t == 1, the operation is: (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)). The corresponding operation for subject B at time t == 1 is (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)). For subjects C and D, however, the new column would contain only 0.
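The arithmetic in this example can be checked directly in base R. The following is only a sketch of the leave-one-out idea for the t == 1 group; the helper name loo_stat is made up for illustration and the values are copied from the dataset above.

```r
# Values at t == 1, copied from the dataset above
x <- c(A = 1, B = 6, C = 11, D = 16)
y <- c(A = 51, B = 56, C = 61, D = 66)

# Hypothetical helper: leave-one-out statistic for the subject at position i,
# i.e. the sum of x * y over the other subjects divided by the sd of their x
loo_stat <- function(i) sum(x[-i] * y[-i]) / sd(x[-i])

stat_A <- loo_stat(1)  # (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)) = 2063 / 5 = 412.6
stat_B <- loo_stat(2)  # (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)), about 232.8
```

Negative indexing (x[-i]) is what drops the current subject; multiplying the result by h afterwards would give 0 for subjects C and D.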

What is the fastest way to do this without a loop? I believe the dplyr package is the fastest, but I'm quite new to it and unsure how to use it here. In my attempt I first group by time and then gather the variables, but I receive a warning and the attributes of several variables are dropped. I'm unsure how to select the variables for each group.

library(tidyr)  # gather() comes from tidyr, not dplyr
dat %>%
  group_by(t) %>%
  gather(key, value, -t)
# Warning message:
# attributes are not identical across measure variables;
# they will be dropped

CONDITIONING

How can I add a condition to the previous operation so that, in the following table, the operation is computed only when cond == id? For example, for the first row we would have 0, because subjects B, C and D all have cond values different from their id (their cond is A). For row 6, instead, the operation is (2*52 + 12*62 + 17*67) / sd(c(2,12,17)).

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
dat <- arrange(dat, t)
dat <- data.frame(dat, cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B", "D", "C", "A", "D", "C", "A", "A", "C", "C", "B") )
dat

#    t id  x  y h cond
# 1  1  A  1 51 1    B
# 2  1  B  6 56 1    A
# 3  1  C 11 61 0    A
# 4  1  D 16 66 0    A
# 5  2  A  2 52 1    A
# 6  2  B  7 57 1    B
# 7  2  C 12 62 0    C
# 8  2  D 17 67 0    D
# 9  3  A  3 53 1    A
# 10 3  B  8 58 1    B
# 11 3  C 13 63 0    D
# 12 3  D 18 68 0    C
# 13 4  A  4 54 1    A
# 14 4  B  9 59 1    D
# 15 4  C 14 64 0    C
# 16 4  D 19 69 0    A
# 17 5  A  5 55 1    A
# 18 5  B 10 60 1    C
# 19 5  C 15 65 0    C
# 20 5  D 20 70 0    B

A proposed solution

dat %>% 
 filter(id == cond) %>% 
 group_by(t) %>% 
 mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(), ~ sd(x[-.x])))) %>% 
 bind_rows(dat %>% filter(id != cond))

works only partially, as it creates NaN from multiplying 0 * Inf. Instead I would like to have 0 when the conditions do not apply or when the standard deviation in the denominator is 0. Thank you so much!

akrun · Accepted Answer

After grouping by 't', create the 'new' column in three steps. First, subtract each row's own product x * y from the group total sum(x * y), which leaves the sum of the products for the remaining subjects. Second, divide that by the sd of 'x' with the current row excluded, obtained by looping over the row index (row_number()) and dropping that element with x[-.x]. Finally, multiply by 'h' so that we get 0 where 'h' is 0.

library(tidyverse)
out <- dat %>% 
         group_by(t) %>% 
         mutate(new =  h * ((sum(x *y) - (x * y))/map_dbl(row_number(),
                                                     ~ sd(x[-.x]))))
head(out, 4)
# A tibble: 4 x 6
# Groups:   t [1]
#      t id        x     y     h   new
#1     1 A         1    51     1  413.
#2     1 B         6    56     1  233.
#3     1 C        11    61     0    0 
#4     1 D        16    66     0    0
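For the conditioned case in the question, one possible sketch is shown below. It rests on assumptions that the question does not fully settle: the cond == id test is applied only to the other subjects in the same time group (the current row's own cond is ignored), and any non-finite result (NA, NaN or Inf, for example when fewer than two subjects qualify or their x values are identical) is replaced by 0. The helper loo_ratio is a made-up name, not part of dplyr.

```r
library(dplyr)

# Rebuild the conditioned dataset from the question
dat <- data.frame(t = rep(1:5, 4), id = rep(c("A", "B", "C", "D"), each = 5),
                  x = 1:20, y = 51:70, h = rep(c(1, 0), each = 10))
dat <- arrange(dat, t)
dat$cond <- c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
              "D", "C", "A", "D", "C", "A", "A", "C", "C", "B")

# Hypothetical helper: leave-one-out ratio within one time group.
# keep marks the rows that are allowed to contribute (assumed: id == cond).
loo_ratio <- function(x, y, keep) {
  vapply(seq_along(x), function(i) {
    others <- keep
    others[i] <- FALSE                 # always exclude the current row
    res <- sum(x[others] * y[others]) / sd(x[others])
    if (is.finite(res)) res else 0     # NA, NaN and Inf become 0
  }, numeric(1))
}

out <- dat %>%
  group_by(t) %>%
  mutate(new = h * loo_ratio(x, y, as.character(id) == as.character(cond))) %>%
  ungroup()
```

Under these assumptions row 1 gets 0 (no other subject at t == 1 satisfies cond == id) and row 6 gets (2*52 + 12*62 + 17*67) / sd(c(2,12,17)), as described in the question. Replacing non-finite values before multiplying by h is what avoids the 0 * Inf = NaN problem.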
