Reputation: 688
I have the following dataset in R
dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
require(dplyr)
dat <- arrange(dat, t)
The dataset is a panel with t as the time variable and id as the subject id. I need to add a new column containing, for each subject, the sum of x times y over the remaining subjects at time t, divided by the standard deviation of x for those remaining subjects at time t. This new column should show a zero for the subjects with h == 0.
For example, for subject A at time t == 1, the operation is: (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)). A similar operation for subject B at time t == 1 is (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)). However, for subjects C and D, the new column would contain only a 0.
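As a quick sanity check of the arithmetic above, the two expressions can be evaluated directly in base R. The vector names below are illustrative only and are not part of the dataset.

```r
# Subject A at t == 1: the remaining subjects are B, C and D
x_others_A <- c(6, 11, 16)
y_others_A <- c(56, 61, 66)
new_A <- sum(x_others_A * y_others_A) / sd(x_others_A)

# Subject B at t == 1: the remaining subjects are A, C and D
x_others_B <- c(1, 11, 16)
y_others_B <- c(51, 61, 66)
new_B <- sum(x_others_B * y_others_B) / sd(x_others_B)

print(c(A = new_A, B = new_B))
```

These are the values the new column should hold for subjects A and B at t == 1; subjects C and D would get 0 because their h is 0.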
What is the fastest way to do this without a loop? I believe the dplyr package is the fastest, but I'm quite new to it and unsure how to use it here. In my attempt I first group by time and then gather the variables, but I receive a warning and several variables are dropped. I'm unsure how to select the variables for each group.
library(tidyr) # gather() comes from tidyr, not dplyr
dat %>%
group_by(t) %>%
gather(key, value, -t)
# Warning message:
# attributes are not identical across measure variables;
# they will be dropped
CONDITIONING
How can I add a condition to the previous operation so that, in the following table, the operation only uses subjects for which cond == id? For example, for the first row we would have 0, because subjects B, C and D all have cond values different from their id (their cond is A). For row 6, instead, the operation is (2*52 + 12*62 + 17*67) / sd(c(2,12,17)).
dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
dat <- arrange(dat, t)
dat <- data.frame(dat, cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B", "D", "C", "A", "D", "C", "A", "A", "C", "C", "B") )
dat
# t id x y h cond
# 1 1 A 1 51 1 B
# 2 1 B 6 56 1 A
# 3 1 C 11 61 0 A
# 4 1 D 16 66 0 A
# 5 2 A 2 52 1 A
# 6 2 B 7 57 1 B
# 7 2 C 12 62 0 C
# 8 2 D 17 67 0 D
# 9 3 A 3 53 1 A
# 10 3 B 8 58 1 B
# 11 3 C 13 63 0 D
# 12 3 D 18 68 0 C
# 13 4 A 4 54 1 A
# 14 4 B 9 59 1 D
# 15 4 C 14 64 0 C
# 16 4 D 19 69 0 A
# 17 5 A 5 55 1 A
# 18 5 B 10 60 1 C
# 19 5 C 15 65 0 C
# 20 5 D 20 70 0 B
A proposed solution
dat %>%
filter(id == cond) %>%
group_by(t) %>%
mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(), ~ sd(x[-.x])))) %>%
bind_rows(dat %>% filter(id != cond))
works, but only partially, as it creates NaN from multiplying 0 * Inf. Instead I would like to have 0 when the conditions do not apply or when the standard deviation in the denominator is 0. Thank you so much!
Upvotes: 0
Views: 67
Reputation: 887008
After grouping by 't', create the 'new' column in three steps. First, subtract the current row's product of 'x' and 'y' from the group sum of those products, which leaves the sum over the other rows. Second, divide by the sd of 'x' with the current row excluded, obtained by looping over the row index (row_number()) and dropping that element. Third, multiply by 'h' so that we get 0 where 'h' is 0.
library(tidyverse)
out <- dat %>%
group_by(t) %>%
mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(),
~ sd(x[-.x]))))
head(out, 4)
# A tibble: 4 x 6
# Groups: t [1]
# t id x y h new
# <dbl> <fct> <int> <int> <dbl> <dbl>
#1 1 A 1 51 1 413.
#2 1 B 6 56 1 233.
#3 1 C 11 61 0 0
#4 1 D 16 66 0 0
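For the conditional variant and the NaN issue raised in the question, one possible approach is sketched below. It rests on an assumption read off the question's examples: a row contributes to another subject's computation only when its own id equals its cond. The helper name loo_ratio is hypothetical, not a function from any package. It returns 0 whenever the leave-one-out standard deviation is NA (fewer than two eligible rows remain) or 0, so multiplying by h can never produce 0 * Inf.

```r
library(dplyr)

# Rebuild the data from the question, including the cond column
dat <- data.frame(
  t  = rep(1:5, 4),
  id = rep(c("A", "B", "C", "D"), each = 5),
  x  = 1:20,
  y  = 51:70,
  h  = rep(c(1, 0), each = 10)
)
dat <- arrange(dat, t)
dat$cond <- c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
              "D", "C", "A", "D", "C", "A", "A", "C", "C", "B")

# Hypothetical helper: leave-one-out ratio within one time period.
# keep flags the rows that are allowed to contribute (assumed: id == cond).
loo_ratio <- function(x, y, keep) {
  vapply(seq_along(x), function(i) {
    others <- keep & seq_along(x) != i   # eligible rows other than row i
    den <- sd(x[others])                 # NA when fewer than two rows remain
    if (is.na(den) || den == 0) 0 else sum(x[others] * y[others]) / den
  }, numeric(1))
}

out <- dat %>%
  group_by(t) %>%
  mutate(new = h * loo_ratio(x, y, id == cond)) %>%
  ungroup()

print(out, n = 20)
```

Because the guard sits inside the helper, the result is always finite before it is multiplied by h. The helper still iterates over rows within each group, much like the map_dbl call above, so it is a sketch aimed at correctness rather than speed.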
Upvotes: 2