Reputation: 688
I have a very large dataset that can be described by the following R
code:
set.seed(1)
data <- data.frame(id = rep(c(rep(1,5), rep(2,5)),2), h = rep(1:2,10),
d = c(rep(1,10), rep(2,10)), t = rep(c(sample(c(1,2,3), 5, replace = T),
sample(c(1,2,3), 5, replace = T)),2), q = runif(20), p = runif(20),
b = runif(20), w = rep(c(rep(.1,2), rep(.2,2)),5))
where id
, is the subject id and h
and d
are hours and days. Each id
has multiple t
. For each type of t
(which can be 1, 2, or 3), indexed by t_i
, in each h
and d
, I need to compute the summation over the t = t_i
of q_i * pnorm((p_i - b_i)/ w_i)
. The output should ideally be an additional column to the data.frame
data
.
What is the fastest way to compute this? My dataset is very large, and I'm afraid that a for loop
or sapply
will take forever. I was thinking of using the aggregate
function but I am not sure it works with expressions other than mean
or sum
.
---- FIXED A TYPO IN THE DGP ----
Example: Given the data (set.seed(1)
) The last column give the output while Sum
gives the multiplication q_i * pnorm((p_i - b_i)/ w_i)
by row. Rows 1 and 4 have same conditioning variables and thus have the same value in the result
column because this would be equal to q_1 * pnorm((p_1 - b_1)/ w_1) +
q_5 * pnorm((p_5 - b_5)/ w_5)``. I'M SORRY IF I WAS NOT CLEAR.
id h d t q p b w Sum result
1.1 1 1 1 1 0.20597457 0.4820801 0.47761962 0.1 1.066513e-01 8.764928e-01
2.2 1 2 1 2 0.17655675 0.5995658 0.86120948 0.1 7.843788e-04 7.843788e-04
2.3 1 1 1 2 0.68702285 0.4935413 0.43809711 0.2 4.185307e-01 4.185307e-01
3.4 1 2 1 3 0.38410372 0.1862176 0.24479728 0.2 1.478031e-01 1.478031e-01
1.5 1 1 1 1 0.76984142 0.8273733 0.07067905 0.1 7.698414e-01 8.764928e-01
3.6 2 2 1 3 0.49769924 0.6684667 0.09946616 0.1 4.976992e-01 4.976992e-01
3.7 2 1 1 3 0.71761851 0.7942399 0.31627171 0.2 7.115705e-01 7.115705e-01
2.8 2 2 1 2 0.99190609 0.1079436 0.51863426 0.2 1.985233e-02 1.985233e-02
2.9 2 1 1 2 0.38003518 0.7237109 0.66200508 0.1 2.779585e-01 2.779585e-01
1.10 2 2 1 1 0.77744522 0.4112744 0.40683019 0.1 4.025021e-01 4.025021e-01
1.11 1 1 2 1 0.93470523 0.8209463 0.91287592 0.2 3.018017e-01 3.745763e-01
2.12 1 2 2 2 0.21214252 0.6470602 0.29360337 0.2 2.039559e-01 2.039559e-01
2.13 1 1 2 2 0.65167377 0.7829328 0.45906573 0.1 6.512825e-01 6.512825e-01
3.14 1 2 2 3 0.12555510 0.5530363 0.33239467 0.1 1.238378e-01 1.238378e-01
1.15 1 1 2 1 0.26722067 0.5297196 0.65087047 0.2 7.277459e-02 3.745763e-01
3.16 2 2 2 3 0.38611409 0.7893562 0.25801678 0.2 3.845907e-01 3.845907e-01
3.17 2 1 2 3 0.01339033 0.0233312 0.47854525 0.1 3.555325e-08 3.555325e-08
2.18 2 2 2 2 0.38238796 0.4772301 0.76631067 0.1 7.346728e-04 7.346728e-04
2.19 2 1 2 2 0.86969085 0.7323137 0.08424691 0.2 8.691717e-01 8.691717e-01
1.20 2 2 2 1 0.34034900 0.6927316 0.87532133 0.2 6.147884e-02 6.147884e-02
Upvotes: 0
Views: 75
Reputation: 61154
Here's an R base approach using ave
. I am not sure if this solves your problem, but here's an attempt for you to figure out how to write the expression:
data$Aggregated.Sum <- ave(data[, c("q", "p", "b", "w")],
data[,c("id", "h", "d", "t")],
FUN=function(x){
sum(x$q * pnorm((x$p - x$b)/ x$w))
})[, 1]
Giving the following output:
id h d t q p b w Aggregated.Sum
1 1 1 1 1 0.20597457 0.4820801 0.47761962 0.1 0.87649276750008
2 1 2 1 2 0.17655675 0.5995658 0.86120948 0.1 0.00078437884830
3 1 1 1 2 0.68702285 0.4935413 0.43809711 0.2 0.41853074006211
4 1 2 1 3 0.38410372 0.1862176 0.24479728 0.2 0.14780307785185
5 1 1 1 1 0.76984142 0.8273733 0.07067905 0.1 0.87649276750008
6 2 2 1 3 0.49769924 0.6684667 0.09946616 0.1 0.49769923892396
7 2 1 1 3 0.71761851 0.7942399 0.31627171 0.2 0.71157053468214
8 2 2 1 2 0.99190609 0.1079436 0.51863426 0.2 0.01985232872822
9 2 1 1 2 0.38003518 0.7237109 0.66200508 0.1 0.27795848822967
10 2 2 1 1 0.77744522 0.4112744 0.40683019 0.1 0.40250214883360
11 1 1 2 1 0.93470523 0.8209463 0.91287592 0.2 0.37457632092225
12 1 2 2 2 0.21214252 0.6470602 0.29360337 0.2 0.20395587133630
13 1 1 2 2 0.65167377 0.7829328 0.45906573 0.1 0.65128247418102
14 1 2 2 3 0.12555510 0.5530363 0.33239467 0.1 0.12383782495223
15 1 1 2 1 0.26722067 0.5297196 0.65087047 0.2 0.37457632092225
16 2 2 2 3 0.38611409 0.7893562 0.25801678 0.2 0.38459067414776
17 2 1 2 3 0.01339033 0.0233312 0.47854525 0.1 0.00000003555325
18 2 2 2 2 0.38238796 0.4772301 0.76631067 0.1 0.00073467275547
19 2 1 2 2 0.86969085 0.7323137 0.08424691 0.2 0.86917168501551
20 2 2 2 1 0.34034900 0.6927316 0.87532133 0.2 0.06147884477362
Upvotes: 1
Reputation: 28371
We group_by(t, id, h, d)
then calculate the Sum
for each row then finally calculate the total sum result
for rows that have the same t, id, h, d
library(tidyverse)
set.seed(1)
options(scipen = 999)
dat <- data.frame(id = rep(c(rep(1,5), rep(2,5)),2), h = rep(1:2,5),
d = c(rep(1,10), rep(2,10)),
t = rep(c(sample(c(1,2,3), 5, replace = T),
sample(c(1,2,3), 5, replace = T)),2), q = runif(20), p = runif(20),
b = runif(20), w = rep(c(rep(.1,2), rep(.2,2)),5))
dat
#> id h d t q p b w
#> 1 1 1 1 1 0.20597457 0.4820801 0.47761962 0.1
#> 2 1 2 1 2 0.17655675 0.5995658 0.86120948 0.1
#> 3 1 1 1 2 0.68702285 0.4935413 0.43809711 0.2
#> 4 1 2 1 3 0.38410372 0.1862176 0.24479728 0.2
#> 5 1 1 1 1 0.76984142 0.8273733 0.07067905 0.1
#> 6 2 2 1 3 0.49769924 0.6684667 0.09946616 0.1
#> 7 2 1 1 3 0.71761851 0.7942399 0.31627171 0.2
#> 8 2 2 1 2 0.99190609 0.1079436 0.51863426 0.2
#> 9 2 1 1 2 0.38003518 0.7237109 0.66200508 0.1
#> 10 2 2 1 1 0.77744522 0.4112744 0.40683019 0.1
#> 11 1 1 2 1 0.93470523 0.8209463 0.91287592 0.2
#> 12 1 2 2 2 0.21214252 0.6470602 0.29360337 0.2
#> 13 1 1 2 2 0.65167377 0.7829328 0.45906573 0.1
#> 14 1 2 2 3 0.12555510 0.5530363 0.33239467 0.1
#> 15 1 1 2 1 0.26722067 0.5297196 0.65087047 0.2
#> 16 2 2 2 3 0.38611409 0.7893562 0.25801678 0.2
#> 17 2 1 2 3 0.01339033 0.0233312 0.47854525 0.1
#> 18 2 2 2 2 0.38238796 0.4772301 0.76631067 0.1
#> 19 2 1 2 2 0.86969085 0.7323137 0.08424691 0.2
#> 20 2 2 2 1 0.34034900 0.6927316 0.87532133 0.2
dat %>%
group_by(t, id, h, d) %>%
mutate(Sum = q * pnorm((p - b)/w)) %>%
mutate(result = sum(Sum))
#> # A tibble: 20 x 10
#> # Groups: t, id, h, d [18]
#> id h d t q p b w Sum result
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1. 1 1. 1. 0.206 0.482 0.478 0.100 0.107 8.76e-1
#> 2 1. 2 1. 2. 0.177 0.600 0.861 0.100 0.000784 7.84e-4
#> 3 1. 1 1. 2. 0.687 0.494 0.438 0.200 0.419 4.19e-1
#> 4 1. 2 1. 3. 0.384 0.186 0.245 0.200 0.148 1.48e-1
#> 5 1. 1 1. 1. 0.770 0.827 0.0707 0.100 0.770 8.76e-1
#> 6 2. 2 1. 3. 0.498 0.668 0.0995 0.100 0.498 4.98e-1
#> 7 2. 1 1. 3. 0.718 0.794 0.316 0.200 0.712 7.12e-1
#> 8 2. 2 1. 2. 0.992 0.108 0.519 0.200 0.0199 1.99e-2
#> 9 2. 1 1. 2. 0.380 0.724 0.662 0.100 0.278 2.78e-1
#> 10 2. 2 1. 1. 0.777 0.411 0.407 0.100 0.403 4.03e-1
#> 11 1. 1 2. 1. 0.935 0.821 0.913 0.200 0.302 3.75e-1
#> 12 1. 2 2. 2. 0.212 0.647 0.294 0.200 0.204 2.04e-1
#> 13 1. 1 2. 2. 0.652 0.783 0.459 0.100 0.651 6.51e-1
#> 14 1. 2 2. 3. 0.126 0.553 0.332 0.100 0.124 1.24e-1
#> 15 1. 1 2. 1. 0.267 0.530 0.651 0.200 0.0728 3.75e-1
#> 16 2. 2 2. 3. 0.386 0.789 0.258 0.200 0.385 3.85e-1
#> 17 2. 1 2. 3. 0.0134 0.0233 0.479 0.100 0.0000000356 3.56e-8
#> 18 2. 2 2. 2. 0.382 0.477 0.766 0.100 0.000735 7.35e-4
#> 19 2. 1 2. 2. 0.870 0.732 0.0842 0.200 0.869 8.69e-1
#> 20 2. 2 2. 1. 0.340 0.693 0.875 0.200 0.0615 6.15e-2
Created on 2018-03-16 by the reprex package (v0.2.0).
Upvotes: 2
Reputation: 97
I am not sure if this is what you expected
data <- split(data, data$id)
data <- lapply(data, function(i) {split(i,i$t)})
data <- unlist(data,recursive=FALSE)
data1 <- lapply(data, function(j)
{
res <- j[,c("q","p","b","w")]
j$result <- apply(res,1,function(i) i["q"] *pnorm( (i["p"] - i["b"]))/ i["w"] )
j
})
data1 <- do.call(rbind, data1)
Upvotes: 1