Reputation: 688

how to avoid a loop or sapply to speed up a computation

I have a very large dataset that can be described by the following R code:

set.seed(1)
data <- data.frame(id = rep(c(rep(1,5), rep(2,5)),2), h = rep(1:2,10), 
                   d = c(rep(1,10), rep(2,10)), t = rep(c(sample(c(1,2,3), 5, replace = T),
                   sample(c(1,2,3), 5, replace = T)),2), q = runif(20), p = runif(20),
                   b = runif(20), w = rep(c(rep(.1,2), rep(.2,2)),5))

where id, is the subject id and h and d are hours and days. Each id has multiple t. For each type of t (which can be 1, 2, or 3), indexed by t_i, in each h and d, I need to compute the summation over the t = t_i of q_i * pnorm((p_i - b_i)/ w_i). The output should ideally be an additional column to the data.frame data.

What is the fastest way to compute this? My dataset is very large, and I'm afraid that a for loop or sapply will take forever. I was thinking of using the aggregate function but I am not sure it works with expressions other than mean or sum.

---- FIXED A TYPO IN THE DGP ----

Example: Given the data (set.seed(1)) The last column give the output while Sum gives the multiplication q_i * pnorm((p_i - b_i)/ w_i) by row. Rows 1 and 4 have same conditioning variables and thus have the same value in the result column because this would be equal to q_1 * pnorm((p_1 - b_1)/ w_1) +q_5 * pnorm((p_5 - b_5)/ w_5)``. I'M SORRY IF I WAS NOT CLEAR.

     id h d t          q         p          b   w          Sum       result
1.1   1 1 1 1 0.20597457 0.4820801 0.47761962 0.1 1.066513e-01 8.764928e-01
2.2   1 2 1 2 0.17655675 0.5995658 0.86120948 0.1 7.843788e-04 7.843788e-04
2.3   1 1 1 2 0.68702285 0.4935413 0.43809711 0.2 4.185307e-01 4.185307e-01
3.4   1 2 1 3 0.38410372 0.1862176 0.24479728 0.2 1.478031e-01 1.478031e-01
1.5   1 1 1 1 0.76984142 0.8273733 0.07067905 0.1 7.698414e-01 8.764928e-01
3.6   2 2 1 3 0.49769924 0.6684667 0.09946616 0.1 4.976992e-01 4.976992e-01
3.7   2 1 1 3 0.71761851 0.7942399 0.31627171 0.2 7.115705e-01 7.115705e-01
2.8   2 2 1 2 0.99190609 0.1079436 0.51863426 0.2 1.985233e-02 1.985233e-02
2.9   2 1 1 2 0.38003518 0.7237109 0.66200508 0.1 2.779585e-01 2.779585e-01
1.10  2 2 1 1 0.77744522 0.4112744 0.40683019 0.1 4.025021e-01 4.025021e-01
1.11  1 1 2 1 0.93470523 0.8209463 0.91287592 0.2 3.018017e-01 3.745763e-01
2.12  1 2 2 2 0.21214252 0.6470602 0.29360337 0.2 2.039559e-01 2.039559e-01
2.13  1 1 2 2 0.65167377 0.7829328 0.45906573 0.1 6.512825e-01 6.512825e-01
3.14  1 2 2 3 0.12555510 0.5530363 0.33239467 0.1 1.238378e-01 1.238378e-01
1.15  1 1 2 1 0.26722067 0.5297196 0.65087047 0.2 7.277459e-02 3.745763e-01
3.16  2 2 2 3 0.38611409 0.7893562 0.25801678 0.2 3.845907e-01 3.845907e-01
3.17  2 1 2 3 0.01339033 0.0233312 0.47854525 0.1 3.555325e-08 3.555325e-08
2.18  2 2 2 2 0.38238796 0.4772301 0.76631067 0.1 7.346728e-04 7.346728e-04
2.19  2 1 2 2 0.86969085 0.7323137 0.08424691 0.2 8.691717e-01 8.691717e-01
1.20  2 2 2 1 0.34034900 0.6927316 0.87532133 0.2 6.147884e-02 6.147884e-02

Upvotes: 0

Answers (3)

Jilber Urbina

Reputation: 61154

Here's an R base approach using ave. I am not sure if this solves your problem, but here's an attempt for you to figure out how to write the expression:

data$Aggregated.Sum <- ave(data[, c("q", "p", "b", "w")], 
                            data[,c("id", "h", "d", "t")], 
                            FUN=function(x){
                              sum(x$q * pnorm((x$p - x$b)/ x$w))
                            })[, 1]

Giving the following output:

   id h d t          q         p          b   w   Aggregated.Sum
1   1 1 1 1 0.20597457 0.4820801 0.47761962 0.1 0.87649276750008
2   1 2 1 2 0.17655675 0.5995658 0.86120948 0.1 0.00078437884830
3   1 1 1 2 0.68702285 0.4935413 0.43809711 0.2 0.41853074006211
4   1 2 1 3 0.38410372 0.1862176 0.24479728 0.2 0.14780307785185
5   1 1 1 1 0.76984142 0.8273733 0.07067905 0.1 0.87649276750008
6   2 2 1 3 0.49769924 0.6684667 0.09946616 0.1 0.49769923892396
7   2 1 1 3 0.71761851 0.7942399 0.31627171 0.2 0.71157053468214
8   2 2 1 2 0.99190609 0.1079436 0.51863426 0.2 0.01985232872822
9   2 1 1 2 0.38003518 0.7237109 0.66200508 0.1 0.27795848822967
10  2 2 1 1 0.77744522 0.4112744 0.40683019 0.1 0.40250214883360
11  1 1 2 1 0.93470523 0.8209463 0.91287592 0.2 0.37457632092225
12  1 2 2 2 0.21214252 0.6470602 0.29360337 0.2 0.20395587133630
13  1 1 2 2 0.65167377 0.7829328 0.45906573 0.1 0.65128247418102
14  1 2 2 3 0.12555510 0.5530363 0.33239467 0.1 0.12383782495223
15  1 1 2 1 0.26722067 0.5297196 0.65087047 0.2 0.37457632092225
16  2 2 2 3 0.38611409 0.7893562 0.25801678 0.2 0.38459067414776
17  2 1 2 3 0.01339033 0.0233312 0.47854525 0.1 0.00000003555325
18  2 2 2 2 0.38238796 0.4772301 0.76631067 0.1 0.00073467275547
19  2 1 2 2 0.86969085 0.7323137 0.08424691 0.2 0.86917168501551
20  2 2 2 1 0.34034900 0.6927316 0.87532133 0.2 0.06147884477362

Upvotes: 1

Tung

Reputation: 28371

We group_by(t, id, h, d) then calculate the Sum for each row then finally calculate the total sum result for rows that have the same t, id, h, d

library(tidyverse)

set.seed(1)
options(scipen = 999)

dat <- data.frame(id = rep(c(rep(1,5), rep(2,5)),2), h = rep(1:2,5), 
                   d = c(rep(1,10), rep(2,10)), 
                   t = rep(c(sample(c(1,2,3), 5, replace = T),
                   sample(c(1,2,3), 5, replace = T)),2), q = runif(20), p = runif(20),
                   b = runif(20), w = rep(c(rep(.1,2), rep(.2,2)),5))
dat

#>    id h d t          q         p          b   w
#> 1   1 1 1 1 0.20597457 0.4820801 0.47761962 0.1
#> 2   1 2 1 2 0.17655675 0.5995658 0.86120948 0.1
#> 3   1 1 1 2 0.68702285 0.4935413 0.43809711 0.2
#> 4   1 2 1 3 0.38410372 0.1862176 0.24479728 0.2
#> 5   1 1 1 1 0.76984142 0.8273733 0.07067905 0.1
#> 6   2 2 1 3 0.49769924 0.6684667 0.09946616 0.1
#> 7   2 1 1 3 0.71761851 0.7942399 0.31627171 0.2
#> 8   2 2 1 2 0.99190609 0.1079436 0.51863426 0.2
#> 9   2 1 1 2 0.38003518 0.7237109 0.66200508 0.1
#> 10  2 2 1 1 0.77744522 0.4112744 0.40683019 0.1
#> 11  1 1 2 1 0.93470523 0.8209463 0.91287592 0.2
#> 12  1 2 2 2 0.21214252 0.6470602 0.29360337 0.2
#> 13  1 1 2 2 0.65167377 0.7829328 0.45906573 0.1
#> 14  1 2 2 3 0.12555510 0.5530363 0.33239467 0.1
#> 15  1 1 2 1 0.26722067 0.5297196 0.65087047 0.2
#> 16  2 2 2 3 0.38611409 0.7893562 0.25801678 0.2
#> 17  2 1 2 3 0.01339033 0.0233312 0.47854525 0.1
#> 18  2 2 2 2 0.38238796 0.4772301 0.76631067 0.1
#> 19  2 1 2 2 0.86969085 0.7323137 0.08424691 0.2
#> 20  2 2 2 1 0.34034900 0.6927316 0.87532133 0.2

dat %>% 
  group_by(t, id, h, d) %>% 
  mutate(Sum = q * pnorm((p - b)/w)) %>% 
  mutate(result = sum(Sum))

#> # A tibble: 20 x 10
#> # Groups:   t, id, h, d [18]
#>       id     h     d     t      q      p      b     w          Sum  result
#>    <dbl> <int> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>        <dbl>   <dbl>
#>  1    1.     1    1.    1. 0.206  0.482  0.478  0.100 0.107        8.76e-1
#>  2    1.     2    1.    2. 0.177  0.600  0.861  0.100 0.000784     7.84e-4
#>  3    1.     1    1.    2. 0.687  0.494  0.438  0.200 0.419        4.19e-1
#>  4    1.     2    1.    3. 0.384  0.186  0.245  0.200 0.148        1.48e-1
#>  5    1.     1    1.    1. 0.770  0.827  0.0707 0.100 0.770        8.76e-1
#>  6    2.     2    1.    3. 0.498  0.668  0.0995 0.100 0.498        4.98e-1
#>  7    2.     1    1.    3. 0.718  0.794  0.316  0.200 0.712        7.12e-1
#>  8    2.     2    1.    2. 0.992  0.108  0.519  0.200 0.0199       1.99e-2
#>  9    2.     1    1.    2. 0.380  0.724  0.662  0.100 0.278        2.78e-1
#> 10    2.     2    1.    1. 0.777  0.411  0.407  0.100 0.403        4.03e-1
#> 11    1.     1    2.    1. 0.935  0.821  0.913  0.200 0.302        3.75e-1
#> 12    1.     2    2.    2. 0.212  0.647  0.294  0.200 0.204        2.04e-1
#> 13    1.     1    2.    2. 0.652  0.783  0.459  0.100 0.651        6.51e-1
#> 14    1.     2    2.    3. 0.126  0.553  0.332  0.100 0.124        1.24e-1
#> 15    1.     1    2.    1. 0.267  0.530  0.651  0.200 0.0728       3.75e-1
#> 16    2.     2    2.    3. 0.386  0.789  0.258  0.200 0.385        3.85e-1
#> 17    2.     1    2.    3. 0.0134 0.0233 0.479  0.100 0.0000000356 3.56e-8
#> 18    2.     2    2.    2. 0.382  0.477  0.766  0.100 0.000735     7.35e-4
#> 19    2.     1    2.    2. 0.870  0.732  0.0842 0.200 0.869        8.69e-1
#> 20    2.     2    2.    1. 0.340  0.693  0.875  0.200 0.0615       6.15e-2

Created on 2018-03-16 by the reprex package (v0.2.0).

Upvotes: 2

Mohammad Tanvir Ahamed

Reputation: 97

I am not sure if this is what you expected

data <- split(data, data$id)
data <- lapply(data, function(i) {split(i,i$t)})
data <- unlist(data,recursive=FALSE)

data1 <- lapply(data, function(j)
  {
  res <- j[,c("q","p","b","w")]
  j$result <- apply(res,1,function(i) i["q"] *pnorm( (i["p"] - i["b"]))/ i["w"] )
  j
  })

data1 <- do.call(rbind, data1)

Upvotes: 1

how to avoid a loop or sapply to speed up a computation

Answers (3)

Related Questions