Abhi

Reputation: 25

For loop in R takes forever to run

    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
1     ppl_100 act2_1734928       0        1            0           0       1                0                 NA
2     ppl_100 act2_2434093       0        1            0           0       2                0                  0
3     ppl_100 act2_3404049       0        1            0           0       3                0                  0
4     ppl_100 act2_3651215       0        1            0           0       4                0                  0
5     ppl_100 act2_4109017       0        1            0           0       5                0                  0
6     ppl_100  act2_898576       0        1            0           0       6                0                  0
7  ppl_100002 act2_1233489       1        1            1           1       1                1                  1
8  ppl_100002 act2_1623405       1        1            1           2       2                1                  0
9  ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0

I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that it should be computed for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend per people_id. I'm using the code below:

success_rate_trend <- vector(mode = "numeric", length = nrow(succ_rate_df) - 1)
for (i in 2:nrow(succ_rate_df)) {
    if (succ_rate_df[i, 1] != succ_rate_df[i - 1, 1]) {
        success_rate_trend[i] <- NA
    } else {
        success_rate_trend[i] <- succ_rate_df[i, 8] - succ_rate_df[i - 1, 8]
    }
}

It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?

Upvotes: 1

Views: 131

Answers (3)

IRTFM

Reputation: 263352

I'm going to offer an answer based on a data-frame version of this data. You SHOULD learn to post the output of dput, so that objects with special properties, like the tibble you have printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to grouping vector(s). I only used one grouping factor, although your English-language description of the problem suggests you wanted two. There are worked SO examples of grouping with ave by two factors.

 success_rate_trend <- with( dat, 
                    ave( cum_success_rate, people_id, FUN= function(x) c(NA, diff(x) ) ) )

 success_rate_trend
 [1] NA  0  0  0  0  0 NA  0 NA  0
 # not a very interesting result
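For reference, the same pattern extends to two grouping factors by passing both to ave. A minimal sketch with made-up columns g1, g2, and v (not from the question's data):

```r
# ave() accepts multiple grouping vectors after x; FUN is applied
# within each combination of the groups
dat2 <- data.frame(g1 = c("a", "a", "a", "b", "b"),
                   g2 = c("x", "x", "y", "y", "y"),
                   v  = c(1, 3, 6, 10, 15))

# first row of each (g1, g2) group gets NA, the rest get the running difference
trend <- with(dat2, ave(v, g1, g2, FUN = function(x) c(NA, diff(x))))
trend
# [1] NA  2 NA NA  5
```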

Upvotes: 0

Zheyuan Li

Reputation: 73315

Use vectorization:

success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_

Note:

  1. people_id is a factor variable (fctr). To use diff() we must apply as.integer() or unclass() first to strip the factor class.
  2. You do not have an ordinary data frame but a tbl_df from dplyr, so matrix-like indexing does not work. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead of succ_rate_df[, 1].
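To illustrate the first note, here is a small self-contained sketch with made-up IDs, showing how as.integer() exposes the underlying factor codes so diff() can flag the rows where the person changes:

```r
# diff() refuses factors, so convert to the underlying integer codes first
ids <- factor(c("ppl_1", "ppl_1", "ppl_2", "ppl_2", "ppl_2"))

# as.integer(ids) gives the codes 1 1 2 2 2; a nonzero diff marks a boundary
group_change <- diff(as.integer(ids)) != 0
group_change
# [1] FALSE  TRUE FALSE FALSE
```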

Upvotes: 3

csgillespie

Reputation: 60462

You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.

n = nrow(succ_rate_df)
# TRUE where consecutive rows belong to the same person
same_person = succ_rate_df$people_id[2:n] == succ_rate_df$people_id[1:(n - 1)]

success_rate_trend = succ_rate_df$cum_success_rate[2:n] -
    succ_rate_df$cum_success_rate[1:(n - 1)]
success_rate_trend[!same_person] = NA

The answer by Zheyuan Li is neater.

Upvotes: 1
