Reputation: 25
    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
 1    ppl_100 act2_1734928       0        1            0           0       1                0                 NA
 2    ppl_100 act2_2434093       0        1            0           0       2                0                  0
 3    ppl_100 act2_3404049       0        1            0           0       3                0                  0
 4    ppl_100 act2_3651215       0        1            0           0       4                0                  0
 5    ppl_100 act2_4109017       0        1            0           0       5                0                  0
 6    ppl_100  act2_898576       0        1            0           0       6                0                  0
 7 ppl_100002 act2_1233489       1        1            1           1       1                1                  1
 8 ppl_100002 act2_1623405       1        1            1           2       2                1                  0
 9 ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0
I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that I want to compute it for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:
success_rate_trend <- vector(mode = "numeric", length = nrow(succ_rate_df) - 1)
for (i in 2:nrow(succ_rate_df)) {
  if (succ_rate_df[i, 1] != succ_rate_df[i - 1, 1]) {
    success_rate_trend[i] <- NA
  } else {
    success_rate_trend[i] <- succ_rate_df[i, 8] - succ_rate_df[i - 1, 8]
  }
}
It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?
Upvotes: 1
Views: 131
Reputation: 263352
I'm going to offer an answer based on a data frame version of this data. You should learn to post the output of dput so that objects with special properties, like the tibble you have printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to the groups defined by one or more grouping vectors. I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are worked examples on SO of ave with two grouping factors.
success_rate_trend <- with( dat,
ave( cum_success_rate, people_id, FUN= function(x) c(NA, diff(x) ) ) )
success_rate_trend
[1] NA 0 0 0 0 0 NA 0 NA 0
# not a very interesting result
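For a result with a bit more variation, here is a minimal sketch of the same ave() call on a made-up toy data frame. The values and the name toy are hypothetical, not the asker's data. It also shows the result being stored as a column, which works because ave() returns a vector as long as its input.

```r
# Hypothetical toy data: two people whose cumulative rate changes over time
toy <- data.frame(
  people_id        = factor(c("p1", "p1", "p1", "p2", "p2")),
  cum_success_rate = c(0, 0.5, 0.25, 1, 0.5)
)

# Within each people_id: NA for the first activity, then the change in rate
toy$success_rate_trend <- with(
  toy,
  ave(cum_success_rate, people_id, FUN = function(x) c(NA, diff(x)))
)

toy$success_rate_trend
# [1]    NA  0.50 -0.25    NA -0.50
```

Because ave() groups by the factor rather than by row position, this should not depend on the rows of each people_id being adjacent.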
Upvotes: 0
Reputation: 73315
Use vectorization:
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
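One thing to watch: the vector above has nrow(succ_rate_df) - 1 elements, because diff() drops one. A minimal sketch of padding it with a leading NA so it lines up with the rows, using a made-up toy data frame (hypothetical values, not the asker's data) and assuming the rows of each people_id are contiguous:

```r
# Hypothetical toy data; rows for each people_id are assumed to be adjacent
succ_rate_df <- data.frame(
  people_id        = factor(c("p1", "p1", "p1", "p2", "p2")),
  cum_success_rate = c(0, 0.5, 0.25, 1, 0.5)
)

# Row-to-row differences (length n - 1)
trend <- diff(succ_rate_df$cum_success_rate)

# Blank out differences that straddle two different people
trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_

# Prepend NA for the very first row so the result has one value per row
succ_rate_df$success_rate_trend <- c(NA_real_, trend)

succ_rate_df$success_rate_trend
# [1]    NA  0.50 -0.25    NA -0.50
```

If the rows are not already grouped by people_id, sorting first (for example with order()) would be needed for the boundary test to make sense.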
Note: people_id is a factor variable (fctr). To use diff() we must use as.integer() or unclass() to remove the factor class. Also, the data appears to be a tbl_df from dplyr, where matrix-like indexing does not work as it does for a plain data frame. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead of succ_rate_df[, 1].
Upvotes: 3
Reputation: 60462
You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.
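n <- nrow(succ_rate_df)
id <- succ_rate_df$people_id
rate <- succ_rate_df$cum_success_rate
# TRUE where a row belongs to the same person as the row before it
same_person <- id[2:n] == id[1:(n - 1)]
is_true <- which(same_person)
# Start with NA everywhere so genuine zero differences are not overwritten
success_rate <- rep(NA_real_, n - 1)
success_rate[is_true] <- rate[is_true + 1] - rate[is_true]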
The answer by Zheyuan Li is neater.
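To get a feel for the difference, here is a rough sketch that runs a row-by-row loop and a vectorised version on synthetic data and checks that they agree. The data, the seed, the size and the helper names loop_trend and vec_trend are all made up for illustration, and the timings will vary by machine.

```r
set.seed(1)  # arbitrary seed so the synthetic data is repeatable
n <- 5000    # small enough that the loop finishes quickly

# Synthetic data, sorted so each people_id occupies contiguous rows
df <- data.frame(
  people_id        = factor(sort(sample(paste0("ppl_", 1:500), n, replace = TRUE))),
  cum_success_rate = runif(n)
)

# Row-by-row version, indexing the data frame inside the loop
loop_trend <- function(d) {
  out <- rep(NA_real_, nrow(d))
  for (i in 2:nrow(d)) {
    if (d[i, 1] == d[i - 1, 1]) out[i] <- d[i, 2] - d[i - 1, 2]
  }
  out
}

# Vectorised version: one diff(), then NA at the person boundaries
vec_trend <- function(d) {
  out <- c(NA_real_, diff(d$cum_success_rate))
  out[c(FALSE, diff(as.integer(d$people_id)) != 0)] <- NA_real_
  out
}

t_loop <- system.time(r_loop <- loop_trend(df))
t_vec  <- system.time(r_vec <- vec_trend(df))

# Elapsed seconds for each approach
print(c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]]))

# Both approaches should give the same values
all.equal(r_loop, r_vec)
```

The loop pays the cost of data frame indexing on every iteration, which is what the vectorised version avoids.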
Upvotes: 1