Reputation: 1897
I'm running a script against ~1M records that computes an aggregation for each day in the dataset and binds the result onto a full dataset, and it's taking a really long time. About 90% of the time in each loop iteration is spent in the filter() call.
Is there any way to speed this up?
for (i in seq_along(date_range)) {
  todays_results <- state_change %>%
    filter(state_date <= date_range[i] & date_range[i] <= next_state_dt)
  todays_results <- todays_results %>%
    group_by(state_date, marketing_category) %>%
    count()
  daily_state_counts <- rbind(daily_state_counts, todays_results)
}
Here's a sample of the data:
contact_id state_date marketing_category next_state_dt next_mc
<fct> <date> <fct> <date> <fct>
1 003U000000Dyl6Z 2014-01-07 dormant NA NA
2 003U000000DylwT 2014-01-07 dormant NA NA
3 0030P00001ypDpt 2014-01-07 dormant 2014-02-12 recruiter
4 0030P00001ypFZW 2014-01-07 dormant 2014-07-30 recruiter
5 003U000000Io07V 2014-01-07 dormant 2017-02-13 recruiter
6 0030P00001ypE7S 2014-01-07 recruiter 2015-07-08 dormant
7 0030P00001yOP1K 2014-01-07 dormant 2014-05-15 recruiter
8 003U000000LNi2C 2014-01-07 dormant NA NA
9 003U000000DyjAb 2014-01-07 dormant 2014-11-20 recruiter
10 003U000001Z6yb6 2014-01-07 dormant 2016-01-15 recruiter
11 003U000000Oc9xR 2014-01-07 dormant 2015-10-21 inactive
Upvotes: 0
Views: 247
Reputation: 11728
An easy way to avoid growing objects is to use lapply() and then do.call().
Try
lapply(seq_along(date_range), function(i) {
  state_change %>%
    filter(state_date <= date_range[i] & date_range[i] <= next_state_dt) %>%
    group_by(state_date, marketing_category) %>%
    count()
}) %>%
  do.call("rbind", .)
Upvotes: 1