Randall Helms
Randall Helms

Reputation: 859

Including a for loop within a dplyr chain

I am in the process of cleaning some data on customer behavior and I want to identify and filter out the second time that customers perform a certain event. As this is one element of a larger data manipulation project, ideally I would like to add this as three lines to a longer dplyr chain.

There are two pieces of logic involved here:

  1. Identifying the first row of the second event: The previous row has an end_pct value higher than 0.9, the current row has an end_pct value value below 0.2, and the previous and current rows have the same customer_id
  2. Identifying the subsequent rows: If the previous row has ‘yes’ for second_event and the previous and current rows have the same customer_id
  3. Filter out the second time events

This is the dplyr code I have used for each of these operations:

mutate(second_time = ifelse(lag(pct_completed,1) > 0.9 & pct_completed < .2 & lag(customer_id,1) == customer_id, "yes", "no")) %>%
mutate(second_time = ifelse(lag(second_time,1)=="yes"&lag(customer_id,1) == customer_id, "yes", second_time)) %>%
filter(second_time = "no")

If I run the second mutate function above in a normal dplyr chain it will only mark the second row in the second event as "yes", not any subsequent ones. Hence I believe that I need to write a for loop to catch all of the remaining events in the second phase.

I've attached some sample data here to illustrate the problem.

If anyone has any good suggestions about how to write a for loop that can be incorporated into a dplyr chain, I would greatly appreciate it.

Upvotes: 1

Views: 1450

Answers (1)

Axeman
Axeman

Reputation: 35187

I propose an easier solution.

  • We group by customer (since everything is calculated within customer).
  • We detect a change of event, similar to your current logic and set a flag.
  • We do a cumulative sum of flags to count which event we are in.
  • We filter on when the cumulative sum is 2.

.

df %>% 
  group_by(customer_id) %>% 
  mutate(flag = row_number() > 1 &        #to avoid NA's on the first rows of each customer
           (pct_completed < 0.2) & 
           (lag(pct_completed) > 0.9),
         event_nr = cumsum(flag) + 1) %>% 
  filter(event_nr == 2)

Upvotes: 2

Related Questions