Reputation: 859
I am in the process of cleaning some data on customer behavior and I want to identify and filter out the second time that customers perform a certain event. As this is one element of a larger data manipulation project, ideally I would like to add this as three lines to a longer dplyr chain.
There are two pieces of logic involved here:
This is the dplyr code I have used for each of these operations:
mutate(second_time = ifelse(lag(pct_completed,1) > 0.9 & pct_completed < .2 & lag(customer_id,1) == customer_id, "yes", "no")) %>%
mutate(second_time = ifelse(lag(second_time,1)=="yes"&lag(customer_id,1) == customer_id, "yes", second_time)) %>%
filter(second_time = "no")
If I run the second mutate function above in a normal dplyr chain it will only mark the second row in the second event as "yes", not any subsequent ones. Hence I believe that I need to write a for loop to catch all of the remaining events in the second phase.
I've attached some sample data here to illustrate the problem.
If anyone has any good suggestions about how to write a for loop that can be incorporated into a dplyr chain, I would greatly appreciate it.
Upvotes: 1
Views: 1450
Reputation: 35187
I propose an easier solution.
.
df %>%
group_by(customer_id) %>%
mutate(flag = row_number() > 1 & #to avoid NA's on the first rows of each customer
(pct_completed < 0.2) &
(lag(pct_completed) > 0.9),
event_nr = cumsum(flag) + 1) %>%
filter(event_nr == 2)
Upvotes: 2