Reputation: 3722
I'm trying to find specific patterns in groups within a dataframe. Take the following dataframe of orders placed, by email, the person placing the order, and the amount.
set.seed(123)
dates = sample(seq(as.Date("2017-01-01"),as.Date("2017-12-31"), by = 'day'), 2000, replace = TRUE)
amount <- sample(-50:100, 2000, replace = TRUE)
placedorder <- sample(c(NA, NA, NA, "jeff", "alex", "steve", "amy", "john", "larry", "ryan"), 2000, replace = TRUE)
email <- sample(paste0(1:200, "@gmail.com"), 2000, replace = TRUE)
df <- data.frame(dates, email, placedorder, amount, stringsAsFactors = FALSE)
I would like to find groups by email address where all three of these occur, and where they occur in date succession:
placedorder
is NAplacedorder
is NAplaceholder
is not NAexample:
# A tibble: 10 x 4
# Groups: email [1]
dates email placedorder amount
<date> <chr> <chr> <int>
1 2017-02-10 [email protected] <NA> 68 # satisfies #1
2 2017-02-27 [email protected] <NA> -21 # satisfies #2
3 2017-03-07 [email protected] jeff -9
4 2017-03-09 [email protected] steve -93
5 2017-03-14 [email protected] steve 22 # satisfies #3
6 2017-03-18 [email protected] steve -81
7 2017-04-28 [email protected] <NA> -12
8 2017-05-06 [email protected] <NA> 4
9 2017-06-03 [email protected] jeff -40
10 2017-06-03 [email protected] larry 13 #(this also satisfies #3)
the above example is all within the same email
, and each of the 3 filters happen one after another with respect to time.
My attempt, I think finds where these occur, but doesn't take into account the dates and this happening in succession. Also, actually filtering it down to just these orders, would be best.
df2 <- df %>%
group_by(email) %>%
filter(any(is.na(placedorder) & amount > 0),
any(is.na(placedorder) & amount < 0),
any(!is.na(placedorder) & amount > 0)
)
Thanks in advance!
Upvotes: 1
Views: 941
Reputation: 3045
Assuming my interpretation of "first order" and "second order" is correct, here's one way to set up the commands in dplyr
library(dplyr)
df %>% group_by(email) %>%
arrange(email, dates) %>%
mutate(order_num=1:n()) %>%
#An order was placed with a positive value and where placedorder is NA
filter((is.na(placedorder) & amount>0) |
# An order was placed after the first one, with a negative value and where placedorder is NA
(is.na(placedorder) & amount <0 & order_num >1) |
# An order was placed after the second order, with a positive value and where placeholder is not NA
(!is.na(placedorder) & amount >0 & order_num > 2)
)
UPDATE: Thank you very much for clarifying the question. Basically you want to "watch the customer state" and start tracing next type of event only after you have observed the previous type. Here's a (slightly verbose, but hopefully understandable) attempt at tracing the customer transitioning through these "states":
df %>% group_by(email) %>%
arrange(email, dates) %>%
mutate(event_1=ifelse(is.na(placedorder) & amount>0, 1, 0),
post_event_1=cumsum(event_1),
# only if at least one event_1 has happened
event_2=ifelse(post_event_1>=1 & is.na(placedorder) & amount <0, 1,0),
post_event_2=cumsum(event_2),
# only if at least one event_2 has happened
event_3=ifelse(post_event_2>=1 & !is.na(placedorder) & amount >0, 1, 0)) %>%
# only interested in first occurance of event_1 and event_2 preceding event_3
filter((event_1==1 & post_event_1==1) | (event_2==1 & post_event_2==1) | event_3 ==1)
# A tibble: 390 x 9
# Groups: email [165]
dates email placedorder amount event_1 post_event_1 event_2 post_event_2 event_3
<date> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017-01-29 [email protected] <NA> 76 1 1 0 0 0
2 2017-05-25 [email protected] <NA> -37 0 1 1 1 0
3 2017-08-14 [email protected] steve 53 0 1 0 2 1
4 2017-12-21 [email protected] john 92 0 2 0 4 1
5 2017-02-08 [email protected] <NA> 89 1 1 0 0 0
6 2017-01-16 [email protected] <NA> 40 1 1 0 0 0
7 2017-03-18 [email protected] <NA> 20 1 1 0 0 0
8 2017-05-16 [email protected] <NA> -45 0 2 1 1 0
9 2017-06-08 [email protected] larry 46 0 2 0 2 1
10 2017-07-22 [email protected] john 93 0 3 0 2 1
# ... with 380 more rows
There are some "uncompleted chains" e.g. when customer progressed to state_1 but not further. Not sure if you want to drop those (for that you could count observations per email and drop those that have less than 3 records).
Upvotes: 2