Matt W.
Matt W.

Reputation: 3722

time series group by and filter in dplyr r

I'm trying to find specific patterns in groups within a dataframe. Take the following dataframe of orders placed, by email, the person placing the order, and the amount.

set.seed(123)
dates = sample(seq(as.Date("2017-01-01"),as.Date("2017-12-31"), by = 'day'), 2000, replace = TRUE)
amount <- sample(-50:100, 2000, replace = TRUE)
placedorder <- sample(c(NA, NA, NA, "jeff", "alex", "steve", "amy", "john", "larry", "ryan"), 2000, replace = TRUE)
email <- sample(paste0(1:200, "@gmail.com"), 2000, replace = TRUE)
df <- data.frame(dates, email, placedorder, amount, stringsAsFactors = FALSE)

I would like to find groups by email address where all three of these occur, and where they occur in date succession:

  1. An order was placed with a positive value and where placedorder is NA
  2. An order was placed after the above step, with a negative value and where placedorder is NA
  3. An order was placed after the step 2, with a positive value and where placeholder is not NA

example:

# A tibble: 10 x 4
# Groups:   email [1]
        dates       email placedorder amount
       <date>       <chr>       <chr>  <int>
 1 2017-02-10 [email protected]        <NA>     68 # satisfies #1
 2 2017-02-27 [email protected]        <NA>    -21 # satisfies #2
 3 2017-03-07 [email protected]        jeff     -9
 4 2017-03-09 [email protected]       steve    -93
 5 2017-03-14 [email protected]       steve     22 # satisfies #3
 6 2017-03-18 [email protected]       steve    -81
 7 2017-04-28 [email protected]        <NA>    -12
 8 2017-05-06 [email protected]        <NA>      4
 9 2017-06-03 [email protected]        jeff    -40
10 2017-06-03 [email protected]       larry     13 #(this also satisfies #3)

the above example is all within the same email, and each of the 3 filters happen one after another with respect to time.

My attempt, I think finds where these occur, but doesn't take into account the dates and this happening in succession. Also, actually filtering it down to just these orders, would be best.

df2 <- df %>%
  group_by(email) %>%
  filter(any(is.na(placedorder) & amount > 0),
         any(is.na(placedorder) & amount < 0),
         any(!is.na(placedorder) & amount > 0)
  )

Thanks in advance!

Upvotes: 1

Views: 941

Answers (1)

dmi3kno
dmi3kno

Reputation: 3045

Assuming my interpretation of "first order" and "second order" is correct, here's one way to set up the commands in dplyr

library(dplyr)

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(order_num=1:n()) %>% 
  #An order was placed with a positive value and where placedorder is NA
  filter((is.na(placedorder) & amount>0) |
  # An order was placed after the first one, with a negative value and where placedorder is NA
         (is.na(placedorder) & amount <0 & order_num >1) |
  # An order was placed after the second order, with a positive value and where placeholder is not NA
        (!is.na(placedorder) & amount >0 & order_num > 2)
    )

UPDATE: Thank you very much for clarifying the question. Basically you want to "watch the customer state" and start tracing next type of event only after you have observed the previous type. Here's a (slightly verbose, but hopefully understandable) attempt at tracing the customer transitioning through these "states":

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(event_1=ifelse(is.na(placedorder) & amount>0, 1, 0),
         post_event_1=cumsum(event_1),
         # only if at least one event_1 has happened
         event_2=ifelse(post_event_1>=1 & is.na(placedorder) & amount <0, 1,0),
         post_event_2=cumsum(event_2),
         # only if at least one event_2 has happened
         event_3=ifelse(post_event_2>=1 & !is.na(placedorder) & amount >0, 1, 0)) %>% 
  # only interested in first occurance of event_1 and event_2 preceding event_3
  filter((event_1==1 & post_event_1==1) | (event_2==1 & post_event_2==1) | event_3 ==1)

# A tibble: 390 x 9
# Groups:   email [165]
        dates         email placedorder amount event_1 post_event_1 event_2 post_event_2 event_3
       <date>         <chr>       <chr>  <int>   <dbl>        <dbl>   <dbl>        <dbl>   <dbl>
 1 2017-01-29   [email protected]        <NA>     76       1            1       0            0       0
 2 2017-05-25   [email protected]        <NA>    -37       0            1       1            1       0
 3 2017-08-14   [email protected]       steve     53       0            1       0            2       1
 4 2017-12-21   [email protected]        john     92       0            2       0            4       1
 5 2017-02-08 [email protected]        <NA>     89       1            1       0            0       0
 6 2017-01-16 [email protected]        <NA>     40       1            1       0            0       0
 7 2017-03-18 [email protected]        <NA>     20       1            1       0            0       0
 8 2017-05-16 [email protected]        <NA>    -45       0            2       1            1       0
 9 2017-06-08 [email protected]       larry     46       0            2       0            2       1
10 2017-07-22 [email protected]        john     93       0            3       0            2       1
# ... with 380 more rows

There are some "uncompleted chains" e.g. when customer progressed to state_1 but not further. Not sure if you want to drop those (for that you could count observations per email and drop those that have less than 3 records).

Upvotes: 2

Related Questions