time series group by and filter in dplyr r

Question

I'm trying to find specific patterns in groups within a dataframe. Take the following dataframe of orders placed, by email, the person placing the order, and the amount.

set.seed(123)
dates = sample(seq(as.Date("2017-01-01"),as.Date("2017-12-31"), by = 'day'), 2000, replace = TRUE)
amount <- sample(-50:100, 2000, replace = TRUE)
placedorder <- sample(c(NA, NA, NA, "jeff", "alex", "steve", "amy", "john", "larry", "ryan"), 2000, replace = TRUE)
email <- sample(paste0(1:200, "@gmail.com"), 2000, replace = TRUE)
df <- data.frame(dates, email, placedorder, amount, stringsAsFactors = FALSE)

I would like to find groups by email address where all three of these occur, and where they occur in date succession:

An order was placed with a positive value and where placedorder is NA
An order was placed after the above step, with a negative value and where placedorder is NA
An order was placed after the step 2, with a positive value and where placeholder is not NA

example:

# A tibble: 10 x 4
# Groups:   email [1]
        dates       email placedorder amount
                       
 1 2017-02-10 1@gmail.com             68 # satisfies #1
 2 2017-02-27 1@gmail.com            -21 # satisfies #2
 3 2017-03-07 1@gmail.com        jeff     -9
 4 2017-03-09 1@gmail.com       steve    -93
 5 2017-03-14 1@gmail.com       steve     22 # satisfies #3
 6 2017-03-18 1@gmail.com       steve    -81
 7 2017-04-28 1@gmail.com            -12
 8 2017-05-06 1@gmail.com              4
 9 2017-06-03 1@gmail.com        jeff    -40
10 2017-06-03 1@gmail.com       larry     13 #(this also satisfies #3)

the above example is all within the same email, and each of the 3 filters happen one after another with respect to time.

My attempt, I think finds where these occur, but doesn't take into account the dates and this happening in succession. Also, actually filtering it down to just these orders, would be best.

df2 <- df %>%
  group_by(email) %>%
  filter(any(is.na(placedorder) & amount > 0),
         any(is.na(placedorder) & amount < 0),
         any(!is.na(placedorder) & amount > 0)
  )

Thanks in advance!

dmi3kno · Accepted Answer

Assuming my interpretation of "first order" and "second order" is correct, here's one way to set up the commands in dplyr

library(dplyr)

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(order_num=1:n()) %>% 
  #An order was placed with a positive value and where placedorder is NA
  filter((is.na(placedorder) & amount>0) |
  # An order was placed after the first one, with a negative value and where placedorder is NA
         (is.na(placedorder) & amount <0 & order_num >1) |
  # An order was placed after the second order, with a positive value and where placeholder is not NA
        (!is.na(placedorder) & amount >0 & order_num > 2)
    )

UPDATE: Thank you very much for clarifying the question. Basically you want to "watch the customer state" and start tracing next type of event only after you have observed the previous type. Here's a (slightly verbose, but hopefully understandable) attempt at tracing the customer transitioning through these "states":

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(event_1=ifelse(is.na(placedorder) & amount>0, 1, 0),
         post_event_1=cumsum(event_1),
         # only if at least one event_1 has happened
         event_2=ifelse(post_event_1>=1 & is.na(placedorder) & amount <0, 1,0),
         post_event_2=cumsum(event_2),
         # only if at least one event_2 has happened
         event_3=ifelse(post_event_2>=1 & !is.na(placedorder) & amount >0, 1, 0)) %>% 
  # only interested in first occurance of event_1 and event_2 preceding event_3
  filter((event_1==1 & post_event_1==1) | (event_2==1 & post_event_2==1) | event_3 ==1)

# A tibble: 390 x 9
# Groups:   email [165]
        dates         email placedorder amount event_1 post_event_1 event_2 post_event_2 event_3
                                                  
 1 2017-01-29   1@gmail.com             76       1            1       0            0       0
 2 2017-05-25   1@gmail.com            -37       0            1       1            1       0
 3 2017-08-14   1@gmail.com       steve     53       0            1       0            2       1
 4 2017-12-21   1@gmail.com        john     92       0            2       0            4       1
 5 2017-02-08 100@gmail.com             89       1            1       0            0       0
 6 2017-01-16 101@gmail.com             40       1            1       0            0       0
 7 2017-03-18 102@gmail.com             20       1            1       0            0       0
 8 2017-05-16 102@gmail.com            -45       0            2       1            1       0
 9 2017-06-08 102@gmail.com       larry     46       0            2       0            2       1
10 2017-07-22 102@gmail.com        john     93       0            3       0            2       1
# ... with 380 more rows

There are some "uncompleted chains" e.g. when customer progressed to state_1 but not further. Not sure if you want to drop those (for that you could count observations per email and drop those that have less than 3 records).

time series group by and filter in dplyr r

Answers (1)

Related Questions