Davide Lorino
Davide Lorino

Reputation: 915

Group By and Filter Between Dates

I'm trying filter an order table down to each persons first year worth of orders.

My data is in the following format, in which each row represents an order, but I have added customer-level columns to represent their first order date (Recruitment Date) as well as the date that marks a year since each customers first order (1st Year Since Recruitment) and second year since recruitment; the last column is the amount paid on the current order.

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1876202 obs. of  6 variables:
 $ Brand_Acc                 : chr  "B000000001" "B000000002" "B000000002" "B000000002" ...
 $ salesdate                 : Date, format: "2008-03-10" "2008-02-19" "2008-07-14" "2010-08-25" ...
 $ Recruitment Date          : Date, format: "2008-03-10" "2008-02-19" NA NA ...
 $ 1st Year Since Recruitment: Date, format: "2009-03-10" "2009-02-19" NA NA ...
 $ 2nd Year Since Recruitment: Date, format: "2010-03-10" "2010-02-19" NA NA ...
 $ TotalDiscount             : num  97.9 349.9 184.9 284.9 348.9 ...

I want to return a dataframe that captures each customers first year worth of orders.

I have tried the following:

df %>%
  group_by(Brand_Acc) %>%
  filter(salesdate, between(`Recruitment Date`, `1st Year Since Recruitment`))

But I get this error:

Error in filter_impl(.data, quo) : Evaluation error: argument "right" is missing, with no default.

What is the correct way of doing this?

Edit showing dput of first 5 rows:

dput(df)
structure(list(Brand_Acc = c("B000000001", "B000000002", "B000000002", 
"B000000002", "B000000006"), salesdate = structure(c(13948, 13928, 
14074, 14846, 13934), class = "Date"), ordertype = c("Recruitment", 
"Recruitment", "Conversion", "Active Order", "Recruitment"), 
    actv_channel = c("MainMail", "MainMail", "Outbound-Other", 
    "MainMail", "MainMail"), TotalDiscount = c(97.87, 349.88, 
    184.94, 284.94, 348.9), campaignparentid = c("9017", "9017", 
    "9035", "9557", "9017"), BrandAccount_Brand = c("wp", "wp", 
    "wp", "wp", "wp"), recrtype = c("STNRD", "STNRD", "STNRD", 
    "STNRD", "STNRD"), POA_CODE = structure(c(1937L, 2302L, 2302L, 
    2302L, 466L), .Label = c("0", "200", "800", "801", "804"), class = "factor"), 
    `Recruitment Date` = structure(c(13948, 13928, NA, NA, 13934
    ), class = "Date"), `1st Year Since Recruitment` = structure(c(14313, 
    14294, NA, NA, 14300), class = "Date"), `2nd Year Since Recruitment` = structure(c(14678, 
    14659, NA, NA, 14665), class = "Date"), `3rd Year Since Recruitment` = structure(c(15043, 
    15024, NA, NA, 15030), class = "Date")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))
> ```





Upvotes: 0

Views: 377

Answers (1)

Calum You
Calum You

Reputation: 15072

Surprisingly between is not vectorised in its left and right arguments, which you might have expected since it describes itself as a simple replacement for the combination of <= and >=. We simply have to resort to doing it the long way:

library(tidyverse)
df <- structure(list(Brand_Acc = c("B000000001", "B000000002", "B000000002", "B000000002", "B000000006"), salesdate = structure(c(13948, 13928, 14074, 14846, 13934), class = "Date"), ordertype = c("Recruitment", "Recruitment", "Conversion", "Active Order", "Recruitment"), actv_channel = c("MainMail", "MainMail", "Outbound-Other", "MainMail", "MainMail"), TotalDiscount = c(97.87, 349.88, 184.94, 284.94, 348.9), campaignparentid = c("9017", "9017", "9035", "9557", "9017"), BrandAccount_Brand = c("wp", "wp", "wp", "wp", "wp"), recrtype = c("STNRD", "STNRD", "STNRD", "STNRD", "STNRD"), POA_CODE = structure(c(1937L, 2302L, 2302L, 2302L, 466L), .Label = c("0", "200", "800", "801", "804"), class = "factor"), `Recruitment Date` = structure(c(13948, 13928, NA, NA, 13934), class = "Date"), `1st Year Since Recruitment` = structure(c(14313, 14294, NA, NA, 14300), class = "Date"), `2nd Year Since Recruitment` = structure(c(14678, 14659, NA, NA, 14665), class = "Date"), `3rd Year Since Recruitment` = structure(c(15043, 15024, NA, NA, 15030), class = "Date")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
df %>%
  group_by(Brand_Acc) %>%
  filter(salesdate >= `Recruitment Date` & salesdate <= `1st Year Since Recruitment`)
#> # A tibble: 3 x 13
#> # Groups:   Brand_Acc [3]
#>   Brand_Acc salesdate  ordertype actv_channel TotalDiscount
#>   <chr>     <date>     <chr>     <chr>                <dbl>
#> 1 B0000000… 2008-03-10 Recruitm… MainMail              97.9
#> 2 B0000000… 2008-02-19 Recruitm… MainMail             350. 
#> 3 B0000000… 2008-02-25 Recruitm… MainMail             349. 
#> # … with 8 more variables: campaignparentid <chr>,
#> #   BrandAccount_Brand <chr>, recrtype <chr>, POA_CODE <fct>, `Recruitment
#> #   Date` <date>, `1st Year Since Recruitment` <date>, `2nd Year Since
#> #   Recruitment` <date>, `3rd Year Since Recruitment` <date>

Created on 2019-02-13 by the reprex package (v0.2.1)

There was also a syntax error in between though that is not relevant now:

filter(salesdate %>% between(`Recruitment Date`, `1st Year Since Recruitment`))

Upvotes: 2

Related Questions