I am working on a project to find the number of times a service has to be re-worked but only within the first 30 days of the end of the original service. Using the sample datasets below:
work <- tribble(
~Client , ~Initial_Date_In, ~Initial_Date_Out,
'Car1', '2019-01-01', '2019-01-02',
'Car1', '2019-06-01', '2019-06-03',
'Car2', '2019-01-01', '2019-01-02',
'Truck1', '2019-06-01', '2019-06-02',
'Truck2', '2019-04-01', '2019-04-02',
'Truck3', '2019-08-01', '2019-08-04',
'Van1', '2019-01-01', '2019-01-02',
'Van2', '2019-06-01', '2019-06-02',
'Truck4', '2019-10-05', '2019-10-09',
)
rework <- tribble(
~Client , ~Date_In, ~Date_Out,
'Car1', '2019-01-05', '2019-01-07',
'Car1', '2019-01-09', '2019-01-10',
'Car1', '2019-06-09', '2019-06-10',
'Truck3', '2019-08-07', '2019-08-08',
'Van1', '2019-03-01', '2019-03-01',
'Van2', '2019-06-09', '2019-06-11',
'Truck4', '2019-11-01', '2019-11-02',
)
work$Initial_Date_In <- lubridate::as_date(work$Initial_Date_In)
work$Initial_Date_Out <- lubridate::as_date(work$Initial_Date_Out)
rework$Date_In <- lubridate::as_date(rework$Date_In)
rework$Date_Out <- lubridate::as_date(rework$Date_Out)
I tried multiple things, such as:
work %>%
  left_join(select(rework, Client, Date_Out), by = ("Client" = "Client")) %>%
  group_by(Client) %>%
  filter(max(Date_Out) > Initial_Date_Out & max(Date_Out) < (Initial_Date_Out + duration(30, "days")))
This code seems to pull only the max Initial_Date_In/Initial_Date_Out for each client, but I want every row from the work table that has an appropriate entry in rework (displaying only the max Date_Out from the rework table).
The expected output would be something like:
Client Initial_Date_In Initial_Date_Out Date_Out #_Of_Rework_Visits
Car1 2019-01-01 2019-01-02 2019-01-10 2
Car1 2019-06-01 2019-06-03 2019-06-10 1
Truck3 2019-08-01 2019-08-04 2019-08-08 1
Van2 2019-06-01 2019-06-02 2019-06-11 1
Truck4 2019-10-05 2019-10-09 2019-11-02 1
Any tips on what I am doing wrong?
Update
Here's an approach that avoids doing any kind of formal join.
Stack work and rework with bind_rows, then use a service_ix column (an index for service appointments within a given Client) as a grouping/summarizing variable:
work %>%
  mutate(service = "original") %>%
  rename(Date_In = "Initial_Date_In", Date_Out = "Initial_Date_Out") %>%
  group_by(Client) %>%
  mutate(service_ix = row_number()) %>%
  bind_rows(rework %>%
              mutate(service = "rework") %>%
              select(Client, Date_In, Date_Out, service)) %>%
  arrange(Client, Date_In) %>%
  fill(service_ix) %>%
  group_by(Client, service_ix) %>%
  summarise(initial_date_in = Date_In[service == "original"],
            initial_date_out = Date_Out[service == "original"],
            max_date_out = max(Date_Out),
            n_reworks = sum(service == "rework")) %>%
  filter(n_reworks > 0) %>%
  ungroup()
Unpacking a bit:
First, add a service column indicating original vs rework. Here's the combined df after bind_rows() and arrange():
# A tibble: 16 x 5
# Groups: Client [8]
Client Date_In Date_Out service service_ix
<chr> <date> <date> <chr> <int>
1 Car1 2019-01-01 2019-01-02 original 1
2 Car1 2019-01-05 2019-01-07 rework NA
3 Car1 2019-01-09 2019-01-10 rework NA
4 Car1 2019-06-01 2019-06-03 original 2
5 Car1 2019-06-09 2019-06-10 rework NA
6 Car2 2019-01-01 2019-01-02 original 1
7 Truck1 2019-06-01 2019-06-02 original 1
8 Truck2 2019-04-01 2019-04-02 original 1
9 Truck3 2019-08-01 2019-08-04 original 1
10 Truck3 2019-08-07 2019-08-08 rework NA
11 Truck4 2019-10-05 2019-10-09 original 1
12 Truck4 2019-11-01 2019-11-02 rework NA
13 Van1 2019-01-01 2019-01-02 original 1
14 Van1 2019-03-01 2019-03-01 rework NA
15 Van2 2019-06-01 2019-06-02 original 1
16 Van2 2019-06-09 2019-06-11 rework NA
The fill step replaces the service_ix NA values:
Client Date_In Date_Out service service_ix
<chr> <date> <date> <chr> <int>
1 Car1 2019-01-01 2019-01-02 original 1
2 Car1 2019-01-05 2019-01-07 rework 1
3 Car1 2019-01-09 2019-01-10 rework 1
4 Car1 2019-06-01 2019-06-03 original 2
5 Car1 2019-06-09 2019-06-10 rework 2
Finally, group by Client and service_ix, and summarize as needed.
# A tibble: 6 x 6
Client service_ix initial_date_in initial_date_out max_date_out n_reworks
<chr> <int> <date> <date> <date> <int>
1 Car1 1 2019-01-01 2019-01-02 2019-01-10 2
2 Car1 2 2019-06-01 2019-06-03 2019-06-10 1
3 Truck3 1 2019-08-01 2019-08-04 2019-08-08 1
4 Truck4 1 2019-10-05 2019-10-09 2019-11-02 1
5 Van1 1 2019-01-01 2019-01-02 2019-03-01 1
6 Van2 1 2019-06-01 2019-06-02 2019-06-11 1
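Note that the summary above counts every rework that follows an original service, so Van1's rework on 2019-03-01 (58 days after the original ended) is included. If the 30-day window from the question matters, one possible variation is to carry each original's Date_Out onto its rework rows and filter before counting. The following is a minimal sketch rather than a drop-in solution: it assumes the window is inclusive at both ends (Initial_Date_Out <= Date_Out <= Initial_Date_Out + 30), and the names stacked and result are only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Sample data from the question, with the date columns parsed as Date
work <- tribble(
  ~Client,  ~Initial_Date_In, ~Initial_Date_Out,
  "Car1",   "2019-01-01", "2019-01-02",
  "Car1",   "2019-06-01", "2019-06-03",
  "Car2",   "2019-01-01", "2019-01-02",
  "Truck1", "2019-06-01", "2019-06-02",
  "Truck2", "2019-04-01", "2019-04-02",
  "Truck3", "2019-08-01", "2019-08-04",
  "Van1",   "2019-01-01", "2019-01-02",
  "Van2",   "2019-06-01", "2019-06-02",
  "Truck4", "2019-10-05", "2019-10-09"
) %>%
  mutate(across(c(Initial_Date_In, Initial_Date_Out), as.Date))

rework <- tribble(
  ~Client,  ~Date_In,     ~Date_Out,
  "Car1",   "2019-01-05", "2019-01-07",
  "Car1",   "2019-01-09", "2019-01-10",
  "Car1",   "2019-06-09", "2019-06-10",
  "Truck3", "2019-08-07", "2019-08-08",
  "Van1",   "2019-03-01", "2019-03-01",
  "Van2",   "2019-06-09", "2019-06-11",
  "Truck4", "2019-11-01", "2019-11-02"
) %>%
  mutate(across(c(Date_In, Date_Out), as.Date))

# Stack both tables and tie each rework row to the preceding original service
stacked <- work %>%
  rename(Date_In = Initial_Date_In, Date_Out = Initial_Date_Out) %>%
  arrange(Client, Date_In) %>%
  group_by(Client) %>%
  mutate(service = "original", service_ix = row_number()) %>%
  ungroup() %>%
  bind_rows(mutate(rework, service = "rework")) %>%
  arrange(Client, Date_In) %>%
  group_by(Client) %>%
  fill(service_ix) %>%
  filter(!is.na(service_ix)) %>%  # drop reworks with no earlier original
  group_by(Client, service_ix) %>%
  mutate(initial_date_in  = Date_In[service == "original"],
         initial_date_out = Date_Out[service == "original"]) %>%
  ungroup()

# Keep only reworks that end within the assumed 30-day window, then count them
result <- stacked %>%
  filter(service == "rework",
         Date_Out >= initial_date_out,
         Date_Out <= initial_date_out + 30) %>%
  group_by(Client, initial_date_in, initial_date_out) %>%
  summarise(max_date_out = max(Date_Out),
            n_reworks = n(),
            .groups = "drop")

print(result)
```

With the sample data this should leave Van1 out, since its only rework ends well after the window closes.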
Here is something else to try, using the fuzzyjoin package.
First, calculate a Date_End_Service date which is 30 days after the Initial_Date_Out. Then, you can do a fuzzy_inner_join between work and rework, matching on Client, and making sure the Initial_Date_Out is <= Date_Out in rework, and the Date_End_Service is >= Date_Out.
library(tidyverse)
library(lubridate)
library(fuzzyjoin)
work %>%
  mutate(Date_End_Service = Initial_Date_Out + days(30)) %>%
  fuzzy_inner_join(rework,
                   by = c("Client" = "Client",
                          "Initial_Date_Out" = "Date_Out",
                          "Date_End_Service" = "Date_Out"),
                   match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(Client.x, Initial_Date_In) %>%
  summarise(n_service = n())
Output
Client.x Initial_Date_In n_service
<chr> <date> <int>
1 Car1 2019-01-01 2
2 Car1 2019-06-01 1
3 Truck3 2019-08-01 1
4 Truck4 2019-10-05 1
5 Van2 2019-06-01 1
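The same window can also be expressed without fuzzyjoin, at the cost of building every work/rework pair per Client before filtering. Below is a rough sketch using only dplyr that also keeps Initial_Date_Out and the latest rework Date_Out, as in the question's expected output. The column name n_rework_visits and the object result are my own placeholders, and the inclusive bounds mirror the match_fun used above.

```r
library(dplyr)
library(tibble)

# Sample data from the question, with the date columns parsed as Date
work <- tribble(
  ~Client,  ~Initial_Date_In, ~Initial_Date_Out,
  "Car1",   "2019-01-01", "2019-01-02",
  "Car1",   "2019-06-01", "2019-06-03",
  "Car2",   "2019-01-01", "2019-01-02",
  "Truck1", "2019-06-01", "2019-06-02",
  "Truck2", "2019-04-01", "2019-04-02",
  "Truck3", "2019-08-01", "2019-08-04",
  "Van1",   "2019-01-01", "2019-01-02",
  "Van2",   "2019-06-01", "2019-06-02",
  "Truck4", "2019-10-05", "2019-10-09"
) %>%
  mutate(across(c(Initial_Date_In, Initial_Date_Out), as.Date))

rework <- tribble(
  ~Client,  ~Date_In,     ~Date_Out,
  "Car1",   "2019-01-05", "2019-01-07",
  "Car1",   "2019-01-09", "2019-01-10",
  "Car1",   "2019-06-09", "2019-06-10",
  "Truck3", "2019-08-07", "2019-08-08",
  "Van1",   "2019-03-01", "2019-03-01",
  "Van2",   "2019-06-09", "2019-06-11",
  "Truck4", "2019-11-01", "2019-11-02"
) %>%
  mutate(across(c(Date_In, Date_Out), as.Date))

result <- work %>%
  # Pair every work row with every rework row of the same Client.
  # dplyr >= 1.1.0 warns about the many-to-many match here; passing
  # relationship = "many-to-many" silences that warning.
  inner_join(rework, by = "Client") %>%
  # Keep reworks that end on or after the original and within 30 days of it
  filter(Date_Out >= Initial_Date_Out,
         Date_Out <= Initial_Date_Out + 30) %>%
  group_by(Client, Initial_Date_In, Initial_Date_Out) %>%
  summarise(Date_Out = max(Date_Out),
            n_rework_visits = n(),
            .groups = "drop")

print(result)
```

For large tables the intermediate join can get big, which is where the fuzzy join or the stacking approach may be preferable.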