Reputation:
This problem involves mapping multiple columns while adhering to certain constraints. It looks very difficult to me, since I'm not sure how to unnest the region
column in cat
and allow matching the df
data.
Problem: I want to match three columns from the data frame df
to four columns on the data frame cat
.
The result should adhere to the following conditions:
loss_date
in df
should be >= StartDate
and <= EndDate
in cat
.line_of_business
in df
should match line_of_business
in cat
.region
in df
should match region
in cat
(note: region
in cat
needs to be unnested).If all three conditions are met, an output column should return logical TRUE
, otherwise it should return logical FALSE
.
DATA:
df
loss_date line_of_business region
10/8/2018 Property NB
1/31/2019 Commercial Property Line BC
8/1/2018 Auto AB
3/20/2019 Personal Property ON
11/10/2018 Homeowners Line Alberta
cat
Start Date End Date line_of_business region
11/1/2018 11/30/2018 Homeowners Line Alberta
10/1/2018 10/21/2018 Property AB,BC,MB,ATL,ON,PQ
4/9/2018 4/10/2018 Commercial Property ATL
4/26/2018 5/22/2018 Auto BC, AB, PQ, ON, ATL
OUTPUT:
loss_date line_of_business region output
10/8/2018 Property MB TRUE
1/31/2019 Commercial Property Line BC FALSE
8/1/2018 Auto AB FALSE
3/20/2019 Personal Property ON FALSE
11/10/2018 Homeowners Line Alberta TRUE
Thank you so much for help.
Upvotes: 4
Views: 267
Reputation: 102261
Here is a solution for base R
:
df$output <- apply(df, 1, function(v)
with(cat, any(difftime(as.Date(v[1],format="%m/%d/%Y"),as.Date(StartDate,format="%m/%d/%Y"))>=0 &
difftime(as.Date(v[1],format="%m/%d/%Y"),as.Date(EndDate,format="%m/%d/%Y"))<=0 &
v[2] == line_of_business &
grepl(v[3],region))))
yielding:
> df
loss_date line_of_business region output
1 10/8/2018 Property MB TRUE
2 1/31/2019 Commercial_Property_Line BC FALSE
3 8/1/2018 Auto AB FALSE
4 3/20/2019 Personal_Property ON FALSE
5 11/10/2018 Homeowners_Line Alberta TRUE
Upvotes: 1
Reputation: 389135
Here is one way using dplyr
and separate_rows
from tidyr
. We bring the comma-separated values from region
column into different rows and do a join with df
. Convert the date using as.Date
and check if loss_date
is between StartDate
and EndDate
and if the region
matches.
library(dplyr)
cat %>%
tidyr::separate_rows(region, sep = ",") %>%
right_join(df, by = "line_of_business") %>%
mutate_at(vars(ends_with("ate")), as.Date, "%m/%d/%Y") %>%
group_by(line_of_business) %>%
summarise(temp = any(region.x %in% region.y & loss_date >= StartDate &
loss_date <= EndDate)) %>%
left_join(df)
# line_of_business temp loss_date region
# <chr> <lgl> <chr> <chr>
#1 Auto FALSE 8/1/2018 AB
#2 Commercial_Property_Line FALSE 1/31/2019 BC
#3 Homeowners_Line TRUE 11/10/2018 Alberta
#4 Personal_Property FALSE 3/20/2019 ON
#5 Property TRUE 10/8/2018 MB
data
df <- structure(list(loss_date = c("10/8/2018", "1/31/2019", "8/1/2018",
"3/20/2019", "11/10/2018"), line_of_business = c("Property",
"Commercial_Property_Line", "Auto", "Personal_Property", "Homeowners_Line"
), region = c("MB", "BC", "AB", "ON", "Alberta")), row.names = c(NA,
-5L), class = "data.frame")
cat <- structure(list(StartDate = c("11/1/2018", "10/1/2018", "4/9/2018",
"4/26/2018"), EndDate = c("11/30/2018", "10/21/2018", "4/10/2018",
"5/22/2018"), line_of_business = c("Homeowners_Line", "Property",
"Commercial_Property", "Auto"), region = c("Alberta", "AB,BC,MB,ATL,ON,PQ",
"ATL", "BC,AB,PQ,ON,ATL")), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 3