user12434805
user12434805

Reputation:

Mapping multiple columns in R including a single column containing comma-separated values

This problem involves mapping multiple columns while adhering to certain constraints. It looks very difficult to me, since I'm not sure how to unnest the region column in cat and allow matching the df data.

Problem: I want to match three columns from the data frame df to four columns on the data frame cat.

The result should adhere to the following conditions:

  1. loss_date in dfshould be >= StartDate and <= EndDate in cat.
  2. line_of_business in df should match line_of_business in cat.
  3. region in df should match region in cat (note: region in cat needs to be unnested).

If all three conditions are met, an output column should return logical TRUE, otherwise it should return logical FALSE.

DATA:

df
loss_date   line_of_business            region
10/8/2018   Property                      NB
1/31/2019   Commercial Property Line      BC
8/1/2018    Auto                          AB
3/20/2019   Personal Property             ON
11/10/2018  Homeowners Line             Alberta

cat
Start Date  End Date    line_of_business            region
11/1/2018   11/30/2018  Homeowners Line            Alberta
10/1/2018   10/21/2018  Property                 AB,BC,MB,ATL,ON,PQ
4/9/2018    4/10/2018   Commercial Property         ATL
4/26/2018   5/22/2018   Auto                   BC, AB, PQ, ON, ATL

OUTPUT:

loss_date   line_of_business            region      output
10/8/2018   Property                    MB          TRUE
1/31/2019   Commercial Property Line    BC          FALSE
8/1/2018    Auto                        AB          FALSE
3/20/2019   Personal Property           ON          FALSE
11/10/2018  Homeowners Line          Alberta         TRUE

Thank you so much for help.

Upvotes: 4

Views: 267

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102261

Here is a solution for base R:

df$output <- apply(df, 1, function(v) 
        with(cat, any(difftime(as.Date(v[1],format="%m/%d/%Y"),as.Date(StartDate,format="%m/%d/%Y"))>=0 & 
                    difftime(as.Date(v[1],format="%m/%d/%Y"),as.Date(EndDate,format="%m/%d/%Y"))<=0 &
                    v[2] == line_of_business &
                    grepl(v[3],region))))

yielding:

> df
   loss_date         line_of_business  region output
1  10/8/2018                 Property      MB   TRUE
2  1/31/2019 Commercial_Property_Line      BC  FALSE
3   8/1/2018                     Auto      AB  FALSE
4  3/20/2019        Personal_Property      ON  FALSE
5 11/10/2018          Homeowners_Line Alberta   TRUE

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389135

Here is one way using dplyr and separate_rows from tidyr. We bring the comma-separated values from region column into different rows and do a join with df. Convert the date using as.Date and check if loss_date is between StartDate and EndDate and if the region matches.

library(dplyr)

cat %>%
  tidyr::separate_rows(region, sep = ",") %>%
  right_join(df, by = "line_of_business") %>%
  mutate_at(vars(ends_with("ate")), as.Date, "%m/%d/%Y") %>%
  group_by(line_of_business) %>%
  summarise(temp = any(region.x %in% region.y & loss_date >= StartDate & 
                       loss_date <= EndDate)) %>%
  left_join(df)

#  line_of_business         temp  loss_date  region 
#  <chr>                    <lgl> <chr>      <chr>  
#1 Auto                     FALSE 8/1/2018   AB     
#2 Commercial_Property_Line FALSE 1/31/2019  BC     
#3 Homeowners_Line          TRUE  11/10/2018 Alberta
#4 Personal_Property        FALSE 3/20/2019  ON     
#5 Property                 TRUE  10/8/2018  MB    

data

df <- structure(list(loss_date = c("10/8/2018", "1/31/2019", "8/1/2018", 
"3/20/2019", "11/10/2018"), line_of_business = c("Property", 
"Commercial_Property_Line", "Auto", "Personal_Property", "Homeowners_Line"
), region = c("MB", "BC", "AB", "ON", "Alberta")), row.names = c(NA, 
-5L), class = "data.frame") 

cat <- structure(list(StartDate = c("11/1/2018", "10/1/2018", "4/9/2018", 
"4/26/2018"), EndDate = c("11/30/2018", "10/21/2018", "4/10/2018", 
"5/22/2018"), line_of_business = c("Homeowners_Line", "Property", 
"Commercial_Property", "Auto"), region = c("Alberta", "AB,BC,MB,ATL,ON,PQ", 
"ATL", "BC,AB,PQ,ON,ATL")), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 3

Related Questions