MalcMalcMalc
MalcMalcMalc

Reputation: 13

how to left join/merge in R based on 2 criteria: numerical match and date-range evaluation

I have two data frames: summary and hauled. I want to create a new column in hauled that returns the values from summary$rfp_id based on 2 matching criteria:

  1. Match summary$company_id with hauled$company_id
  2. Compare hauled$trans_date with summary$start_date and summary$end_date and return whichever rfp_id most closely matches based on the below explanation of possible outcomes.

summary dataframe:

rfp_id start_date end_date company_id
1 12/30/2022 2/28/2023 7
2 4/1/2022 6/30/2022 8
3 7/1/2022 8/30/2022 8
4 1/16/2022 1/16/2023 9
5 1/1/2023 2/6/2023 9

hauled dataframe (rfp_id = desired result):

trans# company_id trans_date rfp_id
11 7 1/14/2023 1
12 8 7/2/2022 3
13 8 3/20/2022 2
14 8 9/1/2022 3
15 9 1/15/2023 5

The first example (trans# = 11) returns rfp_id = 1 since company 7 only appears once in summary and hauled$trans_date of 1/14/2023 falls between the start/end dates of 12/30/2022 and 2/28/2023.

The second example (trans# = 12)returns rfp_id = 3 since the trans_date of 7/2/2022 falls between 7/1/2022-8/30/2022 (and not between 4/1/2022-6/30/2022).

The third example (trans# = 13 )returns rfp_id = 2 because the trans_date of 3/20/2022 falls outside both start/end date ranges, however it is closest to the start_date of 4/1/2022

The fourth example (trans# = 14 )returns rfp_id = 3 because the trans_date of 9/1/2022 falls outside both start/end date ranges, however it is closest to the end_date of 8/30/2022.

The fifth example (trans# = 15) returns rfp_id = 5 because when a trans_date falls between 2 or more start/end date ranges then the rfp_id to return is whichever has the latest end_date 

I don't have a strong background in R. Most of what I have tried has been from chatgpt. The code it spits out continuously throws off a 'many-to-many' error when executed.

Upvotes: 0

Views: 82

Answers (1)

PGSA
PGSA

Reputation: 3081

OK, this works but it's early here and I'm one hundred percent sure there is a MUCH more elegant and less laborious approach to this, I'm just not seeing it (yet). Maybe after coffee:

library(tidyverse)
# you can just load the required packages when you streamline this code, but I'm lazy.

# Preparing data from OP
summary_df <- tribble(
  ~rfp_id,  ~start_date,    ~end_date,  ~company_id,
  1,    "12/30/2022",   "2/28/2023",    7,
  2,    "4/1/2022", "6/30/2022",    8,
  3,    "7/1/2022", "8/30/2022",    8,
  4,    "1/16/2022",    "1/16/2023",    9,
  5,    "1/1/2023", "2/6/2023", 9
)

summary_df <- summary_df |> mutate(across(ends_with("_date"), ~ as.Date(.x, format = "%m/%d/%Y")))

hauled_df <- tribble( # note I used trans_no instead of trans# to avoid having to enclose it in back ticks each time.
  ~trans_no,    ~company_id,    ~trans_date,
  11,   7,  "1/14/2023",
  12,   8,  "7/2/2022",
  13,   8,  "3/20/2022",
  14,   8,  "9/1/2022",
  15,   9,  "1/15/2023",
)

hauled_df$trans_date <- as.Date(hauled_df$trans_date, format = "%m/%d/%Y")

I have made the assumption that there will ALWAYS be at least one match of company-id - if that's wrong, add a test here. I also assumed that if outside of all ranges, you wanted the nearest range border (start OR finish).

hauled_df$rfp_id <- unlist(pmap(.l = hauled_df,
                             .f = \(trans_no,   company_id, trans_date) {
                               com_id = company_id
                               summary_df |>
                                 filter(company_id == com_id) |>
                                 mutate(score = case_when(
                                   trans_date >= start_date & trans_date <= end_date ~ 0,
                                   trans_date < start_date ~ as.integer(start_date - trans_date),
                                   .default = as.integer(trans_date - end_date)
                                 )) |>
                                 filter(score == min(score)) |>
                                 filter(end_date == max(end_date)) |>
                                 select(rfp_id)
                             }
))

output:

> hauled_df
# A tibble: 5 × 4
  trans_no company_id trans_date rfp_id
     <dbl>      <dbl> <date>      <dbl>
1       11          7 2023-01-14      1
2       12          8 2022-07-02      3
3       13          8 2022-03-20      2
4       14          8 2022-09-01      3
5       15          9 2023-01-15      5

Explanation:

pmap(.l = hauled_df,

pmap is parallel map - basically it applies the function to each row of hauled df, feeding that row's values for trans_no, company_id, trans_date into the function:

.f = \(trans_no, company_id, trans_date) {
     com_id = company_id # almost certainly unnecessary
     summary_df |>
     filter(company_id == com_id) |>

take a copy of the summary_df and filter out all the rows without matching company IDs. We're then going to score each remaining row - lower is better/higher priority. Inside the date range scores 0. Outside it scores x where x is the number of days before or after the edge of the range.

     mutate(score = case_when(
     trans_date >= start_date & trans_date <= end_date ~ 0, # inside the date range
     trans_date < start_date ~ as.integer(start_date - trans_date), # before the date range
     .default = as.integer(trans_date - end_date) # after the date range
                                 )) |>

We then need to firstly take the best (lowest) score, then do any tie breaks on the latest end_date and extract the rfp_id from the one remaining row:

                                 filter(score == min(score)) |>
                                 filter(end_date == max(end_date)) |>
                                 select(rfp_id)
                             }

Upvotes: 3

Related Questions