Reputation: 13
I have two data frames: summary and hauled. I want to create a new column in hauled that returns the values from summary$rfp_id based on 2 matching criteria:
summary dataframe:
rfp_id | start_date | end_date | company_id |
---|---|---|---|
1 | 12/30/2022 | 2/28/2023 | 7 |
2 | 4/1/2022 | 6/30/2022 | 8 |
3 | 7/1/2022 | 8/30/2022 | 8 |
4 | 1/16/2022 | 1/16/2023 | 9 |
5 | 1/1/2023 | 2/6/2023 | 9 |
hauled dataframe (rfp_id = desired result):
trans# | company_id | trans_date | rfp_id |
---|---|---|---|
11 | 7 | 1/14/2023 | 1 |
12 | 8 | 7/2/2022 | 3 |
13 | 8 | 3/20/2022 | 2 |
14 | 8 | 9/1/2022 | 3 |
15 | 9 | 1/15/2023 | 5 |
The first example (trans# = 11) returns rfp_id = 1 since company 7 only appears once in summary and hauled$trans_date of 1/14/2023 falls between the start/end dates of 12/30/2022 and 2/28/2023.
The second example (trans# = 12)returns rfp_id = 3 since the trans_date of 7/2/2022 falls between 7/1/2022-8/30/2022 (and not between 4/1/2022-6/30/2022).
The third example (trans# = 13 )returns rfp_id = 2 because the trans_date of 3/20/2022 falls outside both start/end date ranges, however it is closest to the start_date of 4/1/2022
The fourth example (trans# = 14 )returns rfp_id = 3 because the trans_date of 9/1/2022 falls outside both start/end date ranges, however it is closest to the end_date of 8/30/2022.
The fifth example (trans# = 15) returns rfp_id = 5 because when a trans_date falls between 2 or more start/end date ranges then the rfp_id to return is whichever has the latest end_date
I don't have a strong background in R. Most of what I have tried has been from chatgpt. The code it spits out continuously throws off a 'many-to-many' error when executed.
Upvotes: 0
Views: 82
Reputation: 3081
OK, this works but it's early here and I'm one hundred percent sure there is a MUCH more elegant and less laborious approach to this, I'm just not seeing it (yet). Maybe after coffee:
library(tidyverse)
# you can just load the required packages when you streamline this code, but I'm lazy.
# Preparing data from OP
summary_df <- tribble(
~rfp_id, ~start_date, ~end_date, ~company_id,
1, "12/30/2022", "2/28/2023", 7,
2, "4/1/2022", "6/30/2022", 8,
3, "7/1/2022", "8/30/2022", 8,
4, "1/16/2022", "1/16/2023", 9,
5, "1/1/2023", "2/6/2023", 9
)
summary_df <- summary_df |> mutate(across(ends_with("_date"), ~ as.Date(.x, format = "%m/%d/%Y")))
hauled_df <- tribble( # note I used trans_no instead of trans# to avoid having to enclose it in back ticks each time.
~trans_no, ~company_id, ~trans_date,
11, 7, "1/14/2023",
12, 8, "7/2/2022",
13, 8, "3/20/2022",
14, 8, "9/1/2022",
15, 9, "1/15/2023",
)
hauled_df$trans_date <- as.Date(hauled_df$trans_date, format = "%m/%d/%Y")
I have made the assumption that there will ALWAYS be at least one match of company-id - if that's wrong, add a test here. I also assumed that if outside of all ranges, you wanted the nearest range border (start OR finish).
hauled_df$rfp_id <- unlist(pmap(.l = hauled_df,
.f = \(trans_no, company_id, trans_date) {
com_id = company_id
summary_df |>
filter(company_id == com_id) |>
mutate(score = case_when(
trans_date >= start_date & trans_date <= end_date ~ 0,
trans_date < start_date ~ as.integer(start_date - trans_date),
.default = as.integer(trans_date - end_date)
)) |>
filter(score == min(score)) |>
filter(end_date == max(end_date)) |>
select(rfp_id)
}
))
output:
> hauled_df
# A tibble: 5 × 4
trans_no company_id trans_date rfp_id
<dbl> <dbl> <date> <dbl>
1 11 7 2023-01-14 1
2 12 8 2022-07-02 3
3 13 8 2022-03-20 2
4 14 8 2022-09-01 3
5 15 9 2023-01-15 5
Explanation:
pmap(.l = hauled_df,
pmap
is parallel map - basically it applies the function to each row of hauled df, feeding that row's values for trans_no, company_id, trans_date
into the function:
.f = \(trans_no, company_id, trans_date) {
com_id = company_id # almost certainly unnecessary
summary_df |>
filter(company_id == com_id) |>
take a copy of the summary_df
and filter out all the rows without matching company IDs.
We're then going to score each remaining row - lower is better/higher priority.
Inside the date range scores 0. Outside it scores x where x is the number of days before or after the edge of the range.
mutate(score = case_when(
trans_date >= start_date & trans_date <= end_date ~ 0, # inside the date range
trans_date < start_date ~ as.integer(start_date - trans_date), # before the date range
.default = as.integer(trans_date - end_date) # after the date range
)) |>
We then need to firstly take the best (lowest) score
, then do any tie breaks on the latest end_date
and extract the rfp_id
from the one remaining row:
filter(score == min(score)) |>
filter(end_date == max(end_date)) |>
select(rfp_id)
}
Upvotes: 3