Reputation: 86
my issue is basically this: I have 2 Datasets Booking Data (BD) and Order Data (OD). BD represents orders that have been booked but not paid for/confirmed whereas OD only has records for confirmed orders (and consequently they have an Order_ID). Hence there are more records in BD than in OD.
Both of them have roughly the same structure except for the fact that the OD table has another column called Order_ID. I created a new table from both data frames which look like this:
OD
Order_ID Departure_date created_at_date TF_PP
<dbl> <dttm> <dttm> <dbl>
1 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045
2 792563 2021-07-17 00:00:00 2021-07-02 00:00:00 9045
3 794073 2021-07-17 00:00:00 2021-07-03 00:00:00 7524
4 795797 2021-07-17 00:00:00 2021-07-03 00:00:00 9045
5 796617 2021-07-17 00:00:00 2021-07-04 00:00:00 9045
6 797848 2021-07-17 00:00:00 2021-07-04 00:00:00 9045
BD
Departure_date created_at_date TF_PP
1: 2021-07-17 2021-07-02 9045
2: 2021-07-17 2021-07-02 9045
3: 2021-07-17 2021-07-02 9045
4: 2021-07-17 2021-07-03 9045
5: 2021-07-17 2021-07-03 7524
6: 2021-07-17 2021-07-03 9045
7: 2021-07-17 2021-07-03 9045
8: 2021-07-17 2021-07-04 5142
9: 2021-07-17 2021-07-04 9045
10: 2021-07-17 2021-07-04 10000
PROBLEM
The issue I am facing is that I want to write a function, which for EACH Order_ID in OD, takes the created_at_date and the respective value of TF_PP and finds a TF_PP in BD which is lower than OD's TF_PP by > 2000 AND is after the created_at_date AND shares the same departure_date of the Order_ID in OD. Then the function returns a new dataframe which has the difference of both (difference being > 2000) as well as an Order_ID, Departure_Date and created_at_date column. However, if the diff < 2000 then the function does not return that in the new dataframe.
Output:
Order_ID Dep_Date OD(Created_at_Date) TF BD(Created_at_Date) TF Diff
766787 2021-07-17 2021-07-02 9040 2021-07-04 6950 2090
766787 2021-07-17 2021-07-02 9040 2021-07-12 6895 2145
839265 2021-08-20 2021-08-08 12987 2021-08-15 10000 2987
If my question is confusion or needs further clarification please do let me know
EDIT dput output
OD:
structure(list(Order_ID = c(792251, 792563, 794073, 795797, 796617,
797848, 798374, 798990, 801121, 801643, 808494, 809900, 810710,
814812, 815040, 815257, 817469, 819219), Departure_date = structure(c(1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000,
1625270400, 1625270400, 1625356800, 1625356800, 1625356800, 1625443200,
1625443200, 1625529600, 1625702400, 1625788800, 1625788800, 1625961600,
1625961600, 1625961600, 1626048000, 1626048000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), TF_PP = c(9045, 9045, 7524, 9045, 9045, 9045, 9045,
11245, 9045, 11245, 12945, 12945, 12945, 12945, 12945, 12945,
14945, 14945)), row.names = c(NA, -18L), groups = structure(list(
Order_ID = c(792251, 792563, 794073, 795797, 796617, 797848,
798374, 798990, 801121, 801643, 808494, 809900, 810710, 814812,
815040, 815257, 817469, 819219), .rows = structure(list(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
BD:
structure(list(Departure_date = structure(c(1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000,
1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000,
1625184000, 1625270400, 1625270400, 1625270400, 1625270400, 1625356800,
1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 1625356800,
1625356800, 1625443200, 1625443200, 1625443200, 1625529600, 1625529600,
1625529600, 1625529600, 1625529600, 1625529600, 1625702400, 1625702400,
1625702400, 1625702400, 1625702400, 1625702400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), TF_PP = c(9045, 9045, 9045, 9045, 7524, 9045, 9045,
9045, 9045, 9045, 9045, 9045, 9045, 9045, 9045, 11245, 9045,
9045, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 11245,
11245, 12945, 12945, 12945)), row.names = c(NA, -30L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000021d05e81ef0>)
'
Upvotes: 1
Views: 81
Reputation: 18541
Although this question is tagged with {dplyr}, this kind of unequal join is easier done using {data.table}:
library(data.table)
OD <- as.data.table(OD)
BD <- as.data.table(BD)
OD[BD,
on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
,`:=`("diff" = i.TF_PP - TF_PP)
][
diff > 2000]
#> Order_ID Departure_date created_at_date TF_PP i.TF_PP diff
#> 1: 792251 2021-07-17 2021-07-05 9045 11245 2200
#> 2: 792563 2021-07-17 2021-07-05 9045 11245 2200
#> 3: 794073 2021-07-17 2021-07-05 7524 11245 3721
#> 4: 795797 2021-07-17 2021-07-05 9045 11245 2200
#> 5: 796617 2021-07-17 2021-07-05 9045 11245 2200
#> ---
#> 99: 795797 2021-07-17 2021-07-08 9045 12945 3900
#> 100: 796617 2021-07-17 2021-07-08 9045 12945 3900
#> 101: 797848 2021-07-17 2021-07-08 9045 12945 3900
#> 102: 798374 2021-07-17 2021-07-08 9045 12945 3900
#> 103: 801121 2021-07-17 2021-07-08 9045 12945 3900
Created on 2021-08-27 by the reprex package (v2.0.1)
With {dplyr} we would use a full_join
, then create diff
and then do the filtering:
library(dplyr)
OD %>%
full_join(BD, by = "Departure_date") %>%
mutate(diff = TF_PP.y - TF_PP.x) %>%
filter(created_at_date.x < created_at_date.y,
diff > 2000)
#> # A tibble: 103 x 7
#> # Groups: Order_ID [8]
#> Order_ID Departure_date created_at_date.x TF_PP.x created_at_date.y
#> <dbl> <dttm> <dttm> <dbl> <dttm>
#> 1 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-05 00:00:00
#> 2 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 3 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 4 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 5 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 6 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 7 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-06 00:00:00
#> 8 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> 9 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> 10 792251 2021-07-17 00:00:00 2021-07-02 00:00:00 9045 2021-07-08 00:00:00
#> # ... with 93 more rows, and 2 more variables: TF_PP.y <dbl>, diff <dbl>
Created on 2021-08-27 by the reprex package (v2.0.1)
As pointed out in the comments, with the {data.table} approach only keeps the created_at_date
of BD
. To keep the created_at_date
of the OD
table we need to assign it a different name first:
OD[, "create_at_date_OD" := created_at_date
][BD,
on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
,`:=`("diff" = i.TF_PP - TF_PP)
][
diff > 2000]
#> Order_ID Departure_date created_at_date TF_PP create_at_date_OD i.TF_PP
#> 1: 792251 2021-07-17 2021-07-05 9045 2021-07-02 11245
#> 2: 792563 2021-07-17 2021-07-05 9045 2021-07-02 11245
#> 3: 794073 2021-07-17 2021-07-05 7524 2021-07-03 11245
#> 4: 795797 2021-07-17 2021-07-05 9045 2021-07-03 11245
#> 5: 796617 2021-07-17 2021-07-05 9045 2021-07-04 11245
#> ---
#> 99: 795797 2021-07-17 2021-07-08 9045 2021-07-03 12945
#> 100: 796617 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 101: 797848 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 102: 798374 2021-07-17 2021-07-08 9045 2021-07-04 12945
#> 103: 801121 2021-07-17 2021-07-08 9045 2021-07-05 12945
#> diff
#> 1: 2200
#> 2: 2200
#> 3: 3721
#> 4: 2200
#> 5: 2200
#> ---
#> 99: 3900
#> 100: 3900
#> 101: 3900
#> 102: 3900
#> 103: 3900
Created on 2021-08-29 by the reprex package (v0.3.0)
Upvotes: 1
Reputation: 2210
Perhaps something like that, using pure SQL:
library(sqldf)
sqldf("
SELECT *,OD.TF_PP-BD.TF_PP as diff
FROM OD,BD
WHERE(OD.created_at_date>BD.created_at_date)
AND(OD.TF_PP-BD.TF_PP>2000)
AND(BD.Departure_date=OD.Departure_date)")
Upvotes: 0