MSasta
MSasta

Reputation: 86

Writing R Function that finds difference in two Data Frames

my issue is basically this: I have 2 Datasets Booking Data (BD) and Order Data (OD). BD represents orders that have been booked but not paid for/confirmed whereas OD only has records for confirmed orders (and consequently they have an Order_ID). Hence there are more records in BD than in OD.

Both of them have roughly the same structure except for the fact that the OD table has another column called Order_ID. I created a new table from both data frames which look like this:

OD
Order_ID Departure_date      created_at_date     TF_PP
     <dbl> <dttm>              <dttm>              <dbl>
1   792251 2021-07-17 00:00:00 2021-07-02 00:00:00  9045
2   792563 2021-07-17 00:00:00 2021-07-02 00:00:00  9045
3   794073 2021-07-17 00:00:00 2021-07-03 00:00:00  7524
4   795797 2021-07-17 00:00:00 2021-07-03 00:00:00  9045
5   796617 2021-07-17 00:00:00 2021-07-04 00:00:00  9045
6   797848 2021-07-17 00:00:00 2021-07-04 00:00:00  9045
BD       
Departure_date         created_at_date     TF_PP
1:     2021-07-17      2021-07-02          9045
2:     2021-07-17      2021-07-02          9045
3:     2021-07-17      2021-07-02          9045
4:     2021-07-17      2021-07-03          9045
5:     2021-07-17      2021-07-03          7524
6:     2021-07-17      2021-07-03          9045
7:     2021-07-17      2021-07-03          9045
8:     2021-07-17      2021-07-04          5142
9:     2021-07-17      2021-07-04          9045
10:    2021-07-17      2021-07-04          10000

PROBLEM

The issue I am facing is that I want to write a function, which for EACH Order_ID in OD, takes the created_at_date and the respective value of TF_PP and finds a TF_PP in BD which is lower than OD's TF_PP by > 2000 AND is after the created_at_date AND shares the same departure_date of the Order_ID in OD. Then the function returns a new dataframe which has the difference of both (difference being > 2000) as well as an Order_ID, Departure_Date and created_at_date column. However, if the diff < 2000 then the function does not return that in the new dataframe.

Output:
Order_ID    Dep_Date    OD(Created_at_Date) TF  BD(Created_at_Date) TF   Diff
766787      2021-07-17   2021-07-02        9040     2021-07-04     6950  2090
766787      2021-07-17   2021-07-02        9040     2021-07-12     6895  2145
839265      2021-08-20   2021-08-08        12987    2021-08-15     10000 2987

If my question is confusion or needs further clarification please do let me know

EDIT dput output

OD:


structure(list(Order_ID = c(792251, 792563, 794073, 795797, 796617, 
797848, 798374, 798990, 801121, 801643, 808494, 809900, 810710, 
814812, 815040, 815257, 817469, 819219), Departure_date = structure(c(1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000, 
1625270400, 1625270400, 1625356800, 1625356800, 1625356800, 1625443200, 
1625443200, 1625529600, 1625702400, 1625788800, 1625788800, 1625961600, 
1625961600, 1625961600, 1626048000, 1626048000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), TF_PP = c(9045, 9045, 7524, 9045, 9045, 9045, 9045, 
11245, 9045, 11245, 12945, 12945, 12945, 12945, 12945, 12945, 
14945, 14945)), row.names = c(NA, -18L), groups = structure(list(
    Order_ID = c(792251, 792563, 794073, 795797, 796617, 797848, 
    798374, 798990, 801121, 801643, 808494, 809900, 810710, 814812, 
    815040, 815257, 817469, 819219), .rows = structure(list(1L, 
        2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
        15L, 16L, 17L, 18L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -18L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

BD:


structure(list(Departure_date = structure(c(1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 1626480000, 
1626480000, 1626480000, 1626480000, 1626480000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), created_at_date = structure(c(1625184000, 1625184000, 
1625184000, 1625270400, 1625270400, 1625270400, 1625270400, 1625356800, 
1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 1625356800, 
1625356800, 1625443200, 1625443200, 1625443200, 1625529600, 1625529600, 
1625529600, 1625529600, 1625529600, 1625529600, 1625702400, 1625702400, 
1625702400, 1625702400, 1625702400, 1625702400), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), TF_PP = c(9045, 9045, 9045, 9045, 7524, 9045, 9045, 
9045, 9045, 9045, 9045, 9045, 9045, 9045, 9045, 11245, 9045, 
9045, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 11245, 
11245, 12945, 12945, 12945)), row.names = c(NA, -30L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000021d05e81ef0>)
'

Upvotes: 1

Views: 81

Answers (2)

TimTeaFan
TimTeaFan

Reputation: 18541

Although this question is tagged with {dplyr}, this kind of unequal join is easier done using {data.table}:

library(data.table)

OD <- as.data.table(OD)
BD <- as.data.table(BD)

OD[BD, 
     on = .(created_at_date < created_at_date, Departure_date = Departure_date)
   ][
     ,`:=`("diff" = i.TF_PP - TF_PP)
   ][
     diff > 2000]

#>      Order_ID Departure_date created_at_date TF_PP i.TF_PP diff
#>   1:   792251     2021-07-17      2021-07-05  9045   11245 2200
#>   2:   792563     2021-07-17      2021-07-05  9045   11245 2200
#>   3:   794073     2021-07-17      2021-07-05  7524   11245 3721
#>   4:   795797     2021-07-17      2021-07-05  9045   11245 2200
#>   5:   796617     2021-07-17      2021-07-05  9045   11245 2200
#>  ---                                                           
#>  99:   795797     2021-07-17      2021-07-08  9045   12945 3900
#> 100:   796617     2021-07-17      2021-07-08  9045   12945 3900
#> 101:   797848     2021-07-17      2021-07-08  9045   12945 3900
#> 102:   798374     2021-07-17      2021-07-08  9045   12945 3900
#> 103:   801121     2021-07-17      2021-07-08  9045   12945 3900

Created on 2021-08-27 by the reprex package (v2.0.1)

With {dplyr} we would use a full_join, then create diff and then do the filtering:

library(dplyr)

OD %>% 
  full_join(BD, by = "Departure_date") %>% 
  mutate(diff = TF_PP.y - TF_PP.x) %>% 
  filter(created_at_date.x < created_at_date.y,
         diff > 2000)

#> # A tibble: 103 x 7
#> # Groups:   Order_ID [8]
#>    Order_ID Departure_date      created_at_date.x   TF_PP.x created_at_date.y  
#>       <dbl> <dttm>              <dttm>                <dbl> <dttm>             
#>  1   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-05 00:00:00
#>  2   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  3   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  4   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  5   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  6   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  7   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-06 00:00:00
#>  8   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#>  9   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#> 10   792251 2021-07-17 00:00:00 2021-07-02 00:00:00    9045 2021-07-08 00:00:00
#> # ... with 93 more rows, and 2 more variables: TF_PP.y <dbl>, diff <dbl>

Created on 2021-08-27 by the reprex package (v2.0.1)

As pointed out in the comments, with the {data.table} approach only keeps the created_at_date of BD. To keep the created_at_date of the OD table we need to assign it a different name first:

OD[, "create_at_date_OD" := created_at_date
][BD,  
   on = .(created_at_date < created_at_date, Departure_date = Departure_date)
][
  ,`:=`("diff" = i.TF_PP - TF_PP)
][
  diff > 2000]

#>      Order_ID Departure_date created_at_date TF_PP create_at_date_OD i.TF_PP
#>   1:   792251     2021-07-17      2021-07-05  9045        2021-07-02   11245
#>   2:   792563     2021-07-17      2021-07-05  9045        2021-07-02   11245
#>   3:   794073     2021-07-17      2021-07-05  7524        2021-07-03   11245
#>   4:   795797     2021-07-17      2021-07-05  9045        2021-07-03   11245
#>   5:   796617     2021-07-17      2021-07-05  9045        2021-07-04   11245
#>  ---                                                                        
#>  99:   795797     2021-07-17      2021-07-08  9045        2021-07-03   12945
#> 100:   796617     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 101:   797848     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 102:   798374     2021-07-17      2021-07-08  9045        2021-07-04   12945
#> 103:   801121     2021-07-17      2021-07-08  9045        2021-07-05   12945
#>      diff
#>   1: 2200
#>   2: 2200
#>   3: 3721
#>   4: 2200
#>   5: 2200
#>  ---     
#>  99: 3900
#> 100: 3900
#> 101: 3900
#> 102: 3900
#> 103: 3900

Created on 2021-08-29 by the reprex package (v0.3.0)

Upvotes: 1

Nicolas2
Nicolas2

Reputation: 2210

Perhaps something like that, using pure SQL:

library(sqldf)
sqldf("
SELECT *,OD.TF_PP-BD.TF_PP as diff
FROM OD,BD
WHERE(OD.created_at_date>BD.created_at_date)
  AND(OD.TF_PP-BD.TF_PP>2000)
  AND(BD.Departure_date=OD.Departure_date)")

Upvotes: 0

Related Questions