mr analyst
mr analyst

Reputation: 79

Joining two dataframe df1 and df2 where date in df2 should be first date after date in df1 i.e earliest date after date in df1

--Problem statement description :- I have two df - df1 and df2 . df1 contains data of quantity buy of id=7 and df2 contain data of quantity sold of same id=7. Both df1 and df2 is sorted according to date i.e from oldest to latest date and every time only 1 quantity is bought or sold.

--df1

date_buy id qty_buy rolling_sum_qty_buy
30-07-2019 7 1 1
20-10-2019 7 1 2
17-01-2020 7 1 3
15-02-2020 7 1 4
15-02-2020 7 1 5
15-02-2020 7 1 6
14-07-2021 7 1 7
19-09-2021 7 1 8
25-12-2021 7 1 9
30-12-2021 7 1 10
10-02-2022 7 1 11
15-03-2022 7 1 12
15-03-2022 7 1 13
14-06-2022 7 1 14

--df2

date_sold id qty_sold rolling_sum_qty_sold
01-08-2019 7 1 1
15-09-2019 7 1 2
27-12-2019 7 1 3
01-02-2020 7 1 4
12-02-2020 7 1 5
25-07-2021 7 1 6
25-07-2021 7 1 7
28-08-2021 7 1 8
10-09-2021 7 1 9
12-09-2021 7 1 10
25-04-2022 7 1 11

-- What i want Now, i want to join this two dataframe df1 and df2 on two condition

  1. for every date i.e date_buy column in df1 i should get output where date i.e date_sold is greater than date_buy and i want first date i.e. date_sold which is greater than that particular date i.e date_buy.

  2. i also want those rows from df1 in my output which does not get joined with df2 so that i can easily find out the remaining quantity because in df1 i have quantity buy and after joining with df2 i will get quantity sold, so the cases where i get null values then in that case i can assume that that much quantity is remaining.

--My output:- Earlier when there was no datecondition then i was simply using left join i.e merging both df to join df1 and df2 on rolling sum condition and where there was null cases i was taking sum of qty to get remaining qty but right now i have that date condition too so i cant use rolling_sum_cond column directly in join condition.

-- code ''' df3= df1.merge(df2,how='left',left_on=['rolling_sum_qty_buy'],right_on=['rolling_sum_qty_sold']) '''

-- output which i was getting without using any date condition

date_buy id qty_buy rolling_sum_qty_buy date_sold id-2 qty_sold rolling_sum_qty_sold
30-07-2019 7 1 1 01-08-2019 7 1 1
20-10-2019 7 1 2 15-09-2019 7 1 2
17-01-2020 7 1 3 27-12-2019 7 1 3
15-02-2020 7 1 4 01-02-2020 7 1 4
15-02-2020 7 1 5 12-02-2020 7 1 5
15-02-2020 7 1 6 25-07-2021 7 1 6
14-07-2021 7 1 7 25-07-2021 7 1 7
19-09-2021 7 1 8 28-08-2021 7 1 8
25-12-2021 7 1 9 10-09-2021 7 1 9
30-12-2021 7 1 10 12-09-2021 7 1 10
10-02-2022 7 1 11 25-04-2022 7 1 11
15-03-2022 7 1 12 Nan Nan Nan Nan
15-03-2022 7 1 13 Nan Nan Nan Nan
14-06-2022 7 1 14 Nan Nan Nan Nan

--**Now i have to use date condition also to get the required output

--EXPECTED OUTPUT

date_buy id qty_buy rolling_sum_qty_buy date_sold id qty_sold rolling_sum_qty_sold
30-07-2019 7 1 1 01-08-2019 7 1 1
20-10-2019 7 1 2 27-12-2019 7 1 3
17-01-2020 7 1 3 01-02-2020 7 1 4
15-02-2020 7 1 4 25-07-2021 7 1 6
15-02-2020 7 1 5 25-07-2021 7 1 7
15-02-2020 7 1 6 28-08-2021 7 1 8
14-07-2021 7 1 7 10-09-2021 7 1 9
19-09-2021 7 1 8 25-04-2022 7 1 11
25-12-2021 7 1 9 NaN NaN NaN NaN
30-12-2021 7 1 10 NaN NaN NaN NaN
10-02-2022 7 1 11 NaN NaN NaN NaN
15-03-2022 7 1 12 NaN NaN NaN NaN
15-03-2022 7 1 13 NaN NaN NaN NaN
14-06-2022 7 1 14 NaN NaN NaN NaN

-- Please help me to get the following output. Any help would be appreciated. We can also use any loop or can also define any custom function or can create any extra column if required to get desired output.

Upvotes: 0

Views: 79

Answers (1)

maow
maow

Reputation: 2887

I cannot get the complete answer, but it might help a little:

pd.merge_asof is essentially merging on the closest datetime. If you format you DataFrame as datetime64[ns]

df1["date_sold"] = pd.to_datetime(df1["date_sold"], format="%d-%m-%Y", exact=False)
df2["date_sold"] = pd.to_datetime(df2["date_sold"], format="%d-%m-%Y", exact=False)

you can merge them like this.

Out[32]: 
     date_buy  id_x  qty_buy  rolling_sum_qty_buy  date_sold  id_y  qty_sold  rolling_sum_qty_sold
0  2019-07-30     7        1                    1 2019-08-01   7.0       1.0                   1.0
1  2019-10-20     7        1                    2 2019-12-27   7.0       1.0                   3.0
2  2020-01-17     7        1                    3 2020-02-01   7.0       1.0                   4.0
3  2020-02-15     7        1                    4 2021-07-25   7.0       1.0                   6.0
4  2020-02-15     7        1                    5 2021-07-25   7.0       1.0                   6.0
5  2020-02-15     7        1                    6 2021-07-25   7.0       1.0                   6.0
6  2021-07-14     7        1                    7 2021-07-25   7.0       1.0                   6.0
7  2021-09-19     7        1                    8 2022-04-25   7.0       1.0                  11.0
8  2021-12-25     7        1                    9 2022-04-25   7.0       1.0                  11.0
9  2021-12-30     7        1                   10 2022-04-25   7.0       1.0                  11.0
10 2022-02-10     7        1                   11 2022-04-25   7.0       1.0                  11.0
11 2022-03-15     7        1                   12 2022-04-25   7.0       1.0                  11.0
12 2022-03-15     7        1                   13 2022-04-25   7.0       1.0                  11.0
13 2022-06-14     7        1                   14        NaT   NaN       NaN                   NaN

This will match every buy with the first sold. It is capable of e.g. skipping the sale on 2019-09-15. However by definition it will do a left join and therefore just repeat elements if they are matching multiple. So instead of rolling_sum_qty_sold = 6, 7, 8, 9 it just gives 6, 6, 6, 6 and instead of 11, NaN, NaN, ... it just keeps repeating 11.

Hopefully this still helps you get on the right track :)

Upvotes: 1

Related Questions