Reputation: 79
--Problem statement description :- I have two df - df1 and df2 . df1 contains data of quantity buy of id=7 and df2 contain data of quantity sold of same id=7. Both df1 and df2 is sorted according to date i.e from oldest to latest date and every time only 1 quantity is bought or sold.
--df1
date_buy | id | qty_buy | rolling_sum_qty_buy |
---|---|---|---|
30-07-2019 | 7 | 1 | 1 |
20-10-2019 | 7 | 1 | 2 |
17-01-2020 | 7 | 1 | 3 |
15-02-2020 | 7 | 1 | 4 |
15-02-2020 | 7 | 1 | 5 |
15-02-2020 | 7 | 1 | 6 |
14-07-2021 | 7 | 1 | 7 |
19-09-2021 | 7 | 1 | 8 |
25-12-2021 | 7 | 1 | 9 |
30-12-2021 | 7 | 1 | 10 |
10-02-2022 | 7 | 1 | 11 |
15-03-2022 | 7 | 1 | 12 |
15-03-2022 | 7 | 1 | 13 |
14-06-2022 | 7 | 1 | 14 |
--df2
date_sold | id | qty_sold | rolling_sum_qty_sold |
---|---|---|---|
01-08-2019 | 7 | 1 | 1 |
15-09-2019 | 7 | 1 | 2 |
27-12-2019 | 7 | 1 | 3 |
01-02-2020 | 7 | 1 | 4 |
12-02-2020 | 7 | 1 | 5 |
25-07-2021 | 7 | 1 | 6 |
25-07-2021 | 7 | 1 | 7 |
28-08-2021 | 7 | 1 | 8 |
10-09-2021 | 7 | 1 | 9 |
12-09-2021 | 7 | 1 | 10 |
25-04-2022 | 7 | 1 | 11 |
-- What i want Now, i want to join this two dataframe df1 and df2 on two condition
for every date i.e date_buy column in df1 i should get output where date i.e date_sold is greater than date_buy and i want first date i.e. date_sold which is greater than that particular date i.e date_buy.
i also want those rows from df1 in my output which does not get joined with df2 so that i can easily find out the remaining quantity because in df1 i have quantity buy and after joining with df2 i will get quantity sold, so the cases where i get null values then in that case i can assume that that much quantity is remaining.
--My output:- Earlier when there was no datecondition then i was simply using left join i.e merging both df to join df1 and df2 on rolling sum condition and where there was null cases i was taking sum of qty to get remaining qty but right now i have that date condition too so i cant use rolling_sum_cond column directly in join condition.
-- code ''' df3= df1.merge(df2,how='left',left_on=['rolling_sum_qty_buy'],right_on=['rolling_sum_qty_sold']) '''
-- output which i was getting without using any date condition
date_buy | id | qty_buy | rolling_sum_qty_buy | date_sold | id-2 | qty_sold | rolling_sum_qty_sold |
---|---|---|---|---|---|---|---|
30-07-2019 | 7 | 1 | 1 | 01-08-2019 | 7 | 1 | 1 |
20-10-2019 | 7 | 1 | 2 | 15-09-2019 | 7 | 1 | 2 |
17-01-2020 | 7 | 1 | 3 | 27-12-2019 | 7 | 1 | 3 |
15-02-2020 | 7 | 1 | 4 | 01-02-2020 | 7 | 1 | 4 |
15-02-2020 | 7 | 1 | 5 | 12-02-2020 | 7 | 1 | 5 |
15-02-2020 | 7 | 1 | 6 | 25-07-2021 | 7 | 1 | 6 |
14-07-2021 | 7 | 1 | 7 | 25-07-2021 | 7 | 1 | 7 |
19-09-2021 | 7 | 1 | 8 | 28-08-2021 | 7 | 1 | 8 |
25-12-2021 | 7 | 1 | 9 | 10-09-2021 | 7 | 1 | 9 |
30-12-2021 | 7 | 1 | 10 | 12-09-2021 | 7 | 1 | 10 |
10-02-2022 | 7 | 1 | 11 | 25-04-2022 | 7 | 1 | 11 |
15-03-2022 | 7 | 1 | 12 | Nan | Nan | Nan | Nan |
15-03-2022 | 7 | 1 | 13 | Nan | Nan | Nan | Nan |
14-06-2022 | 7 | 1 | 14 | Nan | Nan | Nan | Nan |
--**Now i have to use date condition also to get the required output
--EXPECTED OUTPUT
date_buy | id | qty_buy | rolling_sum_qty_buy | date_sold | id | qty_sold | rolling_sum_qty_sold |
---|---|---|---|---|---|---|---|
30-07-2019 | 7 | 1 | 1 | 01-08-2019 | 7 | 1 | 1 |
20-10-2019 | 7 | 1 | 2 | 27-12-2019 | 7 | 1 | 3 |
17-01-2020 | 7 | 1 | 3 | 01-02-2020 | 7 | 1 | 4 |
15-02-2020 | 7 | 1 | 4 | 25-07-2021 | 7 | 1 | 6 |
15-02-2020 | 7 | 1 | 5 | 25-07-2021 | 7 | 1 | 7 |
15-02-2020 | 7 | 1 | 6 | 28-08-2021 | 7 | 1 | 8 |
14-07-2021 | 7 | 1 | 7 | 10-09-2021 | 7 | 1 | 9 |
19-09-2021 | 7 | 1 | 8 | 25-04-2022 | 7 | 1 | 11 |
25-12-2021 | 7 | 1 | 9 | NaN | NaN | NaN | NaN |
30-12-2021 | 7 | 1 | 10 | NaN | NaN | NaN | NaN |
10-02-2022 | 7 | 1 | 11 | NaN | NaN | NaN | NaN |
15-03-2022 | 7 | 1 | 12 | NaN | NaN | NaN | NaN |
15-03-2022 | 7 | 1 | 13 | NaN | NaN | NaN | NaN |
14-06-2022 | 7 | 1 | 14 | NaN | NaN | NaN | NaN |
-- Please help me to get the following output. Any help would be appreciated. We can also use any loop or can also define any custom function or can create any extra column if required to get desired output.
Upvotes: 0
Views: 79
Reputation: 2887
I cannot get the complete answer, but it might help a little:
pd.merge_asof
is essentially merging on the closest datetime. If you format you DataFrame
as datetime64[ns]
df1["date_sold"] = pd.to_datetime(df1["date_sold"], format="%d-%m-%Y", exact=False)
df2["date_sold"] = pd.to_datetime(df2["date_sold"], format="%d-%m-%Y", exact=False)
you can merge them like this.
Out[32]:
date_buy id_x qty_buy rolling_sum_qty_buy date_sold id_y qty_sold rolling_sum_qty_sold
0 2019-07-30 7 1 1 2019-08-01 7.0 1.0 1.0
1 2019-10-20 7 1 2 2019-12-27 7.0 1.0 3.0
2 2020-01-17 7 1 3 2020-02-01 7.0 1.0 4.0
3 2020-02-15 7 1 4 2021-07-25 7.0 1.0 6.0
4 2020-02-15 7 1 5 2021-07-25 7.0 1.0 6.0
5 2020-02-15 7 1 6 2021-07-25 7.0 1.0 6.0
6 2021-07-14 7 1 7 2021-07-25 7.0 1.0 6.0
7 2021-09-19 7 1 8 2022-04-25 7.0 1.0 11.0
8 2021-12-25 7 1 9 2022-04-25 7.0 1.0 11.0
9 2021-12-30 7 1 10 2022-04-25 7.0 1.0 11.0
10 2022-02-10 7 1 11 2022-04-25 7.0 1.0 11.0
11 2022-03-15 7 1 12 2022-04-25 7.0 1.0 11.0
12 2022-03-15 7 1 13 2022-04-25 7.0 1.0 11.0
13 2022-06-14 7 1 14 NaT NaN NaN NaN
This will match every buy
with the first sold
. It is capable of e.g. skipping the sale on 2019-09-15
. However by definition it will do a left join and therefore just repeat elements if they are matching multiple. So instead of rolling_sum_qty_sold = 6, 7, 8, 9
it just gives 6, 6, 6, 6
and instead of 11, NaN, NaN, ...
it just keeps repeating 11
.
Hopefully this still helps you get on the right track :)
Upvotes: 1