doomdaam
doomdaam

Reputation: 783

Pandas merge two dataframes on column with different length

I am working on a trading algo but I have some issues when trying to combine the buy_orders and sell_orders dataframes to a single dataframe, orders.

The issue shows it self on the buy_order date 2021-01-21 where a recommendation was made by my algo to buy, but this has no sell order yet as the signal hasn't been spotted yet, therefore these should be NAN when merged. If I was to join on index the sell order would be the sell order for a different stock from the sell_orders dataframe.

buy_orders dataframe

    Date_buy    Name    Stock_Price_buy Order
26  2020-07-30  AAPL    96.19   buy
27  2020-09-30  AAPL    115.81  buy
28  2020-11-05  AAPL    119.03  buy
29  2020-11-30  AAPL    119.05  buy
30  2021-01-21  AAPL    136.87  buy
31  2020-10-11  ABBV    21.21   buy

sell_orders dataframe

    Date_sell   Name    Stock_Price_sell    Order
25  2020-07-20  AAPL    98.36   sell
26  2020-09-02  AAPL    131.40  sell
27  2020-10-20  AAPL    117.51  sell
28  2020-11-20  AAPL    117.34  sell
29  2021-01-04  AAPL    129.41  sell
30  2020-10-15  ABBV    24.23   sell

Ideal result would be the orders dataframe as demonstrated below.

Index Buy_date  Name_x  Stock_Price_buy Order_x Sell_date   Name_y  Stock_Price_buy Order_y
26  2020-07-30  AAPL    96.19           buy     2020-09-02  AAPL    131.40          sell
27  2020-09-30  AAPL    115.81          buy     2020-10-20  AAPL    117.51          sell
28  2020-11-05  AAPL    119.03          buy     2020-11-20  AAPL    117.34          sell
29  2020-11-30  AAPL    119.05          buy     2021-01-04  AAPL    129.41          sell
30  2021-01-21  AAPL    136.87          buy     NaN         NaN     NaN             NaN

Here's how the orders dataframe looks like now when buy_orders.Name_x and sell_orders.Name_y are different for the first time. ABBV sell_order should have been NANs

28  2020-11-05  AAPL    119.03  buy 2020-11-20  AAPL    117.34  sell
29  2020-11-30  AAPL    119.05  buy 2021-01-04  AAPL    129.41  sell
30  2021-01-21  AAPL    136.87  buy 2018-05-24  ABBV    24.23   sell

Upvotes: 1

Views: 191

Answers (2)

Rick M
Rick M

Reputation: 1012

Assuming your data is structured so that it always starts with a buy and alternates with sell orders, and only has one transaction per day, and is always one sized lot per transaction... you can use pd.concat. I made a simple dataframe that is kind of like yours (and in the future, it makes it easier if you include code to make a sample dataframe as part of your question):

buy_orders = pd.DataFrame.from_dict({'Date_buy': [ pd.to_datetime('2020-11-01'), pd.to_datetime('2020-11-03'), 
                                           pd.to_datetime('2020-11-05'), pd.to_datetime('2020-11-08'), 
                                           pd.to_datetime('2020-11-10')], 
                                 'Order' : ['B','B','B','B','B'],
                                 'Name' : ['AAPL','AAPL','AAPL','AAPL','ABBV'],
                                 'Stock_Price_buy' : [1,2,3,4,5.0]})

sell_orders = pd.DataFrame.from_dict({'Date_sell': [ pd.to_datetime('2020-11-02'), pd.to_datetime('2020-11-04'), 
                                           pd.to_datetime('2020-11-06'), pd.to_datetime('2020-11-12'), 
                                           pd.to_datetime('2020-11-22')], 
                                 'Order' : ['S','S','S','S','S'],
                                 'Name' : ['AAPL','AAPL','AAPL','ABBV','ABBV'],
                                 'Stock_Price_sell' : [23,24,25,26,5.0]})

You can first stack the two dataframes and sort them by date and ticker (after normalizing the column names):

buy_orders = buy_orders.rename(columns={'Date_buy' : "Date", "Stock_Price_buy" : "Price"})
sell_orders = sell_orders.rename(columns={'Date_sell' : "Date", "Stock_Price_sell" : "Price"})

df = pd.concat([buy_orders, sell_orders])
df = df.sort_values(['Date','Order']).reset_index(drop=True)

...then making a copy of the dataframe (changing the column names to keep them distinct in the later concat step):

df2 = df.copy()
df2.columns = [f"{c}_sell" for c in df.columns]

You then concatenate the two dataframes next to each other, but with a .shift(-1) on the second one so that they're offset

df3 = pd.concat([df, df2.shift(-1)], axis=1)

Finally, you can clean up the junk rows:

cut = ( df3.Name != df3.Name_sell)

import numpy as np 
df3.loc[cut, 'Date_sell'] = np.nan
df3.loc[cut, 'Order_sell'] = np.nan
df3.loc[cut, 'Price_sell'] = np.nan

df3 = df3.drop(columns='Name_sell')

df3 = df3[df3.Order!="S"].reset_index(drop=True).copy()

That gives you something like

        Date Order  Name  Price  Date_sell Order_sell  Price_sell
0 2020-11-01     B  AAPL    1.0 2020-11-02          S        23.0
1 2020-11-03     B  AAPL    2.0 2020-11-04          S        24.0
2 2020-11-05     B  AAPL    3.0 2020-11-06          S        25.0
3 2020-11-08     B  AAPL    4.0        NaT        NaN         NaN
4 2020-11-10     B  ABBV    5.0 2020-11-12          S        26.0

You don't have to make all the intermediate dataframes, etc, but I left the code that way here so that if you paste things in a notebook you can look at the steps.

Upvotes: 0

wwnde
wwnde

Reputation: 26676

Have you thought of join but then declaring suffixes as follows?.

buy_orders.join(sell_orders,lsuffix='_buy', rsuffix='_sell')

        Date_buy Name_buy  Stock_Price_buy Order_buy   Date_sell Name_sell  \
26  2020-07-30     AAPL            96.19       buy  2020-09-02      AAPL   
27  2020-09-30     AAPL           115.81       buy  2020-10-20      AAPL   
28  2020-11-05     AAPL           119.03       buy  2020-11-20      AAPL   
29  2020-11-30     AAPL           119.05       buy  2021-01-04      AAPL   
30  2021-01-21     AAPL           136.87       buy         NaN       NaN   

    Stock_Price_sell Order_sell  
26            131.40       sell  
27            117.51       sell  
28            117.34       sell  
29            129.41       sell  
30               NaN        NaN  

Upvotes: 2

Related Questions