Hold The Door

Reputation: 39

How to subtract two dataframes with duplicate first column?

So, I have the following two dataframes, and my ideal output is open_orders reduced by cancel_orders, so that I know how many open orders I actually have.

Desired Output:

df_total_orders
           order_id business_symbol  open_orders
0      a1b2c3111111              AA          0.0
1      4kl3l2242244             AAA          0.0
2      21312a224144             BBB          0.0
3      124f32rv2323             CCC          0.0
4      2412dfe21edf             DDD          0.0
[5555 rows x 3 columns]

What I have now:

df_add_orders (this dataframe contains duplicates in the order id column)
           order_id business_symbol  open_orders
0      a1b2c3111111              AA        100.0
1      4kl3l2242244             AAA       1000.0
2      21312a224144             BBB       1000.0
3      124f32rv2323             CCC       1000.0
4      2412dfe21edf             DDD       1000.0
[10000 rows x 3 columns]
df_cancel_orders (this dataframe contains duplicates in the order id column)
           order_id   cancel_orders
0      a1b2c3111111           100.0
1      4kl3l2242244          1000.0
2      21312a224144          1000.0
3      124f32rv2323          1000.0
4      2412dfe21edf          1000.0
[4000 rows x 2 columns]
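
For reference, the two input frames can be recreated roughly like this (only the rows shown above; my real frames are much larger and do contain duplicate order_id rows):

import pandas as pd

df_add_orders = pd.DataFrame({
    'order_id': ['a1b2c3111111', '4kl3l2242244', '21312a224144',
                 '124f32rv2323', '2412dfe21edf'],
    'business_symbol': ['AA', 'AAA', 'BBB', 'CCC', 'DDD'],
    'open_orders': [100.0, 1000.0, 1000.0, 1000.0, 1000.0],
})

df_cancel_orders = pd.DataFrame({
    'order_id': ['a1b2c3111111', '4kl3l2242244', '21312a224144',
                 '124f32rv2323', '2412dfe21edf'],
    'cancel_orders': [100.0, 1000.0, 1000.0, 1000.0, 1000.0],
})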

I am using the following groupby to get the total open orders, but it removes the symbols from the output.

df_add_orders_group = df_add_orders.groupby(['order_id'], as_index=False)['open_orders'].sum()

df_add_orders_group
           order_id        open_orders
0      a1b2c3111111              110.0
1      4kl3l2242244             1200.0
2      21312a224144             1500.0
3      124f32rv2323             1500.0
4      2412dfe21edf             1500.0
[5000 rows x 2 columns]

I am then grouping the cancelled orders the same way so I can subtract them:

df_cancel_orders_group = df_cancel_orders.groupby(['order_id'], as_index=False)['cancel_orders'].sum()

but I want to keep the symbols so I can compare the total open orders. I need some way to merge the result of the subtraction back into the main df_add_orders, and also to clean up order_id by summing its duplicates.

Upvotes: 0

Views: 223

Answers (2)

Acccumulation

Reputation: 3591

df_total_orders = df_add_orders.merge(df_cancel_orders,
    how='left',
    on='order_id')

will get you a dataframe with the data from the two original dataframes. You can then do

df_total_orders['open_orders'] = (
    df_total_orders['open_orders'] -
    df_total_orders['cancel_orders']
)

and then drop the cancel_orders column.
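
Spelled out end to end, that might look like the following (the fillna(0) is an extra precaution, not part of the answer as written: a left join leaves NaN in cancel_orders for any order_id that was never cancelled, and NaN would propagate through the subtraction):

df_total_orders = df_add_orders.merge(df_cancel_orders, how='left', on='order_id')

# Assumed precaution: order_ids with no cancel rows come through the left join
# with NaN, so treat them as zero cancels before subtracting
df_total_orders['cancel_orders'] = df_total_orders['cancel_orders'].fillna(0)

df_total_orders['open_orders'] = (
    df_total_orders['open_orders'] - df_total_orders['cancel_orders']
)

df_total_orders = df_total_orders.drop(columns='cancel_orders')

Note that this only spells out the steps above; with duplicate order_ids in both frames you would still want to group and sum first, as in the other answer below.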

Another tactic would be to append the two original dataframes together. You can then do groupby over id and sum. This will create a dataframe that has, for each id, one column that is the sum of all open orders for that id, and another column that is the sum of all cancel orders for that id. You can then take the difference between the two columns.

When you append the two dataframes, you'll get a bunch of nulls; the portion from the open orders will have nulls in the cancel order column, and vice versa. I think summing over the groupby will just treat the nulls as zeros, but you might need to explicitly tell it to do so.
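
A rough sketch of that second approach, using pd.concat (the current replacement for the deprecated DataFrame.append) and an explicit fillna(0) for the nulls mentioned above:

import pandas as pd

# Stack the two frames; add-order rows get NaN in cancel_orders and
# cancel-order rows get NaN in open_orders (and business_symbol)
combined = pd.concat([df_add_orders, df_cancel_orders], ignore_index=True)

# Sum per order_id; 'first' keeps the first non-null symbol for each id
totals = (combined.fillna({'open_orders': 0, 'cancel_orders': 0})
                  .groupby('order_id', as_index=False)
                  .agg({'business_symbol': 'first',
                        'open_orders': 'sum',
                        'cancel_orders': 'sum'}))

# Difference between the two summed columns, then drop the helper column
totals['open_orders'] = totals['open_orders'] - totals['cancel_orders']
df_total_orders = totals.drop(columns='cancel_orders')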

Upvotes: 1

not_speshal

Reputation: 23146

Try with pandas.merge and groupby:

# Collapse the duplicate order_ids first (keep the symbol, sum the open orders),
# then line the two grouped frames up on order_id
merged = pd.merge(df_add_orders.groupby("order_id", as_index=False)
                               .agg({"business_symbol": "first",
                                     "open_orders": "sum"}),
                  df_cancel_orders.groupby("order_id", as_index=False).sum(),
                  on="order_id")

# Subtract the cancelled orders and drop the helper column
output = merged.assign(open_orders=merged["open_orders"]-merged["cancel_orders"]) \
               .drop("cancel_orders", axis=1)

>>> output
       order_id business_symbol  open_orders
0  124f32rv2323             CCC          0.0
1  21312a224144             BBB          0.0
2  2412dfe21edf             DDD          0.0
3  4kl3l2242244             AAA          0.0
4  a1b2c3111111              AA          0.0

Upvotes: 0
