Ni_Tempe
Ni_Tempe

Reputation: 307

How merge works in pandas

I posted a questions earlier and got the answer.

Could you explain how the join final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0) works? I am getting the correct results, but I don't understand how the join happens. There are no common columns between df and temp_df.

Working code as below:

    d = {'emp': ['a',   'a',    'a',    'a',    'b',    'b',    'b',    'c',    'c',    'c',    'c' ], 
     'date': ['1',  '1',    '1',    '1',    '2',    '2',    '2',    '3',    '3',    '3',    '3' ], 
     'usd':[1,  2,  3,  4,  5,  6,  7,  8,  9,  10,     11 ], 
     'expense type':['Car Mileage',     'Car Rental',   'Car Rental - Gas',     'food',     'Car Rental',   'Car Rental - Gas',     'food',     'Car Mileage',  'Car Rental',   'food',     'wine' ],
     'zflag':['1',  '1', '1',   ' ',' ',' ',' ','2','2',' ',' ' ]
     }

    df = pd.DataFrame(data=d)



        df
    Out[253]: 
       date emp      expense type  usd zflag
    0     1   a       Car Mileage    1     1
    1     1   a        Car Rental    2     1
    2     1   a  Car Rental - Gas    3     1
    3     1   a              food    4      
    4     2   b        Car Rental    5      
    5     2   b  Car Rental - Gas    6      
    6     2   b              food    7      
    7     3   c       Car Mileage    8     2
    8     3   c        Car Rental    9     2
    9     3   c              food   10      
    10    3   c              wine   11   


temp_df = df.groupby(["emp", "date"], axis=0)["expense type"].apply(lambda x: 1 if "Car Mileage" in x.values and any([k in x.values for k in ["Car Rental", "Car Rental - Gas"]]) else 0).rename("zzflag")
temp_df = temp_df.loc[temp_df!=0,:].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)

Update 1:

temp_df doesn't have an index and it is a series. So I am not sure how the join can occur on the index as suggested in the comment.

temp_df
Out[335]: 
emp  date
a    1       1
c    3       2
Name: zzflag, dtype: int64

Upvotes: 1

Views: 135

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

pd.merge without on or index parameters is joining on common column names:

Per Docs in pandas API on pd.merge look at 'on' parameter:

on : label or list Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

d = {'emp': ['a',   'a',    'a',    'a',    'b',    'b',    'b',    'c',    'c',    'c',    'c' ], 
     'date': ['1',  '1',    '1',    '1',    '2',    '2',    '2',    '3',    '3',    '3',    '3' ], 
     'usd':[1,  2,  3,  4,  5,  6,  7,  8,  9,  10,     11 ], 
     'expense type':['Car Mileage',     'Car Rental',   'Car Rental - Gas',     'food',     'Car Rental',   'Car Rental - Gas',     'food',     'Car Mileage',  'Car Rental',   'food',     'wine' ],
     'zflag':['1',  '1', '1',   ' ',' ',' ',' ','2','2',' ',' ' ]
     }

df = pd.DataFrame(data=d)
temp_df = df.groupby(["emp", "date"], axis=0)["expense type"].apply(lambda x: 1 if "Car Mileage" in x.values and any([k in x.values for k in ["Car Rental", "Car Rental - Gas"]]) else 0).rename("zzflag")
temp_df = temp_df.loc[temp_df!=0,:].cumsum()
a = temp_df.reset_index()

all(pd.merge(df, a) == pd.merge(df, a, on=['emp','date']))

Output:

True

Upvotes: 1

Related Questions