Vinay Gudisagar
Vinay Gudisagar

Reputation: 93

Compare Pandas dataframes and add column

I have two dataframe as below

df1     df2 
A       A   C
A1      A1  C1
A2      A2  C2
A3      A3  C3
A1      A4  C4
A2          
A3          
A4          

The values of column 'A' are defined in df2 in column 'C'. I want to add a new column to df1 with column B with its value from df2 column 'C'

The final df1 should look like this

df1
A   B
A1  C1
A2  C2
A3  C3
A1  C1
A2  C2
A3  C3
A4  C4

I can loop over df2 and add the value to df1 but its time consuming as the data is huge.

    for index, row in df2.iterrows():
           df1.loc[df1.A.isin([row['A']]), 'B']= row['C']

Can someone help me to understand how can I solve this without looping over df2.

Thanks

Upvotes: 2

Views: 2313

Answers (3)

Divakar
Divakar

Reputation: 221574

Based on searchsorted method, here are three approaches with different indexing schemes -

df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]

Upvotes: 2

jezrael
jezrael

Reputation: 862731

You can use map by Series:

df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
    A   B
0  A1  C1
1  A2  C2
2  A3  C3
3  A1  C1
4  A2  C2
5  A3  C3
6  A4  C4

It is same as map by dict:

d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}

df1['B'] = df1.A.map(d)
print (df1)
    A   B
0  A1  C1
1  A2  C2
2  A3  C3
3  A1  C1
4  A2  C2
5  A3  C3
6  A4  C4

Timings:

len(df1)=7:

In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop

In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop

len(df1)=70k:

In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop

In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop

Upvotes: 4

EdChum
EdChum

Reputation: 394061

IIUC you can just merge and rename the col

df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})

In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C4','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged

Out[103]:
    A   B
0  A1  C1
1  A2  C2
2  A3  C4
3  A1  C1
4  A2  C2
5  A3  C4
6  A4  C4

Upvotes: 2

Related Questions