Reputation: 93
I have two dataframe as below
df1 df2
A A C
A1 A1 C1
A2 A2 C2
A3 A3 C3
A1 A4 C4
A2
A3
A4
The values of column 'A' are defined in df2 in column 'C'. I want to add a new column to df1 with column B with its value from df2 column 'C'
The final df1 should look like this
df1
A B
A1 C1
A2 C2
A3 C3
A1 C1
A2 C2
A3 C3
A4 C4
I can loop over df2 and add the value to df1 but its time consuming as the data is huge.
for index, row in df2.iterrows():
df1.loc[df1.A.isin([row['A']]), 'B']= row['C']
Can someone help me to understand how can I solve this without looping over df2.
Thanks
Upvotes: 2
Views: 2313
Reputation: 221574
Based on searchsorted
method, here are three approaches with different indexing schemes -
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
Upvotes: 2
Reputation: 862731
You can use map
by Series
:
df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
It is same as map
by dict
:
d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
df1['B'] = df1.A.map(d)
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Timings:
len(df1)=7
:
In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop
In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop
len(df1)=70k
:
In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop
In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop
Upvotes: 4
Reputation: 394061
IIUC you can just merge and rename the col
df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C4','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged
Out[103]:
A B
0 A1 C1
1 A2 C2
2 A3 C4
3 A1 C1
4 A2 C2
5 A3 C4
6 A4 C4
Upvotes: 2