Reputation: 51
I'm wondering if there is a more efficient way to do an "index & match" type function that is popular in excel. For example - given two pandas DataFrames, update the df_1 with information found in df_2:
import pandas as pd
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".
Below is the most efficient method that I could come up with. There has to be a better way!
name_a = []
name_b = []
for i in range(len(df_1)):
name_a.append(df_2.name.iloc[df_2[
df_2.num == df_1.num_a.iloc[i]].index[0]])
name_b.append(df_2.name.iloc[df_2[
df_2.num == df_1.num_b.iloc[i]].index[0]])
df_1['name_a'] = name_a
df_1['name_b'] = name_b
Resulting in:
>>> df_1.head()
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
Upvotes: 5
Views: 19094
Reputation: 641
I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic vlookup. You can simulate this in pandas by using Series.map.
name_map = dict(df_2.set_index('num').name)
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)
df_1
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
Upvotes: 3
Reputation: 294488
High Level
replace
replace
, rename
columns, and join
m = dict(zip(
df_2.num.values.tolist(),
df_2.name.values.tolist()
))
df_1.join(
df_1.replace(m).rename(
columns=lambda x: x.replace('num', 'name')
)
)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
Breakdown
replace
with a dictionary should be pretty quick. There are bunch of ways to build a dictionary form df_2
. As a matter of fact we could have used a pd.Series
. I chose to build with dict
and zip
because I find that it's faster.
Building m
Option 1
m = df_2.set_index('num').name
Option 2
m = df_2.set_index('num').name.to_dict()
Option 3
m = dict(zip(df_2.num, df_2.name))
Option 4 (My Choice)
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
m
build times
1000 loops, best of 3: 325 µs per loop
1000 loops, best of 3: 376 µs per loop
10000 loops, best of 3: 32.9 µs per loop
100000 loops, best of 3: 10.4 µs per loop
%timeit df_2.set_index('num').name
%timeit df_2.set_index('num').name.to_dict()
%timeit dict(zip(df_2.num, df_2.name))
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
Replacing num
Again, we have choices, here are a few and their times.
%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 959 µs per loop
1000 loops, best of 3: 925 µs per loop
I choose...
df_1.replace(m)
num_a num_b
0 a b
1 b d
2 c a
3 d b
4 5 c
Rename columns
df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))
name_a name_b <-- note the column name change
0 a b
1 b d
2 c a
3 d b
4 5 c
Join
df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
Upvotes: 5
Reputation: 341
Just try a conditional statement:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
df_1["name_a"] = df_2["num_b"]
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1])
print(df_1)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
Upvotes: 0