Reputation: 13
I am trying to make a piece of my code run quicker.
I have two DataFrames of different sizes, A and B, and a dictionary of ages called age_dict.
A contains 100 rows and B contains 200 rows. They both use an index starting at 0, and both have two columns, "Name" and "Age".
The dictionary keys are names and the values are ages. All keys are unique; there are no duplicates:
{'John':20,'Max':25,'Jack':30}
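A scaled-down, reproducible version of my setup looks like this (the real frames are larger; these rows are just placeholders for illustration):
import pandas as pd

# small stand-ins for the real data described above
age_dict = {'John': 20, 'Max': 25, 'Jack': 30}
A = pd.DataFrame({'Name': ['John', 'Max', 'Jack'], 'Age': [0, 0, 0]})
B = pd.DataFrame({'Name': ['Max', 'Jack', 'John', 'Max'], 'Age': [0, 0, 0, 0]})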
I want to find the names in each DataFrame and assign them the age from the dictionary. I achieve this using the following code (I want to return a new DataFrame and not amend the old one):
def age(df):
    new_df = df.copy(deep=True)
    i = 0
    while i < len(new_df['Name']):
        name = new_df['Name'][i]
        age = age_dict[name]
        new_df['Age'][i] = age
        i += 1
    return new_df

new_A = age(A)
new_B = age(B)
This code takes longer than I would like, so I'm wondering if pandas has a faster way to do this than looping through each row?
Thank you!
Upvotes: 1
Views: 2639
Reputation: 862641
I think you need map:
import pandas as pd

A = pd.DataFrame({'Name':['John','Max','Joe']})
print (A)
   Name
0  John
1   Max
2   Joe

d = {'John':20,'Max':25,'Jack':30}

A1 = A.copy(deep=True)
A1['Age'] = A.Name.map(d)
print (A1)
   Name   Age
0  John  20.0
1   Max  25.0
2   Joe   NaN
If you need a function:
d = {'John':20,'Max':25,'Jack':30}

def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

new_A = age(A)
print (new_A)
   Name   Age
0  John  20.0
1   Max  25.0
2   Joe   NaN
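If names missing from the dictionary should keep their existing Age rather than becoming NaN, one option is to fill the unmapped rows back from the original column (a sketch, with a hypothetical helper name age_keep_existing):
def age_keep_existing(df):
    new_df = df.copy(deep=True)
    # map known names, fall back to the original Age for names not in d
    new_df['Age'] = new_df.Name.map(d).fillna(new_df['Age'])
    return new_df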
Timings:
In [191]: %timeit (age(A))
10 loops, best of 3: 21.8 ms per loop
In [192]: %timeit (jul(A))
10 loops, best of 3: 47.6 ms per loop
Code for timings:
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300000 rows x 2 columns]
A = pd.concat([A]*100000).reset_index(drop=True)
print (A)

d = {'John':20,'Max':25,'Jack':30}

def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

def jul(A):
    df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
    A1 = pd.merge(A, df, how='left')
    return A1
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300 rows x 2 columns]
A = pd.concat([A]*100).reset_index(drop=True)
In [194]: %timeit (age(A))
The slowest run took 5.22 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 742 µs per loop
In [195]: %timeit (jul(A))
The slowest run took 4.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.87 ms per loop
Upvotes: 3
Reputation: 15433
You can create another DataFrame from your dict and merge the two DataFrames on a common key:
import pandas as pd

d = {'John':20,'Max':25,'Jack':30}
A = pd.DataFrame({'Name':['John','Max','Joe']})

df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
A1 = pd.merge(A, df, how='left')
#    Name   Age
# 0  John  20.0
# 1   Max  25.0
# 2   Joe   NaN
Upvotes: 1