Reputation: 13
I am trying to make a piece of my code run quicker.
I have two DataFrames of different sizes, A and B, and a dictionary of ages called age_dict.
A contains 100 rows and B contains 200 rows. They both use an index starting at 0, and both have two columns, "Name" and "Age".
The dictionary keys are names and the values are ages. All keys are unique; there are no duplicates:
{'John':20,'Max':25,'Jack':30}
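A scaled-down, reproducible version of my setup looks like this (the real frames are larger; these rows are just placeholders for illustration):
import pandas as pd

# small stand-ins for the real data described above
age_dict = {'John': 20, 'Max': 25, 'Jack': 30}
A = pd.DataFrame({'Name': ['John', 'Max', 'Jack'], 'Age': [0, 0, 0]})
B = pd.DataFrame({'Name': ['Max', 'Jack', 'John', 'Max'], 'Age': [0, 0, 0, 0]})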
I want to find the names in each DataFrame and assign them the age from the dictionary. I achieve this using the following code (I want to return a new DataFrame and not amend the old one):
def age(df):
    new_df = df.copy(deep=True)
    i = 0
    while i < len(new_df['Name']):
        name = new_df['Name'][i]
        age = age_dict[name]
        new_df['Age'][i] = age
        i += 1
    return new_df

new_A = age(A)
new_B = age(B)
This code takes longer than I would like, so I'm wondering if pandas has a faster way to do this than looping through each row?
Thank you!
Upvotes: 1
Views: 2639
Reputation: 862641
I think you need map:
import pandas as pd

A = pd.DataFrame({'Name':['John','Max','Joe']})
print (A)
   Name
0  John
1   Max
2   Joe

d = {'John':20,'Max':25,'Jack':30}

A1 = A.copy(deep=True)
A1['Age'] = A.Name.map(d)
print (A1)
   Name   Age
0  John  20.0
1   Max  25.0
2   Joe   NaN
If you need a function:
d = {'John':20,'Max':25,'Jack':30}

def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

new_A = age(A)
print (new_A)
   Name   Age
0  John  20.0
1   Max  25.0
2   Joe   NaN
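If names missing from the dictionary should keep their existing Age rather than becoming NaN, one option is to fill the unmapped rows back from the original column (a sketch, with a hypothetical helper name age_keep_existing):
def age_keep_existing(df):
    new_df = df.copy(deep=True)
    # map known names, fall back to the original Age for names not in d
    new_df['Age'] = new_df.Name.map(d).fillna(new_df['Age'])
    return new_df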
Timings:
In [191]: %timeit (age(A))
10 loops, best of 3: 21.8 ms per loop
In [192]: %timeit (jul(A))
10 loops, best of 3: 47.6 ms per loop
Code for timings:
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300000 rows x 2 columns]
A = pd.concat([A]*100000).reset_index(drop=True)
print (A)

d = {'John':20,'Max':25,'Jack':30}

def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

def jul(A):
    df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
    A1 = pd.merge(A, df, how='left')
    return A1
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300 rows x 2 columns]
A = pd.concat([A]*100).reset_index(drop=True)
In [194]: %timeit (age(A))
The slowest run took 5.22 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 742 µs per loop
In [195]: %timeit (jul(A))
The slowest run took 4.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.87 ms per loop
Upvotes: 3
Reputation: 15433
You can create another DataFrame from your dict and merge the two DataFrames on a common key:
import pandas as pd

d = {'John':20,'Max':25,'Jack':30}
A = pd.DataFrame({'Name':['John','Max','Joe']})

df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
A1 = pd.merge(A, df, how='left')
#    Name   Age
# 0  John  20.0
# 1   Max  25.0
# 2   Joe   NaN
Upvotes: 1