JCF
JCF

Reputation: 309

Defining a function to be used on a DataFrame using .apply()

I would like to define custom functions with signatures that includes one or several columns of the dataframe.

I tried to replicate the solution mentioned in article "How do I use Pandas 'apply' function to multiple columns?", but I fail to understand the way a function needs to be set up to accept data from other columns as input.

A sample of my code:

import pandas as pd

df=pd.DataFrame({'NAME':['A','B','C','D'],'HOURS':[38, 52, 1040, 28],'ROLE':['Manager','Expert','Expert','Expert']})

def apply_rate(col1='HOURS', col2='ROLE'):
    if row[col2]=='Manager': return row[col1]*165
    else: return row[col1]*135

df['TOTAL']=df.apply(lambda row: apply_rate(row['HOURS'],row['ROLE']),axis=1)

I get a message "KeyError: ('Manager', 'occurred at index 0')", but I am stuck at this stage, and I don't know how to get out of this blocking point.

Upvotes: 1

Views: 147

Answers (1)

jpp
jpp

Reputation: 164673

The trick is to remove the lambda altogether. Feed your function to pd.DataFrame.apply and, possibly, feed your additional function parameters to apply directly:

def apply_rate(row, col1, col2):
    if row[col2]=='Manager': return row[col1]*165
    else: return row[col1]*135

df['TOTAL'] = df.apply(apply_rate, axis=1, col1='HOURS', col2='ROLE')

print(df)

  NAME  HOURS     ROLE   TOTAL
0    A     38  Manager    6270
1    B     52   Expert    7020
2    C   1040   Expert  140400
3    D     28   Expert    3780

However, row-wise operations are inefficient and not a recommended use of Pandas. You can easily vectorise your algorithm with column-wise operations:

df['TOTAL'] = df['HOURS'] * np.where(df['ROLE'] == 'Manager', 165, 135)

An alternative, more easily extendable version can make use of a dictionary mapping:

factor_map = {'Manager': 165}
df['TOTAL'] = df['HOURS'] * df['ROLE'].map(factor_map).fillna(135)

Upvotes: 3

Related Questions