Neil

Reputation: 8247

How to call a function in apply in pandas

I have the following dataframe in pandas:

code     job_descr               job_type     
123      sales executive         nan
124      data scientist          nan
145      marketing manager       nan
132      finance                 nan
144      data analyst            nan

I want to classify job_descr into job_type as follows:

sales : Sales
marketing : Marketing
finance : Finance
data science : Analytics
analyst : Analytics

I am doing the following in pandas:

def job_type_redifine(column_name):
   if column_name.str.contains('sales'):
       return 'Sales'
   elif column_name.str.contains('marketing'):
       return 'Marketing'
   elif column_name.str.contains('data science|data scientist|analyst|machine learning'):
       return 'Analytics'
   else:
       return 'Others'


final_df['job_type'] = final_df.apply(lambda row: 
                       job_type_redifine(row['job_descr']), axis=1)

Desired dataframe

code     job_descr               job_type     
123      sales executive         Sales
124      data scientist          Analytics
145      marketing manager       Marketing
132      finance                 Finance
144      data analyst            Analytics

Upvotes: 2

Views: 59

Answers (1)

jezrael

Reputation: 862611

The first solution uses numpy.select with Series.str.contains. Its advantage is that it can cope with missing values (pass na=False to str.contains so the masks stay boolean), but it is slower:

import numpy as np

# One boolean mask per category; np.select picks the first mask that is True
m1 = final_df['job_descr'].str.contains('sales')
m2 = final_df['job_descr'].str.contains('marketing')
m3 = final_df['job_descr'].str.contains('data science|data scientist|analyst|machine learning')

final_df['job_type'] = np.select([m1, m2, m3], 
                                 ['Sales','Marketing','Analytics'], default='Others')

print(final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics
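
If job_descr can contain missing values, str.contains returns NaN for those rows, so the masks are no longer purely boolean. A minimal sketch of one way to handle that, assuming a small made-up frame (demo_df is a hypothetical name, not from the question): passing na=False makes missing rows fall through to the default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing description
demo_df = pd.DataFrame({
    'code': [123, 124, 145, 150],
    'job_descr': ['sales executive', 'data scientist', 'marketing manager', np.nan],
})

# na=False makes a missing value count as "no match" in every mask
m1 = demo_df['job_descr'].str.contains('sales', na=False)
m2 = demo_df['job_descr'].str.contains('marketing', na=False)
m3 = demo_df['job_descr'].str.contains(
    'data science|data scientist|analyst|machine learning', na=False)

# Rows that match no mask, including the missing one, get the default
demo_df['job_type'] = np.select([m1, m2, m3],
                                ['Sales', 'Marketing', 'Analytics'],
                                default='Others')
print(demo_df)
```

With this change the row with the missing description is labelled Others instead of propagating NaN into the conditions.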

The second solution uses Series.apply and tests each value with the in operator. It loops over the values in Python, yet it is faster because the pandas text functions are slow. The disadvantage is that the last condition is a bit complicated, with many or clauses:

def job_type_redifine(column_name):
    if 'sales' in column_name:
        return 'Sales'
    elif 'marketing' in column_name:
        return 'Marketing'
    elif ('data science' in column_name or 'data scientist' in column_name
          or 'analyst' in column_name or 'machine learning' in column_name):
        return 'Analytics'
    else:
        return 'Others'


final_df['job_type'] = final_df['job_descr'].apply(job_type_redifine)
print(final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics
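
The chain of in checks could also be written as a table of rules, which makes it easy to add the finance : Finance mapping from the question (the outputs above leave that row as Others). This is only a sketch under stated assumptions: RULES and classify_job are hypothetical names, and non-string values such as NaN are assumed to map to 'Others'.

```python
import pandas as pd

# Ordered rules: the first category whose keyword appears in the text wins.
# Keywords are taken from the question and from the function above.
RULES = [
    ('Sales', ('sales',)),
    ('Marketing', ('marketing',)),
    ('Finance', ('finance',)),
    ('Analytics', ('data science', 'data scientist', 'analyst', 'machine learning')),
]

def classify_job(descr):
    # Assumption: missing or non-string values are labelled 'Others'
    if not isinstance(descr, str):
        return 'Others'
    for job_type, keywords in RULES:
        if any(keyword in descr for keyword in keywords):
            return job_type
    return 'Others'

# Sample data copied from the question
final_df = pd.DataFrame({
    'code': [123, 124, 145, 132, 144],
    'job_descr': ['sales executive', 'data scientist', 'marketing manager',
                  'finance', 'data analyst'],
})
final_df['job_type'] = final_df['job_descr'].apply(classify_job)
print(final_df)
```

Because the rules are checked in order, a description that contains keywords from two categories gets the one listed first, the same as in the if/elif version.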

Performance:

#[5000 rows x 3 columns]
final_df = pd.concat([final_df] * 1000, ignore_index=True)

In [13]: %%timeit
    ...: m1 = final_df['job_descr'].str.contains('sales')
    ...: m2 = final_df['job_descr'].str.contains('marketing')
    ...: m3 = final_df['job_descr'].str.contains('data science|data scientist|analyst|machine learning')
    ...: 
    ...: final_df['job_type'] = np.select([m1, m2, m3], ['Sales','Marketing','Analytics'], default='Others')
    ...: 
12.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %%timeit 
    ...: final_df['job_type1'] =  final_df['job_descr'].apply(job_type_redifine)
    ...: 
1.95 ms ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
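
To repeat the comparison outside IPython, something like the following could be used. It is a rough sketch with the standard timeit module; base, big_df, with_select and with_apply are hypothetical names, and the timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data from the question, repeated to 5000 rows as in the timing above
base = pd.DataFrame({
    'code': [123, 124, 145, 132, 144],
    'job_descr': ['sales executive', 'data scientist', 'marketing manager',
                  'finance', 'data analyst'],
})
big_df = pd.concat([base] * 1000, ignore_index=True)

def job_type_redifine(column_name):
    if 'sales' in column_name:
        return 'Sales'
    elif 'marketing' in column_name:
        return 'Marketing'
    elif ('data science' in column_name or 'data scientist' in column_name
          or 'analyst' in column_name or 'machine learning' in column_name):
        return 'Analytics'
    else:
        return 'Others'

def with_select(df):
    # Vectorised string masks combined with np.select
    m1 = df['job_descr'].str.contains('sales')
    m2 = df['job_descr'].str.contains('marketing')
    m3 = df['job_descr'].str.contains(
        'data science|data scientist|analyst|machine learning')
    return np.select([m1, m2, m3], ['Sales', 'Marketing', 'Analytics'],
                     default='Others')

def with_apply(df):
    # Plain Python substring checks applied to each value
    return df['job_descr'].apply(job_type_redifine)

# Both approaches should label every row identically before they are timed
same = list(with_select(big_df)) == list(with_apply(big_df))
print('same result:', same)

for func in (with_select, with_apply):
    seconds = timeit.timeit(lambda: func(big_df), number=20)
    print(f'{func.__name__}: {seconds / 20 * 1000:.2f} ms per call')
```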

Upvotes: 1
