Reputation: 31

Applying a vectorized function with several returns to pandas dataframe

I have a dataframe that contains a column holding 'Log' strings. I'd like to create a new column based on the values I've parsed from the 'Log' column. Currently, I'm using .apply() with the following function:

def classification(row):
    if 'A' in row['Log']:
        return 'Situation A'
    elif 'B' in row['Log']:
        return 'Situation B'
    elif 'C' in row['Log']:
        return 'Situation C'
    return 'Check'

It is called like this:

df['Classification'] = df.apply(classification, axis=1)

The issue is that it takes a long time (~3 min for a dataframe with 4M rows), and I'm looking for a faster way. I saw some examples of users using vectorized functions that run much faster, but those don't have if statements in the function. My question: is it possible to vectorize the function above, and what is the fastest way to perform this task?

Upvotes: 2

Answers (1)

FBruzzesi

Reputation: 6495

I am not sure that using a nested numpy.where will improve performance. Here are some timings on 4M rows:

import numpy as np
import pandas as pd

ls = ['Abc', 'Bert', 'Colv', 'Dia']
df = pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})

df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                      np.where(df['Log'].str.contains('B'), 'Situation B',
                          np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))


def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


df['Log_apply'] = df['Log'].apply(classification)

Nested np.where Performance

%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B', np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Series.apply Performance

%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At least on my machine, the nested np.where is almost 10x slower than Series.apply.
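For readers who still want a vectorized form, one option that is not part of the timings above is np.select, which avoids the nesting and keeps the A, then B, then C priority. This is only a sketch on a tiny stand-in frame (the 'BA' row and the Log_select column name are my own additions), and I have not timed it. Since .str.contains is still called three times, it may well not beat Series.apply.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the 4M-row frame; 'BA' is added to show the priority order.
df = pd.DataFrame({'Log': ['Abc', 'Bert', 'Colv', 'Dia', 'BA']})

# One boolean mask per rule; regex=False asks for a plain substring test.
conditions = [
    df['Log'].str.contains('A', regex=False).to_numpy(dtype=bool),
    df['Log'].str.contains('B', regex=False).to_numpy(dtype=bool),
    df['Log'].str.contains('C', regex=False).to_numpy(dtype=bool),
]
choices = ['Situation A', 'Situation B', 'Situation C']

# np.select uses the first condition that is True for each row,
# which mirrors the if/elif order of classification().
df['Log_select'] = np.select(conditions, choices, default='Check')
print(df)
```

Whether this is worth it should be checked with %timeit on the real data, in the same way as the measurements above.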

A final remark: using the solution suggested in the comments, i.e. something like:

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')

has the following problems:

It won't necessarily be faster. Testing on my machine:

%timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The .extract method follows string order, i.e. from the string 'AB' it will extract 'A' and from 'BA' it will extract 'B'. The OP's classification function, on the other hand, checks in a fixed hierarchical order (A, then B, then C), so it returns 'Situation A' in both cases.
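The ordering difference can be reproduced on a three-row example. This is a minimal sketch: the demo series and the variable names via_apply and via_extract are made up for illustration, and expand=False is used so that .str.extract returns a Series.

```python
import pandas as pd

def classification(x):
    # Fixed priority: A is checked before B, and B before C.
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}

demo = pd.Series(['AB', 'BA', 'xyz'])

via_apply = demo.apply(classification)
# The regex returns whichever of A, B, C appears first in the string.
via_extract = demo.str.extract('(A|B|C)', expand=False).map(d).fillna('Check')

print(pd.DataFrame({'Log': demo, 'apply': via_apply, 'extract': via_extract}))
```

The two columns agree for 'AB' and 'xyz' but differ for 'BA': apply gives 'Situation A', while the extract-based version gives 'Situation B'.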

Upvotes: 2
