Reputation: 31
I have a dataframe that contains a column holding 'Log' strings.
I'd like to create a new column based on the values I've parsed from the 'Log' column.
Currently, I'm using .apply() with the following function:
def classification(row):
    if 'A' in row['Log']:
        return 'Situation A'
    elif 'B' in row['Log']:
        return 'Situation B'
    elif 'C' in row['Log']:
        return 'Situation C'
    return 'Check'
The call looks like this:
df['Classification'] = df.apply(classification, axis=1)
The issue is that it takes a long time (about 3 minutes for a dataframe with 4M rows), and I'm looking for a faster way.
I saw some examples of users using vectorized functions that run much faster but those don't have if statements in the function.
My question: is it possible to vectorize the function above, and what is the fastest way to perform this task?
Upvotes: 2
Views: 199
Reputation: 6495
I am not sure that using a nested numpy.where will improve performance. Here is a performance test with 4M rows:
import numpy as np
import pandas as pd
ls = ['Abc', 'Bert', 'Colv', 'Dia']
df = pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})
df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                  np.where(df['Log'].str.contains('B'), 'Situation B',
                  np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'
df['Log_apply'] = df['Log'].apply(classification)
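Before comparing timings, it may be worth confirming that both approaches assign the same labels. The following is only a minimal sketch on a small hand-made sample; the names `sample`, `via_where`, `via_apply` and `same` are illustrative, and it assumes the default label is spelled 'Check' in both versions.

```python
import numpy as np
import pandas as pd

def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

# Hypothetical sample, including a string that contains more than one letter
sample = pd.DataFrame({'Log': ['Abc', 'Bert', 'Colv', 'Dia', 'BA', 'xyz']})

# Nested np.where: the outermost condition wins, so 'A' has priority over 'B' and 'C'
via_where = np.where(sample['Log'].str.contains('A'), 'Situation A',
            np.where(sample['Log'].str.contains('B'), 'Situation B',
            np.where(sample['Log'].str.contains('C'), 'Situation C', 'Check')))

# Element-wise apply on the Series
via_apply = sample['Log'].apply(classification)

# Compare the two results as plain Python lists
same = via_where.tolist() == via_apply.tolist()
print(same)
```

Note that 'Dia' falls through to 'Check' because the match is case-sensitive.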
Nested np.where Performance
%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B', np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
apply Performance
%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At least on my machine, the nested np.where is almost 10x slower than apply.
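%timeit is an IPython magic, so the comparison cannot be run as written in a plain Python script. A rough sketch of the same benchmark using the standard-library timeit module is shown here; the row count `n`, the seed, and the names `df_small`, `nested_where` and `series_apply` are arbitrary choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

# Fewer rows than the 4M used above so the sketch finishes quickly (assumed value)
n = 100_000
rng = np.random.default_rng(0)
df_small = pd.DataFrame({'Log': rng.choice(['Abc', 'Bert', 'Colv', 'Dia'], n)})

def nested_where():
    log = df_small['Log']
    return np.where(log.str.contains('A'), 'Situation A',
           np.where(log.str.contains('B'), 'Situation B',
           np.where(log.str.contains('C'), 'Situation C', 'Check')))

def series_apply():
    return df_small['Log'].apply(classification)

# Best of three single runs for each approach
t_where = min(timeit.repeat(nested_where, number=1, repeat=3))
t_apply = min(timeit.repeat(series_apply, number=1, repeat=3))

print(f"nested np.where: {t_where:.3f} s")
print(f"Series.apply:    {t_apply:.3f} s")
```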
A final remark: using the solution suggested in the comments, i.e. something like:
d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')
has the following problems:
It won't necessarily be faster. Testing on my machine:
%timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The .extract method follows string order, i.e. from the string 'AB' it will extract 'A' and from 'BA' it will extract 'B'. On the other hand, the OP's function classification has a hierarchical ordering of extraction, and thus extracts 'A' in both cases.
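The ordering difference can be seen on a tiny example. This is a sketch only: the frame `demo` and the variable names are made up for illustration, and `expand=False` is used here (unlike in the snippet above) so that .extract returns a Series directly.

```python
import pandas as pd

def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}

# Hypothetical logs where the letters appear in different orders
demo = pd.DataFrame({'Log': ['AB', 'BA', 'xyz']})

# .extract returns the first match in left-to-right string order
extracted = demo['Log'].str.extract('(A|B|C)', expand=False)
via_extract = extracted.map(d).fillna('Check')

# classification checks 'A' first, then 'B', then 'C', regardless of position
via_apply = demo['Log'].apply(classification)

print(via_extract.tolist())  # ['Situation A', 'Situation B', 'Check']
print(via_apply.tolist())    # ['Situation A', 'Situation A', 'Check']
```

The two results disagree on 'BA', which is exactly the case where the position of a letter and its priority differ.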
Upvotes: 2