Reputation: 3359
So I started yesterday on applying a function to a decent-sized dataset (6 million rows), but it's taking forever. I'm even trying to use pandarallel, but that isn't working well either. In any case, here is the code that I'm using...
import numpy as np

def classifyForecast(dataframe):
    # Number of periods (rows) with non-zero demand
    buckets = len(dataframe[dataframe['QUANTITY'] != 0])
    try:
        # Average demand interval and coefficient of variation of demand
        adi = dataframe.shape[0] / buckets
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()
        if adi < 1.32:
            if cov < .49:
                dataframe['TYPE'] = 'Smooth'
            else:
                dataframe['TYPE'] = 'Erratic'
        else:
            if cov < .49:
                dataframe['TYPE'] = 'Intermittent'
            else:
                dataframe['TYPE'] = 'Lumpy'
    except:
        dataframe['TYPE'] = 'Smooth'
    try:
        dataframe['ADI'] = adi
    except:
        dataframe['ADI'] = np.inf
    try:
        dataframe['COV'] = cov
    except:
        dataframe['COV'] = np.inf
    return dataframe
from pandarallel import pandarallel
pandarallel.initialize()

def quick_classification(df):
    return df.parallel_apply(classifyForecast(df))
Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row, but instead I want it to work on the chunks. That way I can get the .mean() and .std() of specific columns.
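For reference, the intent is to run the function once per chunk, roughly like this (grouping on an ITEM column is only an illustration of how I'm splitting the data, not the actual key):

# Illustration only: 'ITEM' stands in for whatever key the data is split on.
# Applying per group means .mean() and .std() are computed over each chunk,
# not row by row.
chunks = df.groupby('ITEM')
classified = chunks.parallel_apply(classifyForecast)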
It shouldn't take 48 hours to complete. How do I speed this up?
Upvotes: 1
Views: 300
Reputation: 3261
It looks like mean and std are the only calculations here, so I'm guessing that this is the bottleneck. You could try speeding it up with numba.
from numba import njit
import numpy as np

# JIT-compiled versions of the two reductions; parallel=True asks numba
# to auto-parallelize them where it can.
@njit(parallel=True)
def numba_mean(x):
    return np.mean(x)

@njit(parallel=True)
def numba_std(x):
    return np.std(x)

cov = numba_std(dataframe['QUANTITY'].values) / numba_mean(dataframe['QUANTITY'].values)
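If it helps, a minimal sketch of dropping these into your function (assuming the same QUANTITY column as in your code; pulling out .values once avoids converting twice, and note that the very first call will include numba's compilation time):

qty = dataframe['QUANTITY'].values        # convert to a NumPy array once
adi = dataframe.shape[0] / buckets        # unchanged
cov = numba_std(qty) / numba_mean(qty)    # same ratio, computed by the jitted helpers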
Upvotes: 1