Reputation: 593
Suppose I have a function I want to apply to multiple columns, but rather than applying it to each column sequentially, I want to do it in parallel. After going down the rabbit hole, I ended up learning about Dask, a parallel computing library that works with pandas.
I benchmarked it, but the following code is no quicker than just doing it sequentially, i.e.
for col in columns:
    do stuff
Here is my code, which takes a datetime column and extracts attributes such as the day:
import dask

@dask.delayed
def get_datestuff(data, datecol, intervals=["day", "month", "year", "dayofweek"], keep_orig=True):
    try:
        for interval_type in intervals:
            # getattr on the .dt accessor avoids the fragile eval-on-a-string approach
            data.loc[:, datecol + interval_type] = getattr(data[datecol].dt, interval_type)
        # Delete the original column
        if not keep_orig:
            data.drop(datecol, axis=1, inplace=True)
    except AttributeError:
        print("Already done")
    finally:
        return data
So I'm doing this:
s = time.time()
for col in data.columns:
    get_datestuff(data, datecol=col).compute()
print(time.time() - s)
Am I doing this right?
Upvotes: 0
Views: 238
Reputation: 57319
You want to avoid calling compute repeatedly. Each .compute() call blocks until that single task finishes, so your loop still runs the tasks one at a time; instead, collect all the delayed objects first and compute them together in one call. See this document:
https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
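For example, here is a minimal sketch of that pattern, using a simplified stand-in for your get_datestuff function and a small hypothetical DataFrame: build all the delayed tasks first, then trigger one dask.compute so the scheduler can run them in parallel.

```python
import dask
import pandas as pd

@dask.delayed
def get_datestuff(data, datecol, intervals=("day", "month", "year", "dayofweek")):
    # Work on a copy so parallel tasks don't mutate the same DataFrame
    out = data.copy()
    for interval_type in intervals:
        out[datecol + interval_type] = getattr(out[datecol].dt, interval_type)
    return out

# Toy frame with two datetime columns (example data)
data = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-06-15"]),
    "end": pd.to_datetime(["2020-02-01", "2020-07-15"]),
})

# Build one delayed task per column -- nothing runs yet
tasks = [get_datestuff(data, datecol=col) for col in data.columns]

# A single compute() call lets Dask schedule all tasks in parallel
results = dask.compute(*tasks)
```

Note that each call inside the loop only constructs a task graph; the work happens once, inside dask.compute, instead of once per column as with per-task .compute() calls.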
Upvotes: 2