handavidbang

Reputation: 593

How to use a function across multiple columns using dask or parallel python

Suppose I have a function I want to apply to multiple columns. Rather than applying it to each column sequentially, I want to run it in parallel. After going down the rabbit hole, I ended up learning about Dask, a parallel computing library that integrates with pandas.

I ran a benchmark, but the following code doesn't run any faster than just doing it sequentially, i.e.

for col in columns: do stuff

Here is my code, which handles datetime columns and extracts attributes such as the day:

@dask.delayed
def get_datestuff(data, datecol, intervals=("day", "month", "year", "dayofweek"), keep_orig=True):
    try:
        for interval_type in intervals:
            # getattr avoids eval and works for any column name
            data.loc[:, datecol + interval_type] = getattr(data[datecol].dt, interval_type)
        if not keep_orig:
            # Delete the original column
            data.drop(datecol, axis=1, inplace=True)
    except AttributeError:
        # Not a datetime column (no .dt accessor)
        print("Already done")
    return data

So I'm doing this:

import time

s = time.time()
for col in data.columns:
    get_datestuff(data, datecol=col).compute()
print(time.time() - s)

Am I doing this right?

Upvotes: 0

Views: 238

Answers (1)

MRocklin

Reputation: 57319

You want to avoid calling compute repeatedly. Each call to compute starts a separate scheduler run, so Dask never sees the whole set of tasks and can't run them in parallel. See this document:

https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
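A minimal sketch of what that looks like here: build one delayed task per column, then hand them all to a single dask.compute call so they can run in parallel. The extract_date_parts helper and the sample data are my own illustration, not your original function; it returns a new frame per column instead of mutating the shared one, which is safer for parallel execution.

```python
import pandas as pd
import dask

@dask.delayed
def extract_date_parts(series, intervals=("day", "month", "year", "dayofweek")):
    # Build a DataFrame of date attributes for one datetime column.
    return pd.DataFrame(
        {series.name + part: getattr(series.dt, part) for part in intervals}
    )

data = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-06-15"]),
    "end": pd.to_datetime(["2020-02-01", "2020-07-15"]),
})

# One delayed task per column; nothing runs yet.
tasks = [extract_date_parts(data[col]) for col in data.columns]

# A single compute call lets the scheduler run all tasks together.
results = dask.compute(*tasks)
out = pd.concat([data] + list(results), axis=1)
```

Calling .compute() inside the loop, as in your benchmark, forces each column to finish before the next one starts, so you pay Dask's scheduling overhead without getting any parallelism.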

Upvotes: 2
