Jyotsna_b

Reputation: 1053

DASK Dataframe within loops

I'm having some trouble trying to implement loops in Dask, for example in the following code:

cols_constant = []
for i in range(len(col)):  # col is the list of column names
    if df[col[i]].dtype == 'object':
        pass
    elif df[col[i]].std().compute() == 0:  # triggers a full computation per column
        cols_constant.append(col[i])
df = df.drop(cols_constant, axis=1)

The same code runs very quickly with pandas, but with Dask it takes a considerable amount of time to complete.

I understand that Dask is inefficient with loops, but how can I optimize code like the above for Dask?

I cannot use df.persist() since we intend to do the computation on multiple worker systems.

Would it be useful to use the dask.do function to parallelize this task?

Upvotes: 1

Views: 2034

Answers (1)

MRocklin

Reputation: 57281

Every time you call df.column.std().compute() you incur not only the cost of calling std() but also the cost of creating df. If you created df from a pandas dataframe then this is cheap, but if you created df by some more expensive process, like reading in CSV files, then it can be very expensive.

import dask.dataframe as dd

df = dd.from_pandas(...)  # OK to call compute many times; the data is already in memory
df = dd.read_csv(...)     # slow to call compute many times; all the CSV files are re-read on every compute call

If you have the memory then you can avoid this repeated cost by calling persist:

df = df.persist()

In your question you say that you can't use persist because you plan to run this on multi-worker systems. To be clear, if you have enough memory available, persist works in all cases, both single-worker and multi-worker; on a distributed cluster it keeps the data in the workers' memory.
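For example, here is a minimal sketch of persisting on a multi-worker cluster; the scheduler address, file pattern, and column name are placeholders:

import dask.dataframe as dd
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')  # placeholder address of an existing cluster

df = dd.read_csv('data-*.csv')  # placeholder file pattern
df = df.persist()  # the partitions now live in the workers' memory

# later compute calls reuse the persisted partitions instead of re-reading the CSVs
print(df['some_column'].std().compute())  # 'some_column' is a placeholder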

You can also avoid repeated calls to compute by calling compute only once:

import dask

stds = [df[column].std() for column in df.columns]
(stds,) = dask.compute(stds)  # one pass computes all the stds at once

This computes everything in a single pass over the data.
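Applied to the loop in the question, a sketch of this single-compute approach (assuming df is a Dask dataframe) might look like:

import dask

# collect the lazy std() tasks for every non-object column
numeric_cols = [c for c in df.columns if df[c].dtype != 'object']
(stds,) = dask.compute([df[c].std() for c in numeric_cols])

# drop the columns whose standard deviation is zero
cols_constant = [c for c, s in zip(numeric_cols, stds) if s == 0]
df = df.drop(cols_constant, axis=1)

This replaces one compute call per column with a single compute call for the whole dataframe.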

Upvotes: 3
