Naren Babu R

Reputation: 484

dask.distributed not utilising the cluster

I'm not able to process this block using the distributed cluster.

import pandas as pd
from dask import dataframe as dd 
import dask

df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []
def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val

from dask.distributed import Client

client = Client(...)
dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()

Also, I have a few queries:

  1. Does dask.delayed use the available cluster to do the computation?

  2. Can I parallelize the for-loop iteration over this pandas DataFrame using delayed, and use multiple machines in the cluster to do the computation?

  3. Does dask.distributed work on pandas dataframes?

  4. Can we use dask.delayed in dask.distributed?

  5. If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DataFrame for the above scenario?

Upvotes: 1

Views: 667

Answers (2)

mdurant

Reputation: 28683

For the record, some answers, although I would also note my general points about this question in my other answer.

Does dask.delayed use the available cluster to do the computation?

If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
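A minimal sketch of this: an in-process cluster stands in for a real one here, and sum is just an illustrative function; in practice you would pass your scheduler's address to Client to reach remote workers.

```python
import dask
from dask.distributed import Client

# In-process cluster for illustration; pass a scheduler address
# instead to connect to remote workers.
client = Client(processes=False)

# Once a Client exists, dask uses it for .compute() by default.
result = dask.delayed(sum)([1, 2, 3]).compute()
print(result)  # 6
client.close()
```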

Can I parallelize the for-loop iteration over this pandas DataFrame using delayed, and use multiple machines in the cluster to do the computation?

Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how; it depends on what you really want to achieve.
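One hedged sketch of the per-row pattern: each row becomes one delayed task, which a distributed scheduler can spread across workers (square is an illustrative function, not anything from the question).

```python
import pandas as pd
import dask

# One delayed task per row, so the scheduler can run rows in parallel.
@dask.delayed
def square(value):
    return value ** 2

df = pd.DataFrame({'x': [1, 2, 3, 4]})
tasks = [square(row.x) for row in df.itertuples()]
results = dask.compute(*tasks)
print(results)  # (1, 4, 9, 16)
```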

Does dask.distributed work on pandas dataframes?

Yes, you can do anything that python can do with distributed, since it is just python processes executing code. Whether it brings you the performance you are after is a separate question.

Can we use dask.delayed in dask.distributed?

Yes, distributed can execute anything that dask in general can, including delayed functions/objects.

If the above programming approach is wrong, can you guide me on whether to choose delayed or a dask DataFrame for the above scenario?

Not easily; it is not clear to me that this is a dataframe operation at all. It seems more like an array operation. But, again, I note that your function does not actually return anything useful at all.

In the tutorial: passing pandas dataframes to delayed; the same applies to the dataframe API.
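A sketch of the tutorial pattern referred to above, assuming the delayed function simply receives a plain pandas DataFrame as an argument (total is an illustrative name):

```python
import pandas as pd
import dask

# The delayed function gets an ordinary pandas DataFrame inside the task.
@dask.delayed
def total(pdf):
    return int(pdf['x'].sum())

pdf = pd.DataFrame({'x': [1, 2, 3]})
result = total(pdf).compute()
print(result)  # 6
```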

Upvotes: 3

mdurant

Reputation: 28683

The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means, you should use either the delayed API or the dataframe API. While you can convert dataframes<->delayed, simply passing like this is not recommended.

Furthermore,

  • you only have one row in your dataframe, so you only get one partition and no parallelism whatsoever. You can only slow things down like this.
  • this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take an extremely long time, no matter how many cores you used
  • passing lists in a pandas row is not a great idea; perhaps you wanted to use an array?
  • the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
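To illustrate the array point above: the nested loops compute every pairwise sum of encoding values, which NumPy broadcasting does in one step (the sample values here are illustrative, not taken from the question).

```python
import numpy as np

# Two encodings of length 5; flatten, then broadcast to get all
# pairwise sums at once instead of four nested loops.
encod = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
flat = encod.ravel()
pairwise = flat[:, None] + flat[None, :]
print(pairwise.shape)  # (10, 10)
```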

Upvotes: 3
