Andrew Mathews
Andrew Mathews

Reputation: 139

How do I use dask distributed?

I am trying to use Dask by looking at the code examples and documentation, and have trouble understanding how it works. As suggested in the document, I am trying to use the distributed scheduler (I also plan to deploy my code on an HPC).

The first simple thing I tried was this:

from dask.distributed import Client
import dask.bag as db

if __name__ == '__main__':
    client = Client(n_workers=2)

print("hello world")

The hello world got printed thrice, which I think is because of the workers. I was assuming that unless compute is called, the workers are not started. I can move my print statement to a function:

if __name__ == '__main__':
    client = Client(n_workers=2)

def print_func():
    print("hello world")

But, how do I make sure that only one worker executes this function? (In MPI, I can do this by using rank == 0; I did not find anything similar to MPI_Comm_rank() which can tell me the worker number or id in Dask).

I went further and started using an example provided in Dask:

from dask.distributed import Client
import dask.bag as db

if __name__ == '__main__':
    client = Client()

def is_even(n):
    return n % 2 == 0

b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
c = b.filter(is_even).map(lambda x: x ** 2)
print(c.compute())

But this shows up an error: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. I was assuming that the dask.bag would split up the compute work automatically. My apologies for a lengthy post, but I am having trouble wrapping my head around Dask (I am used to MPI and OpenMP programming).

Upvotes: 0

Views: 413

Answers (1)

hobbs
hobbs

Reputation: 239712

But, how do I make sure that only one worker executes this function? (In MPI, I can do this by using rank == 0; I did not find anything similar to MPI_Comm_rank() which can tell me the worker number or id in Dask).

This is effectively what the if __name__ == '__main__' block is checking. That condition is true when your script is run directly; it's not true when it's imported as a module by the workers. Any code that you put outside of this block is run by every worker when it starts up; this should be limited to function definitions and necessary global setup. All of the code that actually does work needs to be in the if __name__ == '__main__' block, or inside functions which are only called from inside that block.

Upvotes: 2

Related Questions