Reputation: 139
I am trying to use Dask by looking at the code examples and documentation, and have trouble understanding how it works. As suggested in the document, I am trying to use the distributed scheduler (I also plan to deploy my code on an HPC).
The first simple thing I tried was this:
from dask.distributed import Client
import dask.bag as db
if __name__ == '__main__':
client = Client(n_workers=2)
print("hello world")
The hello world
got printed thrice, which I think is because of the workers. I was assuming that unless compute is called, the workers are not started.
I can move my print statement to a function:
if __name__ == '__main__':
client = Client(n_workers=2)
def print_func():
print("hello world")
But, how do I make sure that only one worker executes this function? (In MPI, I can do this by using rank == 0
; I did not find anything similar to MPI_Comm_rank()
which can tell me the worker number or id in Dask).
I went further and started using an example provided in Dask:
from dask.distributed import Client
import dask.bag as db
if __name__ == '__main__':
client = Client()
def is_even(n):
return n % 2 == 0
b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
c = b.filter(is_even).map(lambda x: x ** 2)
print(c.compute())
But this shows up an error: An attempt has been made to start a new process before the current process has finished its bootstrapping phase
. I was assuming that the dask.bag
would split up the compute work automatically.
My apologies for a lengthy post, but I am having trouble wrapping my head around Dask (I am used to MPI and OpenMP programming).
Upvotes: 0
Views: 413
Reputation: 239712
But, how do I make sure that only one worker executes this function? (In MPI, I can do this by using
rank == 0
; I did not find anything similar toMPI_Comm_rank()
which can tell me the worker number or id in Dask).
This is effectively what the if __name__ == '__main__'
block is checking. That condition is true when your script is run directly; it's not true when it's imported as a module by the workers. Any code that you put outside of this block is run by every worker when it starts up; this should be limited to function definitions and necessary global setup. All of the code that actually does work needs to be in the if __name__ == '__main__'
block, or inside functions which are only called from inside that block.
Upvotes: 2