Julias

Reputation: 5892

How to distribute multiple computations with Ray

In my use case, I have a huge computation that contains many stages.

The stages must be executed sequentially.

For every stage, there could be many tasks, and they can be executed in parallel. Each task requires 1 CPU.

Each task takes a big dataset as input (for example, 150 GB).

The dataset is located in remote storage but cached on the machine, so it is important that tasks stay machine-sticky as much as possible.

For example:
Computation
- Stage1  (100 tasks)
- Stage2  (200 tasks)
 ....
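
To make the structure concrete, here is a minimal single-machine sketch of the shape of one computation (the stage functions and task counts are placeholders; in reality each task also reads the cached dataset and the tasks are spread over many machines):

from concurrent.futures import ProcessPoolExecutor

def stage1_task(i):
    # placeholder; a real task would load the cached dataset and do heavy work
    return i

def stage2_task(i):
    return i

stages = [(stage1_task, 100), (stage2_task, 200)]

if __name__ == "__main__":
    for task_fn, n_tasks in stages:  # stages run strictly one after another
        with ProcessPoolExecutor() as pool:  # tasks within a stage run in parallel, 1 CPU each
            results = list(pool.map(task_fn, range(n_tasks)))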

In the system there are several users, and each of them can run several computations (huge, large, ... depending on the use case).

My question: can I use the Ray framework for this system, and what would be the best approach to use it? Is it possible to hint Ray, for every stage/computation, which machines should be used?

If not Ray, what framework might fit the requirements?

Upvotes: 0

Views: 893

Answers (2)

Alex

Reputation: 1448

You can definitely do this with Ray. For example:

import ray

ray.init(address="same_address_for_all_jobs")

@ray.remote
def stage1(x):
    # do stage 1 compute here
    return x

@ray.remote
def stage2(x):
    # do stage 2 compute here
    return x

stage1_results = []
for i in range(100):
    # Specify custom resources here to require that the task runs on certain nodes
    # ("my_custom_resource" is a placeholder; the target nodes must be started with a matching custom resource).
    ref = stage1.options(resources={"my_custom_resource": 1}).remote(i)
    stage1_results.append(ref)

stage2_results = []
for stage1_ref in stage1_results:
    # Passing the ObjectRef means stage2 waits for the corresponding stage 1 task before starting.
    stage2_ref = stage2.remote(stage1_ref)
    stage2_results.append(stage2_ref)

Note that once the dataset is in Ray's object store, Ray's scheduler is locality aware, so it should schedule the remaining tasks on the same nodes as the data when possible.
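
For example, a minimal sketch of that pattern (load_dataset is a made-up placeholder here): put the dataset into the object store once with ray.put and pass the resulting ObjectRef to every task, so the scheduler can prefer nodes that already hold the object.

import ray

ray.init(address="same_address_for_all_jobs")

def load_dataset():
    # placeholder; in practice this reads the cached 150 GB dataset
    return list(range(1000))

# Store the dataset in the object store once; tasks receive a reference, not a fresh copy per call.
dataset_ref = ray.put(load_dataset())

@ray.remote(num_cpus=1)
def stage1_task(dataset, i):
    # Ray resolves dataset_ref to the actual object before running the task.
    return len(dataset) + i

refs = [stage1_task.remote(dataset_ref, i) for i in range(100)]
results = ray.get(refs)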

Upvotes: 1

André

Reputation: 1068

As I see it, SLURM may be an easier option. It seems easier to make jobs "sticky" there, since you can specify the actual nodes. You could define a job array for each stage, while each task would be executed on a single machine. At least that is how I would solve it. That being said, there is probably also a way to make it work with Ray, depending on how exactly your problem is defined and whether you need to reassign specific tasks to specific nodes later on. Since Ray does not expose the cluster topology for assignment, this may be a problem if you want to use Ray.

Upvotes: 1
