Foobar

Reputation: 8467

How to spawn multiple workers to read a dataset with Ray?

I have a Ray dataset that I created with:

import ray

# TarDatasource, fs, and S3_SOURCE_URL are defined elsewhere in my code.
items = ray.data.read_datasource(
    TarDatasource(extra_tar_flags="--strip-components 2", profile=True),
    paths=S3_SOURCE_URL,
    filesystem=fs,
    include_paths=True,
)

total_items = items.count()

Right now, counting the number of items in this dataset is very slow because all the processing is done on a single node.

I'd like to spawn more workers to count the items in this dataset (I'm running on a Ray cluster).

Does anyone know how to do this? I tried passing parallelism=100 as a kwarg, but that did not spawn 100 workers or create 100 tasks.

Upvotes: 0

Views: 631

Answers (1)

jianxiao

Reputation: 76

FYI, this same question was discussed here: https://discuss.ray.io/t/how-to-increase-parallelism-for-dataset-count/7864

Upvotes: 1
