Reputation: 8467
I have a Ray dataset that I created with:
items = ray.data.read_datasource(
    TarDatasource(extra_tar_flags="--strip-components 2", profile=True),
    paths=S3_SOURCE_URL,
    filesystem=fs,
    include_paths=True,
)
total_items = items.count()
Right now, counting the number of items in this dataset is very slow because all the processing happens on a single node.
I'd like to spawn more worker tasks across my Ray cluster so the count is computed in parallel.
Does anyone know how to do this?
I tried passing parallelism=100
as a kwarg, but that did not spawn 100 worker nodes or create 100 tasks.
Upvotes: 0
Views: 631
Reputation: 76
FYI, this same question was discussed here: https://discuss.ray.io/t/how-to-increase-parallelism-for-dataset-count/7864
Upvotes: 1