Reputation: 8467
I have a Ray dataset that I created with:
items = ray.data.read_datasource(
    TarDatasource(extra_tar_flags="--strip-components 2", profile=True),
    paths=S3_SOURCE_URL,
    filesystem=fs,
    include_paths=True,
)
total_items = items.count()
Right now, counting the number of items in this dataset is very slow because all the processing happens on a single node.
I'd like to spawn more worker tasks across my Ray cluster so the count is computed in parallel.
Does anyone know how to do this?
I tried passing parallelism=100
as a kwarg, but that did not spawn 100 worker nodes or create 100 tasks.
Upvotes: 0
Views: 631
Reputation: 76
FYI, this same question was discussed here: https://discuss.ray.io/t/how-to-increase-parallelism-for-dataset-count/7864
Upvotes: 1