How are the number of partitions and the number of concurrent tasks in Spark calculated?

Question

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.

I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).

When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) running at any one time. By tasks, I mean the number of tasks that appear in the Spark application UI. I tried explicitly setting spark.default.parallelism to 128 (hoping to get 128 tasks running concurrently) and verified the setting in the application UI for the running application, but this had no effect. Perhaps this setting is ignored for a filter and the default is the total number of cores available.

I'm fairly new to Spark, so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.

Aditya Agarwal · Accepted Answer

This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
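To make the arithmetic concrete, here is a minimal sketch in plain Python (no Spark required). The helper names `max_concurrent_tasks` and `task_waves` are made up for illustration and are not part of any Spark API. It assumes one task per partition and one core per task, as described above.

```python
import math

def max_concurrent_tasks(num_partitions, total_cores):
    # Hypothetical helper: each core runs one task at a time, and there is
    # one task per partition, so concurrency is capped by the smaller number.
    return min(num_partitions, total_cores)

def task_waves(num_partitions, total_cores):
    # Hypothetical helper: number of scheduling "waves" needed to work
    # through all partitions when every core is kept busy.
    return math.ceil(num_partitions / total_cores)

total_cores = 4 * 16   # 4 nodes with 16 cores each, as in the question
partitions = 200       # the repartitioned RDD

print(max_concurrent_tasks(partitions, total_cores))  # 64
print(task_waves(partitions, total_cores))            # 4
```

So with 200 partitions on 64 cores, the stage still runs 200 tasks in total, but only 64 of them at a time, in roughly four waves. The extra partitions are not wasted; they simply queue until a core frees up.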

You could run multiple workers per node to get more executors. That would give Spark more cores (task slots) to schedule on across the cluster. However many cores you have, though, each core will run only one task at a time.
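As a rough sketch of how the slot count changes with multiple workers per node, the plain-Python helper below (the name `total_task_slots` is hypothetical, not a Spark API) multiplies out the cores that Spark is told about. It assumes every worker advertises the same number of cores. Note that if the advertised cores exceed the physical cores, the machine is oversubscribed and tasks compete for CPU time.

```python
def total_task_slots(nodes, workers_per_node, cores_per_worker):
    # Hypothetical helper: the number of tasks Spark can schedule at once
    # is the total number of cores advertised by all workers.
    return nodes * workers_per_node * cores_per_worker

# One worker per node, each advertising all 16 cores (the setup in the question).
print(total_task_slots(nodes=4, workers_per_node=1, cores_per_worker=16))  # 64

# Two workers per node, each still advertising 16 cores: Spark sees 128 slots,
# but the hardware still has only 64 physical cores.
print(total_task_slots(nodes=4, workers_per_node=2, cores_per_worker=16))  # 128
```

In the second case Spark would schedule up to 128 tasks at once, but whether that improves throughput depends on the workload, since two tasks would then share each physical core.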

You can find more details in the following thread: "How does Spark paralellize slices to tasks/executors/workers?"
