spiky

Reputation: 61

In the spark-submit command, does there exist a flag to control the level of parallelism?

In Apache Spark, for the spark-submit command, does there exist a flag to control the level of parallelism?

Upvotes: 0

Views: 521

Answers (1)

Ramkumar Venkataraman

Reputation: 868

You could try to set the number of executors using --num-executors and then set the number of cores to play with using either --executor-cores or --total-executor-cores. You can pass these as command-line arguments or set the equivalent properties in the Spark config file. Note that --num-executors works only in YARN mode.
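For example, a minimal spark-submit invocation might look like this (the master URL, jar name and numbers are placeholders, not values from the question):

    # YARN mode: --num-executors and --executor-cores are honoured here.
    spark-submit \
      --master yarn \
      --num-executors 4 \
      --executor-cores 3 \
      your-app.jar

    # Standalone mode: cap the total cores used across the cluster instead.
    spark-submit \
      --master spark://master-host:7077 \
      --total-executor-cores 12 \
      your-app.jar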

The actual parallelism in Spark, however, is controlled by the number of partitions in a DataFrame/RDD. Generally, when you create an RDD, you can specify the number of partitions that you need. You can also see the default parallelism using sc.defaultParallelism. So if you assign fewer partitions than the number of cores, you essentially waste some of the cores that you have.
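A minimal sketch in the spark-shell (which provides a SparkContext named sc):

    // Default number of partitions Spark uses when none is specified.
    println(sc.defaultParallelism)

    // Explicitly ask for 8 partitions when creating an RDD.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)   // prints 8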

Now Spark takes the RDD, distributes it across the cluster and spawns tasks (which are essentially closures created from your code) that operate on the partitions. One task is spawned per partition, and the number of tasks that can run at the same time is bounded by the number of cores in your cluster (or the values you passed). The general rule of thumb is to have 2-3 tasks per core, as the task startup overhead in Spark is very small.
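As an illustrative sketch of that rule of thumb (again assuming a spark-shell SparkContext named sc; the sizes are placeholders):

    // Aim for roughly 2-3 partitions (and hence tasks) per available core.
    val data = sc.parallelize(1 to 1000000)
    val target = sc.defaultParallelism * 2   // e.g. 12 cores -> 24 partitions
    val resized = data.repartition(target)
    println(resized.getNumPartitions)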

Upvotes: 1
