QbS

Reputation: 494

Shuffle partitions conf changed in one of several parallel Databricks tasks

I have a Databricks workflow that creates several entities (each represented as one task in the workflow). Due to performance issues, I optimized some of them by hand by changing the shuffle size. When such a task finishes, the shuffle.partitions is reverted to the default value (200).
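For context, the hand tuning amounts to something like the line below at the top of the affected task's notebook (a sketch; 800 is the value referred to later in this question):

```scala
// Session-scoped override; other notebooks/tasks on the cluster
// keep the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", 800)
```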

Maybe someone can help me understand what generally happens when one of the tasks, processed in parallel with other tasks, has its Spark configuration for shuffle partitions changed?

  1. To be precise: what happens to the other entities (with the default value of 200 shuffle partitions) that, let's say, already started processing a few seconds/minutes ago?

  2. What would be the best approach in such a situation? Should the task(s) with different shuffle partition values be isolated in the workflow?

[Workflow illustration]

Additional description:

  1. Each Databricks task follows the same logic: read data from source X, transform, write (see the sketch after this list). So in each Databricks task there is only one action.
  2. Each Databricks task starts at a different time, depending on when its action runs.
  3. Some actions with the default shuffle.partitions (200) will start processing before the Databricks task whose shuffle size was changed to 800 starts processing.
  4. Everything happens on the same cluster.
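For reference, each task body has this read-transform-write shape (a sketch; the format, path, and column names are placeholders):

```scala
// One task == one Spark action: the single write triggers the whole plan.
val df = spark.read.format("delta").load("/mnt/source/entity_x")  // read from source X
val transformed = df.groupBy("key").count()                       // transform (the shuffle stage)
transformed.write.mode("overwrite").saveAsTable("entity_x_out")   // the one action
```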

So the question:

  1. How will such a change of shuffle partitions (for the whole cluster) affect already-running Spark jobs/tasks that started out with 200 partitions?
  2. Will it change to 800 for all the already-initiated Spark jobs/tasks? If so, won't it somehow affect the process before such a job finishes?

Upvotes: 0

Views: 118

Answers (1)

Ged

Reputation: 18098

In a Spark app, you can run multiple independent actions in parallel - for example with Futures or .par (not that I recommend the latter) - each with its own shuffle.partitions for that independent action. So there can be multiple actions with their own sets of parameters, as the actions are independent of each other.
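A minimal sketch of that idea, assuming a Databricks-style predefined spark session; the per-thread isolation here comes from spark.newSession(), which shares the SparkContext but gives each action its own session-scoped SQLConf (the table names are made up):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future gets a child session: same SparkContext, separate SQLConf,
// so shuffle.partitions can differ per action without interference.
val jobs = Seq(("entity_a", 200), ("entity_b", 800)).map { case (table, parts) =>
  Future {
    val session = spark.newSession()
    session.conf.set("spark.sql.shuffle.partitions", parts)
    session.read.table(table)                 // hypothetical source table
      .groupBy("key").count()                 // shuffle runs with `parts` partitions
      .write.mode("overwrite").saveAsTable(s"${table}_agg")
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```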

In Databricks Workflows, you create Jobs with Task(s) that run independently; each Task can be seen as an independent action (or even several actions). Having looked in the Databricks Sandbox, I found the following.

The way Workflows work is that you can set:

  1. Job level Spark confs.
  2. Task level Spark confs, via parameters, as shown in the diagram below.
  3. Spark confs in the Notebook itself.

[Diagram: setting Task level Spark confs via parameters]
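As a sketch of options 2 and 3 combined: pass the value as a task parameter and apply it in the notebook (the parameter name shufflePartitions is an assumption for illustration):

```scala
// The task-level parameter arrives as a notebook widget in Databricks;
// apply it as a session-scoped conf before any transformations run.
val shufflePartitions = dbutils.widgets.get("shufflePartitions")  // e.g. "800"
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)
```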

In that sense, you need not worry about the questions you raise: all Tasks run as independent action(s), and you can define the Spark confs at a lower level than the Job level. If you don't do this, you do get the global Spark confs, but you can override them as per the above.

Moreover, those changes to Spark confs occurring at run time at Task level do not propagate to other Tasks.

Upvotes: 0
