Reputation: 494
I have a Databricks workflow that creates several entities (each represented as one task in the workflow). Some of these tasks had shuffle-related performance issues, so I optimized them by hand by changing the shuffle partition count. When such a task finishes, spark.sql.shuffle.partitions
is reverted to the default value (200).
Maybe someone can help me understand what generally happens when one of the tasks running in parallel with other tasks receives a change to the Spark configuration for shuffle partitions?
To be precise: what happens to the other entities (still using the default of 200 shuffle partitions) that, say, already started processing a few seconds or minutes ago?
What would be the best approach in such a situation? Should tasks with different shuffle partition values be isolated in the workflow?
Upvotes: 0
Views: 118
Reputation: 18098
In a Spark App, you can run multiple independent Actions in parallel - by using Futures or .par (not that I recommend the latter) - each with its own shuffle.partitions setting for that Action. So there can be multiple Actions with their own sets of parameters, as the Actions are independent of each other.
In Databricks Workflows, you create Jobs with Task(s) that run independently, where each Task can be seen as an independent Action (or even several Actions). Having looked in the Databricks Sandbox: the way Workflows work is that you can set Spark Conf's at a lower level than the Job level - for example on a Task's own (job) cluster, or at run time in the Task's code.
In that sense, you need not worry about the questions you raise, as all Tasks run as independent Action(s) and you can define the Spark Conf's below Job level. If you don't do this, then you do get the global Spark Conf's, but you can override them as per above.
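For example, a Task can pin the conf on its own job cluster in the Job definition. A hedged sketch of the relevant fragment of a Jobs API task spec (the task key, node type, and values are illustrative):

```json
{
  "task_key": "build_big_entity",
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "spark_conf": {
      "spark.sql.shuffle.partitions": "64"
    }
  }
}
```

Other Tasks in the same Job keep whatever conf their own cluster (or the default) gives them.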
Moreover, changes to Spark Conf's made at run time at Task level do not propagate to other Tasks.
Upvotes: 0