Reputation: 1265
I am getting more and more confused as I keep reading online resources about Spark architecture and scheduling. One resource says: "The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage." On the other hand: "Spark maps the number of tasks on a particular Executor to the number of cores allocated to it." So the first resource says that if I have 1000 partitions, I will have 1000 tasks no matter what my machine is. In the second case, if I have a 4-core machine and 1000 partitions, then what? Will I have 4 tasks? Then how is the data processed?
Another source of confusion: "each worker can process one task at a time" and "Executors can run multiple tasks over their lifetime, both in parallel and sequentially." So are tasks sequential or parallel?
Upvotes: 0
Views: 1993
Reputation: 27373
Both statements are true, they just describe different things: the number of tasks in a stage equals the number of partitions, while the number of tasks an executor can run at the same time equals the number of cores allocated to it. Each core processes one task at a time, unless spark.task.cpus is configured to something other than 1 (which is the default value). So think of tasks as some (independent) chunks of work which have to be processed. They can surely run in parallel.
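For illustration, here is a minimal sketch of how that setting would be applied (the app name and the local 4-core master are assumptions for the example, not something from the question):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: with spark.task.cpus=2 each task reserves 2 of the 4 local cores,
// so at most 2 tasks run concurrently in this application.
val spark = SparkSession.builder()
  .appName("task-cpus-sketch")
  .master("local[4]")
  .config("spark.task.cpus", "2")
  .getOrCreate()
```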
So if you have 1000 partitions and 5 executors with 4 cores each, 20 tasks will generally run in parallel, while the remaining tasks wait for a free core.
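A small sketch of those numbers (the executor settings and the toy RDD are assumptions for illustration; on a real cluster they would typically come from spark-submit and your actual data):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: 5 executors * 4 cores = 20 task slots.
val spark = SparkSession.builder()
  .appName("parallelism-sketch")
  .config("spark.executor.instances", "5") // assumes a cluster manager that honours this
  .config("spark.executor.cores", "4")
  .getOrCreate()

// 1000 partitions -> the stage below consists of 1000 tasks...
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 1000)
println(rdd.getNumPartitions) // 1000

// ...but only about 20 of them execute at any given moment; the rest queue up
// and are handed to a core as soon as one becomes free.
rdd.map(_ * 2).count()

spark.stop()
```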
Upvotes: 3