Pawel

Reputation: 1

Spark number of input partitions vs number of reading tasks

Can someone explain how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?

I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I did two tests:

Results:

I don't see any pattern here. Spark somehow reduces the number of partitions, but by what rule? Could someone help?

Upvotes: 0

Views: 1192

Answers (1)

Avishek Bhattacharya

Reputation: 6994

Spark needs to create a total of 14 tasks to process a file with 14 partitions: each task is assigned to one partition per stage.

Now, if you provide more resources, Spark will run more of those tasks in parallel, so you will see more tasks running at once when processing starts. As tasks finish, new ones are launched, depending on the resources you provided. Overall, Spark will still launch 14 tasks to process the file.
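As a rough illustration (the path, format, and session setup below are assumptions, not from the original post), you can check the one-task-per-partition relationship yourself by reading the data and looking at the partition count, then comparing it with the task count shown in the Spark UI for the scan stage:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-vs-task-check")
      .getOrCreate()

    // Assumed input path; replace with your own 91 MB / 14-partition dataset.
    val df = spark.read.parquet("/data/my_dataset.parquet")

    // The number of input partitions equals the number of tasks Spark
    // schedules for the stage that scans this data.
    println(s"Input partitions: ${df.rdd.getNumPartitions}")

    // Trigger a job; the Spark UI shows one task per partition,
    // executed in waves sized by the executor cores you provided.
    println(s"Row count: ${df.count()}")

    spark.stop()
  }
}
```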

Spark won't reduce the number of partitions of the file unless you repartition or coalesce it.
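A minimal sketch of that difference (continuing the assumed `df` from above): `coalesce` only merges existing partitions without a shuffle, while `repartition` performs a full shuffle and can increase or decrease the partition count.

```scala
// Assuming `df` was read with 14 input partitions as above.
val fewer = df.coalesce(4)       // merge down to 4 partitions, no shuffle
val more  = df.repartition(28)   // full shuffle into 28 partitions

println(fewer.rdd.getNumPartitions) // 4
println(more.rdd.getNumPartitions)  // 28
```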

Upvotes: 0
