Pawel

Reputation: 1

Spark number of input partitions vs number of reading tasks

Can someone explain how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?

I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I did two tests:

Results:

I don't see any pattern here. Spark somehow reduces the number of partitions, but by what rule? Could someone help?

Upvotes: 0

Views: 1192

Answers (1)

Avishek Bhattacharya

Reputation: 6994

Spark needs to create a total of 14 tasks to process a file with 14 partitions: each task is assigned to one partition per stage.

Now, if you provide more resources, Spark will run more of those tasks in parallel, so you will see more tasks running at once when processing starts. As tasks finish, new ones are launched, depending on the resources you provided. Overall, Spark will still launch 14 tasks to process the file.
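As a rough illustration (the path, format, and session setup below are assumptions, not from the original post), you can check the one-task-per-partition relationship yourself by reading the data and looking at the partition count, then comparing it with the task count shown in the Spark UI for the scan stage:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-vs-task-check")
      .getOrCreate()

    // Assumed input path; replace with your own 91 MB / 14-partition dataset.
    val df = spark.read.parquet("/data/my_dataset.parquet")

    // The number of input partitions equals the number of tasks Spark
    // schedules for the stage that scans this data.
    println(s"Input partitions: ${df.rdd.getNumPartitions}")

    // Trigger a job; the Spark UI shows one task per partition,
    // executed in waves sized by the executor cores you provided.
    println(s"Row count: ${df.count()}")

    spark.stop()
  }
}
```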

Spark won't reduce the number of partitions of the file unless you repartition or coalesce it.
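A minimal sketch of that difference (continuing the assumed `df` from above): `coalesce` only merges existing partitions without a shuffle, while `repartition` performs a full shuffle and can increase or decrease the partition count.

```scala
// Assuming `df` was read with 14 input partitions as above.
val fewer = df.coalesce(4)       // merge down to 4 partitions, no shuffle
val more  = df.repartition(28)   // full shuffle into 28 partitions

println(fewer.rdd.getNumPartitions) // 4
println(more.rdd.getNumPartitions)  // 28
```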

Upvotes: 0
