prady

Reputation: 593

How will executors process the data when the input data size is huge?

Let's assume a configuration of 14 executors, each with 10 GB of memory, and an input data size of 1000 GB.

Q1 - How will the executors process this huge data when it far exceeds the total memory, and how does the partition logic work here?

Q2 - How can caching help here for better performance, and what caching strategy can be used (memory only, memory and disk, etc.)?

Upvotes: 1

Views: 1699

Answers (1)

Garren S

Reputation: 5782

1) Your 1000GB input would be split into blocks/chunks/partitions that are usually ~128MB each.
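A minimal sketch of what that looks like in practice, assuming a hypothetical Parquet input at hdfs:///data/huge_input (the path and format are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .getOrCreate()

    // Hypothetical input; 1000GB split at ~128MB per partition gives
    // roughly 8000 partitions (1000 * 1024 / 128).
    val df = spark.read.parquet("hdfs:///data/huge_input")

    // Each partition is processed as one task. With only 14 executors,
    // the ~8000 tasks run in successive waves, so the full 1000GB never
    // has to fit into the cluster's 140GB of memory at once.
    println(s"Number of partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```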

2) Caching should generally only be done if the data is computationally (or bandwidth) expensive to produce and your cluster has enough memory, including overhead. You simply can't cache your entire 1000GB input with only 140GB of total memory, but you may be able to cache other data sets used in joins, or an aggregate of the 1000GB input.
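A sketch of that strategy, again using the hypothetical input path above and made-up column names (customer_id, amount); MEMORY_AND_DISK spills whatever doesn't fit in memory to local disk instead of recomputing it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-sketch")
      .getOrCreate()

    val huge = spark.read.parquet("hdfs:///data/huge_input")

    // Don't cache the raw 1000GB input; persist a much smaller
    // aggregate of it that will be reused more than once.
    val totals = huge
      .groupBy(col("customer_id"))
      .agg(sum(col("amount")).as("total"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes the cache; later actions reuse it.
    totals.filter(col("total") > 1000).count()
    totals.orderBy(col("total").desc).show(10)

    totals.unpersist()
    spark.stop()
  }
}
```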

For more background information about #1, see these Q&As:

How are stages split into tasks in Spark?

How to split the input file in Apache Spark

How does the Apache Spark scheduler split files into tasks?

Upvotes: 2
