Reputation: 18991
I'm currently taking a course on Spark and came across this definition of an executor:
Each executor will hold a chunk of the data to be processed. This chunk is called a Spark partition. It is a collection of rows that sits on one physical machine in the cluster. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for two things: (1) executing code assigned by the driver, and (2) reporting the state of the computation back to the driver.
I am wondering what happens if the storage of the Spark cluster is smaller than the data that needs to be processed. How will the executors fetch the data to sit on the physical machines in the cluster?
The same question goes for streaming data, which is unbounded. Does Spark save all the incoming data on disk?
Upvotes: 3
Views: 517
Reputation: 3824
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Although Spark keeps as much data in memory as it can by default, it can be configured to cache data on disk only.
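For example, the storage level the FAQ refers to is something you choose explicitly when persisting. A minimal PySpark sketch (the input path is a placeholder, not from your course):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-levels").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical source path

# Keep partitions in memory and spill those that don't fit to local disk
# (this is the default level used by DataFrame.cache()).
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()       # an action materializes the cache
df.unpersist()   # release it before switching levels

# Cache on disk only: minimal memory footprint, slower reads.
df.persist(StorageLevel.DISK_ONLY)
df.count()
```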
Section 2.6.4, Behavior with Insufficient Memory, of Matei Zaharia's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of available memory.
In practice, you don't usually persist a 100 TB source dataframe; you persist only the aggregations or intermediate results that are reused.
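A sketch of that pattern, assuming hypothetical paths and column names: the large source is only scanned lazily, and only the small aggregate is cached.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("persist-aggregates").getOrCreate()

# The huge source is never cached; it is read lazily, partition by partition.
events = spark.read.parquet("/data/events")

# The aggregate is tiny by comparison, so caching it is cheap.
daily_counts = (
    events
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("n"))
    .cache()
)

daily_counts.count()  # materializes the cache once
# Subsequent actions reuse the cached aggregate instead of rescanning the source.
daily_counts.filter(F.col("event_type") == "click").show()
```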
Upvotes: 1