Reputation: 770
If I have a Hadoop cluster of, say, 3 DataNodes and 1 NameNode, and in my Spark code I use something like dataframe.persist(MEMORY_AND_DISK), where does this data get persisted? Is it in the NameNode's (driver) memory, the DataNodes' (executor) memory, or both?
Also, does the storage of the cached data depend on heap size? If so, how can I increase the heap size for all the nodes?
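For reference, a minimal sketch of what I mean (the input path and names here are just placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-question").getOrCreate()

    // Hypothetical input; any DataFrame source would do.
    val dataframe = spark.read.parquet("hdfs:///some/input/path")

    dataframe.persist(StorageLevel.MEMORY_AND_DISK)
    dataframe.count()  // persist is lazy; caching only happens once an action runs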
Upvotes: 1
Views: 654
Reputation: 191844
The NameNode is not the driver, and a DataNode is not an executor. On YARN, Spark's executors (and, in cluster mode, the driver) run inside containers launched by the NodeManagers (which are often co-located with DataNodes, yes), and each application gets its own temporary storage on those nodes, as set by the YARN configuration.
The Storage tab of the Spark UI shows which executors hold the cached data and how much of it sits in memory versus on disk, if you need to find it.
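If you'd rather check from code than from the UI, something along these lines should work (a rough sketch, assuming the DataFrame from the question is called dataframe and has been persisted):

    // Storage level of the persisted DataFrame,
    // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
    println(dataframe.storageLevel)

    // Per-RDD cache usage across the executors (a developer API)
    spark.sparkContext.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: cached partitions=${info.numCachedPartitions}, " +
        s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }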
You increase the heap size by increasing the executor and driver memory respectively, keeping them within the limits of your YARN container sizes.
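As a rough sketch of the executor side when the application builds its own session (the sizes are placeholders, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("heap-sizing-example")
      .master("yarn")
      // Executor JVM heap. Heap plus spark.executor.memoryOverhead must fit
      // inside one YARN container (bounded by yarn.scheduler.maximum-allocation-mb).
      .config("spark.executor.memory", "4g")
      .config("spark.executor.memoryOverhead", "512m")
      .getOrCreate()

    // Driver heap generally has to be set before the driver JVM starts, e.g.
    //   spark-submit --driver-memory 2g ...
    // or via spark.driver.memory in spark-defaults.conf.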
Upvotes: 3