Arun

Reputation: 770

Where does an RDD get stored?

If I have a Hadoop cluster of, say, 3 DataNodes and 1 NameNode, and in my Spark code I use something like dataframe.persist(StorageLevel.MEMORY_AND_DISK), where does this data get persisted? Is it in the NameNode's (driver) memory, the DataNodes' (executor) memory, or both?

Also, does the storage of the cached data depend on heap size? If so, how can I increase the heap size for all the nodes?

Upvotes: 1

Views: 654

Answers (1)

OneCricketeer

Reputation: 191844

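The NameNode is not a driver, and a DataNode is not an executor. Those are HDFS daemons, whereas the driver and executors are Spark processes. On YARN, Spark executors run in containers on NodeManagers (which are often on the same hosts as DataNodes, yes), so persisted partitions are held by the executors: in executor memory first, and with MEMORY_AND_DISK any partitions that do not fit are written to the executor's local disk. Each application has its own temporary storage there, as set by the YARN configuration (yarn.nodemanager.local-dirs).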

The Storage tab of the Spark UI lists each persisted RDD or DataFrame with its size in memory and on disk, and it might tell you which executors actually hold the data, if you need to find it.

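And yes, by default the in-memory part of the cache lives in the executor heap, so it depends on the heap size. You increase the heap by raising the executor and driver memory (spark.executor.memory and spark.driver.memory, or --executor-memory and --driver-memory on spark-submit), keeping each within your YARN container limits (yarn.scheduler.maximum-allocation-mb), since the container also has to hold Spark's memory overhead on top of the heap.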
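To make the relationship between executor memory and YARN container size more concrete, here is a rough sketch of the arithmetic in plain Python. It assumes Spark's default overhead (10% of executor memory, with a 384 MB floor) and assumes the scheduler rounds requests up to a multiple of yarn.scheduler.minimum-allocation-mb, which is how the Capacity Scheduler behaves; the function names are made up for illustration, and your cluster's settings may differ.

```python
import math

# Spark's default executor memory overhead on YARN (assumed unchanged here).
OVERHEAD_FACTOR = 0.10   # default overhead factor: 10% of executor memory
MIN_OVERHEAD_MB = 384    # Spark never reserves less than this


def container_request_mb(executor_memory_mb, min_allocation_mb=1024):
    """Estimate the YARN container size requested for one executor.

    min_allocation_mb stands in for yarn.scheduler.minimum-allocation-mb
    (1024 is that property's default; adjust it for your cluster).
    """
    overhead = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    requested = executor_memory_mb + overhead
    # Assumption: requests are rounded up to a multiple of the minimum allocation.
    return math.ceil(requested / min_allocation_mb) * min_allocation_mb


def fits(executor_memory_mb, max_allocation_mb, min_allocation_mb=1024):
    """True if the executor's container stays within the YARN maximum
    (max_allocation_mb stands in for yarn.scheduler.maximum-allocation-mb)."""
    return container_request_mb(executor_memory_mb, min_allocation_mb) <= max_allocation_mb


if __name__ == "__main__":
    for heap_mb in (2048, 4096, 8192):
        print(heap_mb, container_request_mb(heap_mb), fits(heap_mb, 8192))
```

Under these assumptions, asking for a 4096 MB executor heap results in a 5120 MB container, and an 8192 MB heap no longer fits under an 8192 MB maximum allocation, so the YARN limit would have to be raised before the heap can grow that far.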

Upvotes: 3
