user9694017

Spark UI on Google Dataproc: numbers interpretation

I'm running a Spark job on a Google Dataproc cluster (3 worker nodes of type n1-highmem-4, so 4 cores and 26GB of RAM each; the master is the same type). I have a few questions about the information displayed in the Hadoop and Spark UIs:

When I check the Hadoop UI I get this:

1) Hadoop UI screenshot

My question here is: my total RAM should be 78GB (3x26), so why is only 60GB displayed here? Is the other 18GB used for something else?

2) Spark Executor screenshot

This is the screen showing the currently launched executors. My questions here are: why do I see only 5 executors instead of 6, and why are only 10 of my 12 worker cores in use?

3) Storage panel screenshot

This is a screenshot of the "Storage" panel. I see the DataFrame I'm working on, but I don't understand the "Size in Memory" column. Is it the total RAM used to cache the DataFrame? It seems very low compared to the size of the raw files I load into the DataFrame (500GB+). Am I misinterpreting it?

Thanks to anyone who will read this!

Upvotes: 2

Views: 365

Answers (1)

Henry Gong

Reputation: 316

Take a look at this answer; it mostly covers your questions 1 and 2.

To sum up, the total memory is lower because some memory on each node is reserved for the OS and system daemons, as well as for the Hadoop daemons themselves, e.g. the NameNode and NodeManager.
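As a rough back-of-the-envelope check, the numbers line up if Dataproc hands YARN only part of each node's RAM. The 0.8 ratio below is an assumed ballpark for illustration, not the exact Dataproc setting:

    # Back-of-the-envelope check: YARN gets only part of each node's RAM.
    # The 0.8 fraction is an assumption, not a value read from the cluster.
    node_ram_gb = 26
    workers = 3

    yarn_fraction = 0.8                               # rest goes to OS/daemons
    yarn_per_node_gb = node_ram_gb * yarn_fraction    # ~20.8 GB per node
    cluster_yarn_gb = workers * yarn_per_node_gb      # ~62 GB, close to the 60 GB shown

    print(round(yarn_per_node_gb, 1), round(cluster_yarn_gb, 1))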

The same applies to cores. In your case there are 3 nodes, each node runs 2 executors, and each executor uses 2 cores, except on the node hosting the YARN application master. That node has room for only one executor, and its remaining cores go to the application master. That's why you see only 5 executors and 10 cores.
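To make that arithmetic explicit, here is a minimal sketch of the layout described above (the per-node figures are the defaults implied by this answer, not values read from your cluster):

    # Executor layout implied by the answer above (assumed defaults).
    nodes = 3
    executors_per_node = 2
    cores_per_executor = 2

    total_executors = nodes * executors_per_node            # 6 on paper
    # The YARN application master occupies one container, so the node it
    # lands on only fits a single executor:
    visible_executors = total_executors - 1                 # 5, as in the Spark UI
    visible_cores = visible_executors * cores_per_executor  # 10

    print(visible_executors, visible_cores)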

For your 3rd question, that number should be the memory used by the partitions of that RDD currently held in memory, which is approximately equal to the memory allocated to each executor; in your case that's ~13GB.

Note that Spark doesn't load your 500GB of data all at once; instead it loads it in partitions, and the number of concurrently loaded partitions depends on the number of cores you have available.
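A minimal PySpark sketch of that behaviour; the bucket path is hypothetical and only illustrates the pattern from the question:

    # Minimal sketch: Spark materializes a cached DataFrame partition by
    # partition, so the Storage tab only grows as partitions are computed.
    # The path below is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-demo").getOrCreate()

    df = spark.read.parquet("gs://my-bucket/raw-files/")
    df.cache()   # lazy: nothing is read or cached yet
    df.count()   # action: partitions are read, processed and cached as they go

    # Number of partitions the input was split into; at most
    # (total available cores) of them are being loaded at any one time.
    print(df.rdd.getNumPartitions())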

Upvotes: 1
