Reputation: 2312
I am a newbie to Spark. I am using Azure Databricks and writing Python code with PySpark. There is one particular topic that is confusing me:
Do nodes have separate storage (I don't mean RAM/cache), or do they all share the same storage? If they share the same storage, can two different applications running in different Spark contexts exchange data through it?
I also don't understand why we sometimes refer to the storage with dbfs:/tmp/... and other times with /dbfs/tmp/.... For example, if I am using the dbutils package from Databricks, I use something like dbfs:/tmp/... to refer to a directory in the file system. However, if I am using regular Python code, I write /dbfs/tmp/....
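To make it concrete, here is roughly what I mean (a sketch from a Databricks notebook where dbutils and spark are predefined; the /tmp/demo path is just a placeholder I made up):

```python
# dbutils (and Spark APIs) take the DBFS URI form "dbfs:/..."
dbutils.fs.put("dbfs:/tmp/demo/hello.txt", "hello from dbutils", overwrite=True)
print(dbutils.fs.head("dbfs:/tmp/demo/hello.txt"))

# plain Python file access uses the local mount of DBFS, i.e. "/dbfs/..."
with open("/dbfs/tmp/demo/hello.txt") as f:
    print(f.read())
```

Why do both forms point at the same file?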
Your help is much appreciated!!
Upvotes: 0
Views: 1393
Reputation: 1751
Each node has its own separate RAM and cache. For example, say you have a cluster of 3 nodes with 4 GB each. When you deploy your Spark application, it starts worker (executor) processes, on the same node or on separate nodes, depending on the cluster configuration and the query requirements. The memory of these processes is not shared between nodes for the lifetime of the application.
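To make that concrete, these are the standard Spark settings that control how much memory each worker process gets (a rough sketch with made-up values; on Databricks these are normally set in the cluster configuration rather than in code):

```python
from pyspark.sql import SparkSession

# Sketch: each executor process gets its own slice of a node's RAM,
# and executors do not share that memory with each other.
spark = (
    SparkSession.builder
    .appName("per-executor-memory-sketch")
    .config("spark.executor.memory", "2g")    # RAM per executor process
    .config("spark.executor.cores", "2")      # CPU cores per executor process
    .config("spark.executor.instances", "3")  # how many executor processes
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.executor.memory"))
```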
This is really more of a Hadoop resource-sharing question; you can find more information under YARN resource management. Here is a very brief overview: https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop
Upvotes: 1