Reputation: 2312
I am a newbie to Spark. I am using Azure Databricks and writing Python code with PySpark. There is one particular topic that is confusing me:
Do nodes have separate storage (I don't mean RAM/cache), or do they all share the same storage? If they share the same storage, can two different applications running in different Spark contexts exchange data through it?
I also don't understand why we sometimes refer to the storage with dbfs:/tmp/... and other times with /dbfs/tmp/.... For example, if I am using the dbutils package from Databricks, I use something like dbfs:/tmp/... to refer to a directory in the file system. However, if I am using regular Python code, I write /dbfs/tmp/....
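To make it concrete, here is roughly what I mean (a sketch from a Databricks notebook where dbutils and spark are predefined; the /tmp/demo path is just a placeholder I made up):

```python
# dbutils (and Spark APIs) take the DBFS URI form "dbfs:/..."
dbutils.fs.put("dbfs:/tmp/demo/hello.txt", "hello from dbutils", overwrite=True)
print(dbutils.fs.head("dbfs:/tmp/demo/hello.txt"))

# plain Python file access uses the local mount of DBFS, i.e. "/dbfs/..."
with open("/dbfs/tmp/demo/hello.txt") as f:
    print(f.read())
```

Why do both forms point at the same file?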
Your help is much appreciated!!
Upvotes: 0
Views: 1393
Reputation: 1751
Each node has its own separate RAM and cache. For example, say you have a cluster of 3 nodes with 4 GB each. When you deploy your Spark application, it starts worker (executor) processes, on the same node or on separate nodes, depending on the cluster configuration and the query requirements. The memory of these processes is not shared between nodes for the lifetime of the application.
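To make that concrete, these are the standard Spark settings that control how much memory each worker process gets (a rough sketch with made-up values; on Databricks these are normally set in the cluster configuration rather than in code):

```python
from pyspark.sql import SparkSession

# Sketch: each executor process gets its own slice of a node's RAM,
# and executors do not share that memory with each other.
spark = (
    SparkSession.builder
    .appName("per-executor-memory-sketch")
    .config("spark.executor.memory", "2g")    # RAM per executor process
    .config("spark.executor.cores", "2")      # CPU cores per executor process
    .config("spark.executor.instances", "3")  # how many executor processes
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.executor.memory"))
```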
This is really more of a Hadoop resource-sharing question; you can find more information under YARN resource management. Here is a very brief overview: https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop
Upvotes: 1