Reputation: 11
I am trying to understand how data is stored and managed in the Databricks environment. I have a fairly decent understanding of what is going on under the hood, but I have seen some conflicting information online, so I would like a detailed explanation to solidify my understanding. To frame my questions, I'd like to summarize what I did as part of one of the exercises in the Apache Spark Developer course.
As part of the exercise, I followed these steps on the Databricks platform:
After following the above steps, here's how my DBFS directory looks:
In the root folder that I used to store the Delta table (picture above), I have the following types of folders/files:
Based on the above exercise, these are my questions:
Some of the answers I found online say that all the partitions are stored in memory (RAM). By that logic, once I turn off my cluster, they should be removed from memory, right?
However, even after I turn off my cluster, I can still view all the data in DBFS (exactly as in the picture I included above). I would expect that once the cluster is turned off, the RAM is cleared, so I should not be able to see any data that was held in RAM. Is my understanding incorrect?
I would appreciate it if you could answer my questions in order, with as much detail as possible.
Upvotes: 1
Views: 888
Reputation: 3676
When you write out the data to DBFS, it is stored in some form of permanent object storage separate from your cluster. This is why it is still there after the cluster shuts down. Which storage this is depends on the cloud your Databricks workspace runs in (e.g., S3 on AWS, ADLS on Azure, GCS on GCP).
This is the main idea of separating compute and storage: your clusters are the compute, and the storage lives elsewhere. Only when you read in and process the data is it distributed across your nodes. Once your cluster shuts down, everything in the nodes' RAM or on their local disks is gone, unless you have written it out to some form of permanent storage.
Upvotes: 0