guacacholay

Reputation: 11

Databricks / Spark storage mechanism for Delta tables, Delta logs, partitions, etc.

I am trying to understand how data is stored and managed in the Databricks environment. I have a fairly decent understanding of what is going on under the hood, but I have seen some conflicting information online, so I would like a detailed explanation to solidify my understanding. To frame my questions, I'd like to summarize what I did as part of one of the exercises in the Apache Spark Developer course.

As part of the exercise, I followed these steps on the Databricks platform (a rough code sketch follows the list):

  1. Started my cluster
  2. Read a parquet file as a DataFrame
  3. Stored the DataFrame as a Delta Table in my user directory in DBFS
  4. Made some changes to the Delta Table created in the previous step
  5. Partitioned the same Delta table by a specific column (e.g. State) and saved it to the same user directory in DBFS using overwrite mode
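Roughly, the code for steps 2 through 5 looked like the sketch below (the paths, the update expression, and the State column are illustrative, not my exact code):

```python
from delta.tables import DeltaTable

# Step 2: read a parquet file as a DataFrame
df = spark.read.parquet("dbfs:/mnt/training/some_dataset.parquet")

# Step 3: store the DataFrame as a Delta table in my user directory on DBFS
delta_path = "dbfs:/user/me@example.com/delta/my_table"
df.write.format("delta").save(delta_path)

# Step 4: make some changes to the Delta table (an update, for example)
DeltaTable.forPath(spark, delta_path).update(
    condition="State = 'CA'",
    set={"SomeColumn": "'new_value'"},
)

# Step 5: rewrite the same table partitioned by State, in overwrite mode.
# Changing the partitioning of an existing Delta table requires the
# overwriteSchema option.
spark.read.format("delta").load(delta_path) \
    .write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("State") \
    .save(delta_path)
```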

After following the above steps, here's how my DBFS directory looks:

[Screenshot: DBFS directory listing for the Delta table]

In the root folder that I used to store the Delta table (picture above), I have the following types of folders/files (a sample listing follows the list):

  1. The _delta_log folder
  2. Folders named after each 'State' value (from step 5 in the previous section); each state folder also contains 4 parquet files, which I suspect are partitions of the dataset
  3. Four separate parquet files, which I suspect are from when I created this Delta table (in step 3 of the previous section)
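Listing the directory with dbutils shows something like this (the path is hypothetical and the file names are abbreviated):

```python
# Root folder of the Delta table (substitute your own user directory)
for f in dbutils.fs.ls("dbfs:/user/me@example.com/delta/my_table"):
    print(f.name)

# Example output (abbreviated):
# _delta_log/
# State=CA/                             <- one folder per State value (step 5)
# State=NY/
# part-00000-...-c000.snappy.parquet    <- the four root-level files (step 3)
```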

Based on the above exercise, here are my questions:

  1. Is the data that I see in the above directory (the State-named folders containing the partitions, the parquet files, the Delta log, etc.) distributed across my nodes? (The answer, I presume, is yes.)
  2. The four parquet files in the root folder (from when I created the Delta table, before partitioning): assuming they are distributed across my nodes, are they stored in my nodes' RAM? Where is the data from the _delta_log folder stored? If it is also across my nodes, is it stored in RAM or on disk?
  3. Where is the data under each state-named folder (the parquet files/partitions from the screenshot above) stored? If this is also distributed across my nodes, is it in memory (RAM) or on disk?

Some of the answers I looked at online say that all the partitions are stored in memory (RAM). By that logic, once I turn off my cluster, they should be removed from memory, right?

However, even when I turn off my cluster, I am able to view all the data in DBFS (exactly like the picture I have included above). I suspect that once the cluster is turned off the RAM would be cleared, and therefore I should not be able to see any data that was in RAM. Is my understanding incorrect?

I would appreciate it if you could answer my questions in order, with as much detail as possible.

Upvotes: 1

Views: 888

Answers (1)

ScootCork

Reputation: 3676

When you write data out to DBFS, it is stored in some form of permanent object storage separate from your cluster; this is why it is still there after the cluster shuts down. Which storage that is depends on the cloud your Databricks workspace runs on (e.g. S3 on AWS, ADLS on Azure, GCS on GCP).

This is the main idea behind separating compute and storage: your clusters are the compute, and the storage lives elsewhere. Only when you read in and process the data is it distributed across your nodes. Once your cluster shuts down, any data in the nodes' RAM or on their local disks is gone, unless you have written it out to some form of permanent storage.
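As a rough sketch (the path is illustrative): even on a freshly started cluster you can still read the table straight from DBFS, because the parquet files and the _delta_log folder live in the cloud object store rather than on the old cluster's nodes:

```python
# On a brand-new cluster nothing is in RAM yet; this points at the
# table in the underlying object storage.
df = spark.read.format("delta").load("dbfs:/user/me@example.com/delta/my_table")

# Data is only pulled onto the cluster's nodes (RAM / local disk)
# when an action forces computation:
df.count()

# Caching keeps rows in memory between actions, but even cached data
# disappears once the cluster shuts down.
df.cache()
df.count()  # materializes the cache
```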

Upvotes: 0
