guacacholay

Reputation: 11

Databricks / Spark storage mechanism for Delta tables, Delta logs, partitions, etc.

I am trying to understand how data is stored and managed in the Databricks environment. I have a fairly decent understanding of what is going on under the hood, but I have seen some conflicting information online, so I would like a detailed explanation to solidify my understanding. To frame my questions, I'd like to summarize what I did as part of one of the exercises in the Apache Spark Developer course.

As part of the exercise, I followed these steps on the Databricks platform (a rough code sketch follows the list):

  1. Started my cluster
  2. Read a parquet file as a DataFrame
  3. Stored the DataFrame as a Delta Table in my user directory in DBFS
  4. Made some changes to the Delta Table created in the previous step
  5. Partitioned the same Delta table by a specific column (e.g. State) and saved it to the same user directory in DBFS using overwrite mode
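Roughly, the code for steps 2 through 5 looked like the sketch below (the paths, the update expression, and the State column are illustrative, not my exact code):

```python
from delta.tables import DeltaTable

# Step 2: read a parquet file as a DataFrame
df = spark.read.parquet("dbfs:/mnt/training/some_dataset.parquet")

# Step 3: store the DataFrame as a Delta table in my user directory on DBFS
delta_path = "dbfs:/user/me@example.com/delta/my_table"
df.write.format("delta").save(delta_path)

# Step 4: make some changes to the Delta table (an update, for example)
DeltaTable.forPath(spark, delta_path).update(
    condition="State = 'CA'",
    set={"SomeColumn": "'new_value'"},
)

# Step 5: rewrite the same table partitioned by State, in overwrite mode.
# Changing the partitioning of an existing Delta table requires the
# overwriteSchema option.
spark.read.format("delta").load(delta_path) \
    .write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("State") \
    .save(delta_path)
```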

After following the above steps, here's how my DBFS directory looks:

[Screenshot: DBFS directory listing for the Delta table]

In the root folder that I used to store the Delta table (picture above), I have the following types of folders/files (a sample listing follows the list):

  1. The _delta_log folder
  2. Folders named after each 'State' value (from step 5 in the previous section); each state folder also contains 4 parquet files, which I suspect are partitions of the dataset
  3. Four separate parquet files, which I suspect are from when I created this Delta table (in step 3 of the previous section)
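Listing the directory with dbutils shows something like this (the path is hypothetical and the file names are abbreviated):

```python
# Root folder of the Delta table (substitute your own user directory)
for f in dbutils.fs.ls("dbfs:/user/me@example.com/delta/my_table"):
    print(f.name)

# Example output (abbreviated):
# _delta_log/
# State=CA/                             <- one folder per State value (step 5)
# State=NY/
# part-00000-...-c000.snappy.parquet    <- the four root-level files (step 3)
```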

Based on the above exercise, here are my questions:

  1. Is the data that I see in the above directory (the State-named folders containing the partitions, the parquet files, the Delta log, etc.) distributed across my nodes? (The answer, I presume, is yes.)
  2. The four parquet files in the root folder (from when I created the Delta table, before partitioning): assuming they are distributed across my nodes, are they stored in my nodes' RAM? Where is the data from the _delta_log folder stored? If it is also across my nodes, is it stored in RAM or on disk?
  3. Where is the data under each state-named folder (the parquet files/partitions from the screenshot above) stored? If this is also distributed across my nodes, is it in memory (RAM) or on disk?

Some of the answers I looked at online say that all the partitions are stored in memory (RAM). By that logic, once I turn off my cluster, they should be removed from memory, right?

However, even when I turn off my cluster, I am able to view all the data in DBFS (exactly like the picture I have included above). I suspect that once the cluster is turned off the RAM would be cleared, and therefore I should not be able to see any data that was in RAM. Is my understanding incorrect?

I would appreciate it if you could answer my questions in order, with as much detail as possible.

Upvotes: 1

Views: 888

Answers (1)

ScootCork

Reputation: 3676

When you write data out to DBFS, it is stored in some form of permanent object storage separate from your cluster; this is why it is still there after the cluster shuts down. Which storage that is depends on the cloud your Databricks workspace runs on (e.g. S3 on AWS, ADLS on Azure, GCS on GCP).

This is the main idea behind separating compute and storage: your clusters are the compute, and the storage lives elsewhere. Only when you read in and process the data is it distributed across your nodes. Once your cluster shuts down, any data in the nodes' RAM or on their local disks is gone, unless you have written it out to some form of permanent storage.
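As a rough sketch (the path is illustrative): even on a freshly started cluster you can still read the table straight from DBFS, because the parquet files and the _delta_log folder live in the cloud object store rather than on the old cluster's nodes:

```python
# On a brand-new cluster nothing is in RAM yet; this points at the
# table in the underlying object storage.
df = spark.read.format("delta").load("dbfs:/user/me@example.com/delta/my_table")

# Data is only pulled onto the cluster's nodes (RAM / local disk)
# when an action forces computation:
df.count()

# Caching keeps rows in memory between actions, but even cached data
# disappears once the cluster shuts down.
df.cache()
df.count()  # materializes the cache
```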

Upvotes: 0
