user11704694

Storage options in Databricks

I am relatively new to the Databricks environment. My company has set up a Databricks account for me, where I am pulling data from an S3 bucket. I have a background in traditional relational databases, so it's a bit difficult for me to understand Databricks.

I have the following questions:

- Is a mount just a connection (a link to S3/external storage) with nothing stored on DBFS, or does it actually store the data on DBFS?

- I read somewhere that DBFS is also a mount? My understanding is that DBFS is Databricks storage. How can I see the total storage available for DBFS?

- We have different clusters for different teams within the company, and I don't have access to all the clusters. While exporting the data from S3, do I have to set up something in my code to ensure that the DataFrames and tables I am creating in Databricks are not accessible to other users who are not part of the cluster I am using?

- Where are the database tables stored? Are they on DBFS? In terms of storage options, is there any other storage apart from databases, DBFS, and external storage (S3, Azure, JDBC/ODBC, etc.)?

- Are tables/DataFrames always stored in memory when we load them? Is there a way to see the limit on the file size that can be held in memory?

Thanks!

Upvotes: 10

Views: 7659

Answers (1)

Raphael K

Reputation: 2353

Good questions! I'll do my best to answer these for you.

Is a mount just a connection (a link to S3/external storage) with nothing stored on DBFS, or does it actually store the data on DBFS? I read somewhere that DBFS is also a mount? My understanding is that DBFS is Databricks storage; how can I see the total storage available for DBFS?

DBFS is an abstraction layer on top of S3 that lets you access data as if it were a local file system. By default, when you deploy Databricks, a bucket is created that is used for storage and can be accessed via DBFS. When you mount to DBFS, you are essentially mounting an S3 bucket to a path on DBFS; the mount is a pointer to the bucket, and the data stays in the bucket rather than being copied onto DBFS. More details here.
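For example, a minimal sketch of a mount from a Python notebook (the bucket name and mount point below are placeholders, and it assumes access to the bucket is already configured via keys or an instance profile):

```python
# Placeholder bucket and mount point -- substitute your own values.
aws_bucket_name = "my-company-bucket"
mount_point = "/mnt/my-company-data"

# The mount only registers a pointer to the bucket; no data is copied to DBFS.
dbutils.fs.mount(
    source=f"s3a://{aws_bucket_name}",
    mount_point=mount_point
)

# Files in the bucket now appear under the mount point.
display(dbutils.fs.ls(mount_point))
```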

We have different clusters for different teams within the company, and I don't have access to all the clusters. While exporting the data from S3, do I have to set up something in my code to ensure that the DataFrames and tables I am creating in Databricks are not accessible to other users who are not part of the cluster I am using?

Mounting an S3 bucket to a path on DBFS will make that data available to others in your Databricks workspace. If you want to make sure no one else can access the data, you will have to take two steps. First, use IAM roles instead of mounts and attach the IAM role that grants access to the S3 bucket to the cluster you plan on using. Second, restrict access to the cluster to only those who can access the data. This way you lock down which clusters can access the data, and which users can access those clusters.
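As a rough sketch of the IAM-role approach: with the role attached to the cluster, you read the bucket path directly instead of mounting it, so nothing is exposed through a shared DBFS mount (the bucket path below is a placeholder):

```python
# Runs only on a cluster whose attached IAM role (instance profile) grants
# access to this bucket. No mount is created, so users on other clusters
# cannot reach the data through DBFS.
df = spark.read.parquet("s3a://my-restricted-bucket/path/to/data/")  # placeholder path
display(df)
```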

Where are the database tables stored? Are they on DBFS? In terms of storage options, is there any other storage apart from databases, DBFS, and external storage (S3, Azure, JDBC/ODBC, etc.)?

Database tables are stored on DBFS, typically under the /FileStore/tables path. Read more here.
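If you want to check where a particular table actually lives, a quick way from a notebook (the table name below is a placeholder) is:

```python
# The Location field in the output shows the storage path backing the table.
spark.sql("DESCRIBE TABLE EXTENDED my_table").show(truncate=False)
```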

Are tables/DataFrames always stored in memory when we load them? Is there a way to see the limit on the file size that can be held in memory?

This depends on your query. If your query is SELECT count(*) FROM table, then yes, the entire table is loaded into memory. If you are filtering, Spark will try to be efficient and read only those portions of the table that are necessary to execute the query. The amount of data you can hold in memory is proportional to the size of your cluster, since Spark partitions data in memory across the cluster. If you are still running out of memory, it's usually time to increase the size of your cluster or refine your query. Autoscaling on Databricks helps with the former.
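To illustrate, here is a sketch of how loading is lazy and caching is explicit (the path and column names are placeholders):

```python
df = spark.read.parquet("/mnt/my-company-data/events/")  # lazy: no data read yet

filtered = df.filter(df.country == "US")  # still lazy, nothing in memory

filtered.cache()          # mark the result for in-memory storage across the cluster
print(filtered.count())   # action: data is read, filtered, and cached

# The Storage tab of the Spark UI shows how much of the cached data fits in
# memory and how much spills to disk.
```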

Thanks!

You're welcome!

Edit: In-memory refers to RAM; DBFS does no processing. To see the available space, you have to log into your AWS/Azure account and check the S3/ADLS storage associated with Databricks.

If you save tables through Spark APIs, they will be on the /FileStore/tables path as well. The UI uses the same path.
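You can browse that path from a notebook to confirm, for example:

```python
# List the table storage path mentioned above.
display(dbutils.fs.ls("/FileStore/tables"))
```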

A cluster consists of a driver node and worker nodes. You can use a different EC2 instance type for the driver if you want. It's all one system, and that system is the cluster.

Spark supports partitioning the Parquet files associated with tables. In fact, this is a key strategy for improving the performance of your queries: partition by a column that you commonly filter on in your queries. This is separate from partitioning in memory.
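A sketch of a partitioned write (the table and column names are placeholders):

```python
# Write the DataFrame as a table partitioned on a commonly filtered column.
(df.write
   .partitionBy("order_date")
   .mode("overwrite")
   .saveAsTable("sales_partitioned"))

# A filter on the partition column lets Spark read only the matching
# directories instead of the whole table.
spark.sql(
    "SELECT count(*) FROM sales_partitioned WHERE order_date = '2019-06-01'"
).show()
```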

Upvotes: 16
