RobLW

Reputation: 94

Databricks Delta storage - Caching tables for performance

While investigating ways to improve the performance of some queries, I came across the Delta storage cache options below, which have left me with several questions (a little knowledge is dangerous):

spark.conf.set("spark.databricks.io.cache.enabled", "true")

CACHE SELECT * FROM tablename

I've basically got 3 tables that will be used heavily for analysis, and I wanted to improve their performance. I've created them as Delta storage, partitioned on the columns I think are most likely to appear in filter clauses (but not too high cardinality), and applied ZORDER on a column that all 3 tables share and that will be used in all joins between them. I'm now exploring caching options to see if I can improve performance even more.
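
For concreteness, here is a minimal PySpark sketch of that kind of setup; the path, table, and column names (events, event_date, customer_id) are hypothetical placeholders, not my actual tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the raw data (path is a placeholder)
df = spark.read.parquet("/mnt/raw/events")

# Write as a Delta table, partitioned on a low-cardinality column
# expected to appear in most filter clauses
(df.write.format("delta")
   .partitionBy("event_date")
   .saveAsTable("events"))

# Z-order on the join key shared by all three tables, so data skipping
# can prune files during joins (OPTIMIZE/ZORDER is Databricks-specific)
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")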

Upvotes: 3

Views: 3653

Answers (1)

Ged

Reputation: 18023

See https://docs.databricks.com/delta/optimizations/delta-cache.html

In short:

  • It applies to your cluster and has nothing to do with your notebook.

  • It does not support CSV, JSON, and ORC.

  • Your choice of cluster config can affect the setup and operation. See the linked docs.

  • You can use Delta caching and Apache Spark caching at the same time: the Delta cache holds local copies of remote data and can speed up a wide range of queries, but it cannot store the results of arbitrary subqueries; that is what Spark caching is for (a sketch of using both follows below).
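
As a rough sketch of using both together (the table and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta cache: keeps local copies of remote Parquet/Delta files on worker disks
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Eagerly pull a table's files into the Delta cache
spark.sql("CACHE SELECT * FROM events")

# Spark cache: pins the result of an arbitrary (sub)query in executor memory
agg = spark.sql("SELECT customer_id, count(*) AS n FROM events GROUP BY customer_id")
agg.cache()
agg.count()  # action that materializes the cached result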

Upvotes: 2
