Reputation: 94
While investigating ways to improve the performance of some queries, I came across the Delta cache options, which left me with several questions. (A little knowledge is a dangerous thing.)
# enable the Delta cache on the cluster
spark.conf.set("spark.databricks.io.cache.enabled", "true")
-- prewarm the cache for a table (Databricks SQL)
CACHE SELECT * FROM tablename
I've basically got 3 tables that will be used heavily for analysis, and I want to improve performance. I've created them as Delta tables, partitioned on the columns most likely to appear in filter clauses (but not too high in cardinality), and applied ZORDER on a column that is common to all 3 tables and will be used in every join between them. I'm now exploring caching options to see if I can improve performance further.
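For reference, here is a minimal sketch of that setup, with hypothetical names (an "events" table partitioned on "event_date", joined on "customer_id"):

# create a partitioned Delta table (hypothetical names, for illustration only)
spark.sql("""
    CREATE TABLE events
    USING DELTA
    PARTITIONED BY (event_date)
    AS SELECT * FROM raw_events
""")
# co-locate rows sharing the join key within each partition's files
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")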
Upvotes: 3
Views: 3653
Reputation: 18023
See https://docs.databricks.com/delta/optimizations/delta-cache.html
In short:
It applies to your cluster and has nothing to do with your notebook.
It only caches Parquet files (including Delta); it does not support CSV, JSON, or ORC.
Your choice of cluster configuration can affect its setup and operation; see the linked docs.
You can use Delta caching and Apache Spark caching at the same time. The Delta cache holds local copies of remote data, so it can speed up a wide range of queries, but it cannot store the results of arbitrary subqueries; that is what Spark caching is for (see the sketch below).
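A minimal sketch of using both together, with a hypothetical "events" table and a pre-aggregated DataFrame reused across several queries:

# Delta cache: enabled per cluster; transparently keeps local copies
# of the remote Parquet/Delta files read by queries
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark cache: explicitly pin the result of an arbitrary (sub)query
daily_totals = spark.table("events").groupBy("event_date").count()
daily_totals.cache()
daily_totals.count()  # trigger an action to materialize the cache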
Upvotes: 2