Reputation: 1982
I always understood that persist() and cache(), followed by an action to trigger the DAG, will compute the result and keep it in memory for later use. Many threads here recommend caching to improve the performance of a frequently used DataFrame.
Recently I did a test and was confused because that does not seem to be the case.
temp_tab_name = "mytablename"
x = spark.sql("select * from " + temp_tab_name + " limit 10")
x = x.persist()
x.count()   # action to activate all the above steps
x.show()    # x should have been persisted in memory here, DAG evaluated, no going back to "select..." whenever referred to
x.is_cached # True
spark.sql("drop table " + temp_tab_name)
x.is_cached # still True!
x.show()    # error: table not found
So it seems to me that x is never actually calculated and persisted; the next reference to x still goes back and evaluates its DAG definition, the "select...". Is there anything I missed here?
Upvotes: 3
Views: 7707
Reputation: 43
I know this is an old question, but I found something interesting about your example.
You cached your DataFrame, performed an action, and then dropped the source table. In Spark 2 your last dataframe.show() would have worked perfectly. But in Spark 3 there was a change: whenever you change the source table, all dependent caches are flushed. You can confirm this here: Upgrading from Spark SQL 3.1 to 3.2.
So yes, Spark really is caching your data, but any refreshing operation on the table will flush your cached DataFrame.
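As a minimal sketch of that behavior (illustrative only; "my_table" is a placeholder for an existing catalog table), the sequence below caches a DataFrame and then refreshes its source table. Per the migration note above, in Spark 3.2+ the dependent cache is invalidated, so the next action recomputes from the source:

# Sketch only: "my_table" is a placeholder for an existing table
df = spark.sql("select * from my_table limit 10")
df.persist()
df.count()                              # action that materializes the cache
spark.catalog.refreshTable("my_table")  # any refresh/alter/drop of the source table...
df.show()                               # ...and in Spark 3.2+ this action recomputes from the
                                        # source, because the dependent cache was flushed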
Upvotes: 0
Reputation: 2767
The correct syntax is below. Here is some additional documentation on "uncaching" tables: https://spark.apache.org/docs/latest/sql-performance-tuning.html. You can confirm the examples below in the Spark UI under the "Storage" tab to see the objects being cached and uncached.
# df method
df = spark.range(10)
df.cache() # cache
# df.persist() # acts same as cache
df.count() # action to materialize df object in ram
# df.foreach(lambda x: x) # another action to materialize df object in ram
df.unpersist() # remove df object from ram
# temp table method
df.createOrReplaceTempView("df_sql")
spark.catalog.cacheTable("df_sql") # cache
spark.sql("select * from df_sql").count() # action to materialize temp table in ram
spark.catalog.uncacheTable("df_sql") # remove temp table from ram
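As a side note on the "acts same as cache" comment above: persist() with no arguments behaves like cache(), but it can also take an explicit StorageLevel. A minimal, illustrative sketch (the chosen levels are just examples):

# Sketch only: persist() with explicit storage levels
from pyspark import StorageLevel

df = spark.range(10)
df.persist(StorageLevel.MEMORY_AND_DISK)  # keep partitions in memory, spill to disk if needed
df.count()                                # action to materialize
df.unpersist()                            # release it again

df.persist(StorageLevel.DISK_ONLY)        # keep cached partitions on disk only
df.count()
df.unpersist()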
Upvotes: 2
Reputation: 4017
cache and persist don't completely detach the computation result from the source. They just make a best effort to avoid recalculation. So, generally speaking, deleting the source before you are done with the dataset is a bad idea.
What could go wrong in your particular case (off the top of my head):
1) show doesn't need all records of the table, so it may trigger computation for only a few partitions. So most of the partitions are still not calculated at this point (see the sketch after this list).
2) Spark needs some auxiliary information from the table (e.g. for partitioning)
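A minimal sketch of point 1 (names and sizes are arbitrary): after the show() call the Spark UI "Storage" tab will typically report the DataFrame as only partially cached, while a full action such as count() touches, and therefore caches, every partition:

# Illustrative only: partial vs. full materialization of a cached DataFrame
df = spark.range(0, 1000000, numPartitions=8).persist()
df.show(5)     # needs only a few rows, so usually only the first partition(s)
               # get computed and cached ("Fraction Cached" < 100% in the Storage tab)
df.count()     # scans every partition, so the whole DataFrame is now cached
df.unpersist()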
Upvotes: 2