Reputation: 1982
I always understood that persist() and cache(), followed by an action to trigger the DAG, will compute the result and keep it in memory for later use. Many threads here recommend caching to improve the performance of a frequently used DataFrame.
Recently I did a test and was confused because that does not seem to be the case.
temp_tab_name = "mytablename"
x = spark.sql("select * from " + temp_tab_name + " limit 10")
x = x.persist()
x.count()   # action to activate all the above steps
x.show()    # x should have been persisted in memory here, DAG evaluated, no going back to "select..." whenever referred to
x.is_cached # True
spark.sql("drop table " + temp_tab_name)
x.is_cached # still True!
x.show()    # error: table not found
So it seems to me that x is never actually calculated and persisted; the next reference to x still goes back and evaluates its DAG definition, the "select...". Is there anything I missed here?
Upvotes: 3
Views: 7707
Reputation: 43
I know this is an old question, but I found something interesting about your example.
You cached your DataFrame, performed an action, and then dropped the source table. In Spark 2 your last dataframe.show() would have worked perfectly. But in Spark 3 there was a change: whenever you change the source table, all dependent caches are flushed. You can confirm this here: Upgrading from Spark SQL 3.1 to 3.2.
So yes, Spark really is caching your data, but any refreshing operation on the table will flush your cached DataFrame.
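As a minimal sketch of that behavior (illustrative only; "my_table" is a placeholder for an existing catalog table), the sequence below caches a DataFrame and then refreshes its source table. Per the migration note above, in Spark 3.2+ the dependent cache is invalidated, so the next action recomputes from the source:

# Sketch only: "my_table" is a placeholder for an existing table
df = spark.sql("select * from my_table limit 10")
df.persist()
df.count()                              # action that materializes the cache
spark.catalog.refreshTable("my_table")  # any refresh/alter/drop of the source table...
df.show()                               # ...and in Spark 3.2+ this action recomputes from the
                                        # source, because the dependent cache was flushed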
Upvotes: 0
Reputation: 2767
The correct syntax is below. Here is some additional documentation on "uncaching" tables: https://spark.apache.org/docs/latest/sql-performance-tuning.html. You can confirm the examples below in the Spark UI under the "Storage" tab to see the objects being cached and uncached.
# df method
df = spark.range(10)
df.cache() # cache
# df.persist() # acts same as cache
df.count() # action to materialize df object in ram
# df.foreach(lambda x: x) # another action to materialize df object in ram
df.unpersist() # remove df object from ram
# temp table method
df.createOrReplaceTempView("df_sql")
spark.catalog.cacheTable("df_sql") # cache
spark.sql("select * from df_sql").count() # action to materialize temp table in ram
spark.catalog.uncacheTable("df_sql") # remove temp table from ram
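As a side note on the "acts same as cache" comment above: persist() with no arguments behaves like cache(), but it can also take an explicit StorageLevel. A minimal, illustrative sketch (the chosen levels are just examples):

# Sketch only: persist() with explicit storage levels
from pyspark import StorageLevel

df = spark.range(10)
df.persist(StorageLevel.MEMORY_AND_DISK)  # keep partitions in memory, spill to disk if needed
df.count()                                # action to materialize
df.unpersist()                            # release it again

df.persist(StorageLevel.DISK_ONLY)        # keep cached partitions on disk only
df.count()
df.unpersist()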
Upvotes: 2
Reputation: 4017
cache and persist don't completely detach the computation result from the source. They just make a best effort to avoid recalculation. So, generally speaking, deleting the source before you are done with the dataset is a bad idea.
What could go wrong in your particular case (off the top of my head):
1) show doesn't need all records of the table, so it may trigger computation for only a few partitions. So most of the partitions are still not calculated at this point (see the sketch after this list).
2) Spark needs some auxiliary information from the table (e.g. for partitioning)
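A minimal sketch of point 1 (names and sizes are arbitrary): after the show() call the Spark UI "Storage" tab will typically report the DataFrame as only partially cached, while a full action such as count() touches, and therefore caches, every partition:

# Illustrative only: partial vs. full materialization of a cached DataFrame
df = spark.range(0, 1000000, numPartitions=8).persist()
df.show(5)     # needs only a few rows, so usually only the first partition(s)
               # get computed and cached ("Fraction Cached" < 100% in the Storage tab)
df.count()     # scans every partition, so the whole DataFrame is now cached
df.unpersist()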
Upvotes: 2