Kyle Murray

Reputation: 73

In Spark, is there any way to unpersist a DataFrame/RDD in the middle of an execution plan?

Given the following series of events:

df1 = read
df2 = df1.action
df3 = df1.action
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()

The data forks twice, so df1 will be read four times. I therefore want to persist the data. From what I understand, this is the way to do so:

df1 = read
df1.persist()
df2 = df1.action
df3 = df1.action
df2.persist()
df3.persist()
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()
df1.unpersist()
df2.unpersist()
df3.unpersist()

However, this keeps all three persisted at once, which isn't storage efficient given that I no longer need df1 persisted once df2 and df3 have both been created. I'd like to order it more like this:

df1 = read
df1.persist()
df2 = df1.action
df3 = df1.action
df1.unpersist()
df2.persist()
df3.persist()
df2a = df2.action
df2b = df2.action
df2.unpersist()
df3a = df3.action
df3b = df3.action
df3.unpersist()
df4 = union(df2a, df2b, df3a, df3b)
df4.collect()

However, this just leads to the data not being persisted at all, because persist() is lazy and I need to trigger an action before unpersisting. Is there any way to accomplish what I'm looking for (unpersisting intermediate DataFrames in the middle of the execution plan)?
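
For concreteness, here is a minimal runnable PySpark sketch of that failure mode (the spark.range source and the filter call are illustrative stand-ins, not my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(100)
df1.persist()
print(df1.storageLevel)   # non-default level: df1 is marked for caching

df2 = df1.filter("id % 2 = 0")
df1.unpersist()           # no action has run yet, so nothing was ever cached
print(df1.storageLevel)   # back to the default: the mark is simply gone

df2.count()               # recomputes df1 from scratch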

Upvotes: 1

Views: 1626

Answers (1)

Chitral Verma

Reputation: 2855

This is not possible, but the steps can be rearranged slightly for the better.

Transformations only build the DAG; nothing executes until an action is triggered, and that is when the actual persistence happens. If a cached parent RDD is unpersisted, then all of its cached child RDDs are unpersisted as well. This is a design choice that favors correctness and consistency of the data, and it is the reason your data is not being persisted at all.
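
To illustrate the laziness (a minimal sketch, assuming an existing SparkSession named spark):

df = spark.range(10 ** 7)
df.persist()   # nothing is cached yet, only a mark on the plan
df.count()     # first action: executes the plan and fills the cache
df.count()     # served from the cache, so noticeably faster
df.unpersist()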

Slightly improving your steps:

df1 = read
df1.persist() 

df2 = df1.action # after this df1 will be persisted
df3 = df1.action # this will be faster as df1 is cached

df2.persist()
df3.persist()

# perform one action on each of df2 and df3 to trigger their caching
df2a = df2.action
df3a = df3.action

df2b = df2.action # this will be faster as df2 is cached
df3b = df3.action # this will be faster as df3 is cached

df4 = union(df2a, df2b, df3a, df3b)
df4.collect()

df1.unpersist() # this, along with its cached dependents, will get unpersisted
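
For reference, a runnable PySpark version of the rearranged steps; the filter/select transformations and the count() actions here are hypothetical stand-ins for your .action calls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1_000_000)        # stands in for `read`
df1.persist()

df2 = df1.filter("id % 2 = 0")
df3 = df1.filter("id % 2 = 1")

df2.persist()
df3.persist()

df2a = df2.selectExpr("id * 2 AS v")
df2b = df2.selectExpr("id + 1 AS v")
df3a = df3.selectExpr("id * 3 AS v")
df3b = df3.selectExpr("id - 1 AS v")

# one action per cached branch: the first materializes the caches of
# df1 and df2, the second materializes the cache of df3
df2a.count()
df3a.count()

df4 = df2a.union(df2b).union(df3a).union(df3b)
df4.collect()

# explicit unpersists shown for safety; whether unpersisting df1 alone
# cascades to df2 and df3 depends on the Spark version (see references)
df1.unpersist()
df2.unpersist()
df3.unpersist()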

Related References:

  1. https://github.com/apache/spark/pull/17097
  2. https://issues.apache.org/jira/browse/SPARK-21579

Upvotes: 2
