Reputation: 4373
I have a dataset that is loaded from Cassandra in Spark. After loading it, I will remove some of the items from Cassandra, but I want the dataset to stay as it was originally loaded for the next computation. I used persist(DISK_ONLY)
to solve this, but it seems to be best effort only.
How can I force Spark to avoid re-computation?
example:
val dataset: Dataset[Int] = ??? // something loaded from Cassandra
dataset.persist(StorageLevel.DISK_ONLY) // it's best effort only
dataset.count // = 2n
dataset.filter(_ % 2 == 0).remove // remove these rows from Cassandra (pseudocode)
dataset.count // = n => but I need the original dataset here
Upvotes: 0
Views: 1154
Reputation: 330413
Spark cache is not intended to be used this way. It is an optimization, and even with the most conservative StorageLevel (DISK_ONLY_2), data can be lost and recomputed in case of worker failure or decommissioning.
Checkpointing to a reliable file system might be a better option, but I suspect there are some corner cases which can result in data loss.
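For reference, a non-local checkpoint can be taken roughly like this (a minimal sketch; the checkpoint directory path and the eager flag are my assumptions, not part of the question):
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// The checkpoint directory should live on a reliable, distributed file system;
// "hdfs:///tmp/checkpoints" is only a placeholder path.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

val dataset: Dataset[Int] = ??? // loaded from Cassandra
// eager = true materializes the data immediately and truncates the lineage,
// so later changes to the Cassandra table should not trigger recomputation.
val checkpointed = dataset.checkpoint(eager = true)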
To ensure correctness I would strongly recommend at least writing intermediate data to persistent storage, such as a distributed file system, and reading it back:
dataset.write.format(...).save("persisted/location")
... // Remove data from the source
spark.read.format(...).load("persisted/location") // read the same data back
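As a more concrete illustration, here is a sketch assuming the spark-cassandra-connector DataFrame source and Parquet on a distributed file system (the keyspace, table and path names are placeholders):
// Read from Cassandra using the connector's DataFrame source.
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "items")) // placeholder keyspace/table
  .load()

// Persist an immutable snapshot before mutating the source table.
fromCassandra.write.mode("overwrite").parquet("hdfs:///persisted/location")

// ... remove rows from Cassandra here ...

// Further computations read the snapshot, not the live table.
val snapshot = spark.read.parquet("hdfs:///persisted/location")
snapshot.count() // unaffected by the deletions above
The snapshot only has to outlive the job, so it can be deleted once the downstream computation has finished.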
Upvotes: 2