Reputation: 4373
I have a dataset that is loaded from Cassandra in Spark. After loading it, I will remove some of the items from Cassandra, but I want the dataset to stay as it was originally loaded for the next computation. I used persist(DISK_ONLY)
to solve this, but it seems to be best effort only.
How can I force Spark to avoid re-computation?
example:
val dataset: Dataset[Int] = ??? // something loaded from Cassandra
dataset.persist(StorageLevel.DISK_ONLY) // it's best effort only
dataset.count // = 2n
dataset.filter(_ % 2 == 0).remove // remove these rows from Cassandra (pseudocode)
dataset.count // = n => but I need the original dataset here
Upvotes: 0
Views: 1154
Reputation: 330413
Spark cache is not intended to be used this way. It is an optimization, and even with the most conservative StorageLevel (DISK_ONLY_2), data can be lost and recomputed in case of worker failure or decommissioning.
Checkpointing to a reliable file system might be a better option, but I suspect there are some corner cases which can result in data loss.
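For reference, a non-local checkpoint can be taken roughly like this (a minimal sketch; the checkpoint directory path and the eager flag are my assumptions, not part of the question):
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// The checkpoint directory should live on a reliable, distributed file system;
// "hdfs:///tmp/checkpoints" is only a placeholder path.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

val dataset: Dataset[Int] = ??? // loaded from Cassandra
// eager = true materializes the data immediately and truncates the lineage,
// so later changes to the Cassandra table should not trigger recomputation.
val checkpointed = dataset.checkpoint(eager = true)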
To ensure correctness I would strongly recommend at least writing intermediate data to persistent storage, such as a distributed file system, and reading it back:
dataset.write.format(...).save("persisted/location")
... // Remove data from the source
spark.read.format(...).load("persisted/location") // read the same data back
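As a more concrete illustration, here is a sketch assuming the spark-cassandra-connector DataFrame source and Parquet on a distributed file system (the keyspace, table and path names are placeholders):
// Read from Cassandra using the connector's DataFrame source.
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "items")) // placeholder keyspace/table
  .load()

// Persist an immutable snapshot before mutating the source table.
fromCassandra.write.mode("overwrite").parquet("hdfs:///persisted/location")

// ... remove rows from Cassandra here ...

// Further computations read the snapshot, not the live table.
val snapshot = spark.read.parquet("hdfs:///persisted/location")
snapshot.count() // unaffected by the deletions above
The snapshot only has to outlive the job, so it can be deleted once the downstream computation has finished.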
Upvotes: 2