Moein Hosseini

Reputation: 4373

How to force spark to avoid Dataset re-computation?

I have a dataset that is loaded from Cassandra in Spark. After loading this dataset, I will remove some of the items from Cassandra, but I want my dataset to stay as it was first loaded for the next computation. I've used persist(StorageLevel.DISK_ONLY) to solve it, but it seems to be best effort only. How can I force Spark to avoid re-computation?

example:

 val dataset: Dataset[Int] = ??? // something from Cassandra
 dataset.persist(StorageLevel.DISK_ONLY) // it's best effort
 dataset.count // = 2n
 dataset.filter(_ % 2 == 0).remove // remove from Cassandra (pseudocode)
 dataset.count // = n if recomputed => I need the original dataset (2n) here

Upvotes: 0

Views: 1154

Answers (1)

zero323

Reputation: 330413

Spark cache is not intended to be used this way. It is an optimization, and even with the most conservative StorageLevels (DISK_ONLY_2), data can be lost and recomputed in case of worker failure or decommissioning.
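For illustration, a minimal sketch of the replicated disk-only persistence mentioned above (dataset is assumed to be the Dataset loaded from Cassandra):

import org.apache.spark.storage.StorageLevel

// Replicates each cached partition to two nodes, but this is still
// best effort: a partition lost on both replicas is silently
// recomputed from the original (by then already modified) source.
dataset.persist(StorageLevel.DISK_ONLY_2)
dataset.count // action forces the cache to be materialized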

Checkpointing to a reliable file system might be a better option, but I suspect there might be some border cases that can result in data loss.
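If you want to try that route anyway, a minimal sketch could look like this (the checkpoint directory path is just an assumption):

// Checkpoint files must live on a reliable file system.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

// eager = true (the default) materializes the data right away,
// cutting the lineage back to the Cassandra source.
val stable = dataset.checkpoint(eager = true)

stable.count // served from the checkpoint, not recomputed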

To ensure correctness I would strongly recommend at least writing intermediate data to persistent storage, such as a distributed file system, and reading it back:

dataset.write.format(...).save("persisted/location")
... // Remove data from the source
spark.read.format(...).load("persisted/location") // read the same data back
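Filled in with a concrete format it could look like this (Parquet and the HDFS path are just placeholders, any reliable format and location will do):

// Write a durable snapshot before touching the source.
dataset.write.parquet("hdfs:///snapshots/dataset")

// ... remove the rows from Cassandra ...

// Read the snapshot back; this Dataset no longer depends on Cassandra.
import spark.implicits._
val snapshot = spark.read.parquet("hdfs:///snapshots/dataset").as[Int]
snapshot.count // = 2n, unaffected by the deletes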

Upvotes: 2
