Reputation: 495
I have a Spark Streaming application that uses stateful transformations quite a bit. In retrospect, Spark might not have been the best choice, but I'm still trying to make it work.
My question is: why do my MapWithStateRDDs take up so much memory? As an example, I have a transformation whose state is around 1.5 GB in memory, and I see that same RDD being stored again for each batch. So after the 3rd batch, the UI shows 3 MapWithStateRDDs of the exact same size, even though the state didn't change in those batches. Do these actually take up 3x the space? That seems like a huge waste; shouldn't it only store the deltas until a checkpoint and then compact them into one RDD, or something like that? I assumed that's how it works, and having multiple stateful transformations eats up a lot of memory.
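For reference, a minimal sketch of the kind of stateful pipeline I mean (the class name, socket source, keyed running count, and 10-second batch interval are just placeholders, not my actual app): each batch adds another MapWithStateRDD under the Storage tab.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulCounts {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StatefulCounts").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint("/tmp/spark-checkpoint"); // mapWithState requires a checkpoint directory

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Running count per key, carried across batches in state
        Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> mappingFunc =
            (word, one, state) -> {
                long sum = one.orElse(0L) + (state.exists() ? state.get() : 0L);
                state.update(sum);
                return new Tuple2<>(word, sum);
            };

        JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> counts =
            lines.mapToPair(word -> new Tuple2<>(word, 1L))
                 .mapWithState(StateSpec.function(mappingFunc));

        counts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}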
Upvotes: 1
Views: 257
Reputation: 549
As pointed out in the link in the comments, this happens because mapWithState only checkpoints the state every 10 batches by default, so it keeps the intermediate RDDs cached until that point.
To get rid of this wasted space, you can checkpoint the state at every batch instead. In my case that turned out not to be very expensive.
JavaInputDStream<ConsumerRecord<String, Object>> rtStream = ...
JavaMapWithStateDStream<String, Object, Object, Tuple2<String, Object>> mapWithStateStream =
        rtStream.mapToPair(...).mapWithState(...);
// keep this equal to your batch interval, or adjust to your requirements
mapWithStateStream.checkpoint(Durations.seconds(10));
mapWithStateStream.foreachRDD(...your logic here...);
And there you go, magic! You no longer see those irritating MapWithStateRDDs in the "Storage" tab.
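One caveat: this assumes a checkpoint directory is already set on the StreamingContext, which mapWithState requires in any case. A minimal sketch, with a placeholder path:

// mapWithState requires checkpointing to be enabled on the context;
// the local path below is just a placeholder (use a reliable store like HDFS in production)
JavaStreamingContext ssc = ... // your existing streaming context
ssc.checkpoint("/tmp/spark-checkpoint");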
Upvotes: 1