Reputation: 61
I have a fairly powerful cluster: 3 nodes, each with 24 cores and 96 GB of RAM (288 GB total). I am trying to load 100 GB of TSV files into the Spark cache and run a series of simple computations over the data, like sum(col20) grouped by the col2-col4 combination. I think this is a clear-cut scenario for caching.
But during execution I found that the cache NEVER holds 100% of the data, despite plenty of free RAM. After 1 hour of execution I have 70% of the partitions in cache and 75 GB of cache used out of 170 GB available. It looks like Spark somehow limits the number of blocks/partitions it adds to the cache, instead of caching everything during the very first action and giving great performance from the start.
I use MEMORY_ONLY_SER with Kryo (cached size is approx. 110% of the on-disk data size).
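For context, here is a minimal sketch of the kind of job described above (the HDFS path, column positions, and object name are placeholders, not my actual code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TsvCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tsv-cache-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Load ~100 GB of TSV files and split each line into fields.
    val rows = sc.textFile("hdfs:///data/tsv/*.tsv").map(_.split("\t"))

    // Serialized in-memory cache (MEMORY_ONLY_SER + Kryo), as described above.
    rows.persist(StorageLevel.MEMORY_ONLY_SER)

    // Example query: sum(col20) grouped by the (col2, col3, col4) combination
    // (0-based field indices 19 and 1..3).
    val sums = rows
      .map(f => ((f(1), f(2), f(3)), f(19).toDouble))
      .reduceByKey(_ + _)

    sums.take(10).foreach(println)
    sc.stop()
  }
}
```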
Has anyone had a similar experience, or does anyone know of Spark configs / environment conditions that could cause this caching behaviour?
Upvotes: 1
Views: 399
Reputation: 61
So, "problem" was solved with further reducing of split size. With mapreduce.input.fileinputformat.split.maxsize set to 100mb I got 98% cache load after 1st action finished, and 100% at 2nd action.
Another thing that worsened my results was spark.speculation=true. I had hoped to avoid long-running tasks with it, but speculation management creates a big performance overhead and is useless for my case, so I left spark.speculation at its default value (false).
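If you want to pin that choice explicitly rather than rely on the default, a minimal sketch:

```scala
import org.apache.spark.SparkConf

// Leaving speculative execution off avoids the scheduler launching duplicate
// copies of long-running tasks, which only added overhead for this workload.
val conf = new SparkConf()
  .setAppName("tsv-cache-sketch")
  .set("spark.speculation", "false") // same as the default
```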
My performance comparison for 20 queries is as follows:
- without cache: 160 minutes (20 queries x 8 min each, reloading 100 GB from disk to memory every time)
- with cache: 33 minutes total: 10 min to load the cache to 100% (during the first 2 queries), then 18 queries x 1.5 minutes each (from the in-memory, Kryo-serialized cache)
Upvotes: 0