Reputation: 61
I have a fairly powerful cluster: 3 nodes, each with 24 cores and 96 GB of RAM (288 GB total). I am trying to load 100 GB of TSV files into the Spark cache and run a series of simple computations over the data, like sum(col20) grouped by the col2-col4 combination. I think this is a clear-cut scenario for caching.
But during execution I found that the cache NEVER holds 100% of the data, despite plenty of free RAM. After 1 hour of execution I have 70% of the partitions in cache and 75 GB of cache used out of 170 GB available. It looks like Spark somehow limits the number of blocks/partitions it adds to the cache, instead of caching everything during the very first action and giving great performance from the start.
I use MEMORY_ONLY_SER with Kryo (cached size is approx. 110% of the on-disk data size).
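For context, here is a minimal sketch of the kind of job described above (the HDFS path, column positions, and object name are placeholders, not my actual code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TsvCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tsv-cache-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Load ~100 GB of TSV files and split each line into fields.
    val rows = sc.textFile("hdfs:///data/tsv/*.tsv").map(_.split("\t"))

    // Serialized in-memory cache (MEMORY_ONLY_SER + Kryo), as described above.
    rows.persist(StorageLevel.MEMORY_ONLY_SER)

    // Example query: sum(col20) grouped by the (col2, col3, col4) combination
    // (0-based field indices 19 and 1..3).
    val sums = rows
      .map(f => ((f(1), f(2), f(3)), f(19).toDouble))
      .reduceByKey(_ + _)

    sums.take(10).foreach(println)
    sc.stop()
  }
}
```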
Has anyone had a similar experience, or does anyone know of Spark configs / environment conditions that could cause this caching behaviour?
Upvotes: 1
Views: 399
Reputation: 61
So, "problem" was solved with further reducing of split size. With mapreduce.input.fileinputformat.split.maxsize set to 100mb I got 98% cache load after 1st action finished, and 100% at 2nd action.
Another thing that worsened my results was spark.speculation=true. I had hoped to avoid long-running tasks with it, but speculation management creates a big performance overhead and is useless for my case, so I left spark.speculation at its default value (false).
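If you want to pin that choice explicitly rather than rely on the default, a minimal sketch:

```scala
import org.apache.spark.SparkConf

// Leaving speculative execution off avoids the scheduler launching duplicate
// copies of long-running tasks, which only added overhead for this workload.
val conf = new SparkConf()
  .setAppName("tsv-cache-sketch")
  .set("spark.speculation", "false") // same as the default
```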
My performance comparison for 20 queries is as follows:
- without cache: 160 minutes (20 queries x 8 min each, reloading 100 GB from disk to memory every time)
- with cache: 33 minutes total: 10 min to load the cache to 100% (during the first 2 queries), then 18 queries x 1.5 minutes each (from the in-memory, Kryo-serialized cache)
Upvotes: 0