iga

Reputation: 3633

Spark directs shuffle output to disk even when there is plenty of RAM

We have a Spark cluster without local disks, and spilling is set up to go to NFS. We realize that this is not how Spark was designed to be used, but such are our current realities.

In this world, spills slow down Spark jobs a great deal and we would like to minimize them. For most of our jobs, the Spark executors have enough RAM to hold all intermediate computation results, yet we see that Spark always writes shuffle results to disk, i.e. to NFS in our case. We have played with every Spark config option that looked even vaguely related, trying to make Spark write shuffle outputs to RAM, to no avail.

I have seen in a few places, like "Does Spark write intermediate shuffle outputs to disk", that Spark prefers to write shuffle output to disk. My questions are:

Is there a way to make Spark use RAM for shuffle outputs when there is RAM available?

If not, what would be a way to make it use fewer, larger writes? We see it doing a lot of small 1-5KB writes and waiting on NFS latency after every write. The following config options didn't help (a sketch of how such options can be set follows the list):

spark.buffer.size
spark.shuffle.spill.batchSize
spark.shuffle.spill.diskWriteBufferSize
spark.shuffle.file.buffer
spark.shuffle.unsafe.file.output.buffer
spark.shuffle.sort.initialBufferSize
spark.io.compression.*.blockSize
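
For context, here is a representative sketch of how we passed such options when creating the session. The values are only illustrative (we tried various combinations), and none of them changed the small-write behaviour:

import org.apache.spark.sql.SparkSession

// Illustrative values only; none of these changed the small-write behaviour.
val spark = SparkSession.builder()
  .appName("shuffle-buffer-tuning")
  // In-memory buffer used when writing shuffle files.
  .config("spark.shuffle.file.buffer", "1m")
  // Buffer used by the unsafe shuffle writer for its output file.
  .config("spark.shuffle.unsafe.file.output.buffer", "1m")
  // Block size used by the compression codec on shuffle/spill files.
  .config("spark.io.compression.lz4.blockSize", "512k")
  .getOrCreate()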

Upvotes: 7

Views: 445

Answers (2)

Jose

Reputation: 673

You can decrease the number of spills with the option spark.memory.storageFraction. The default value is 0.5; you may try decreasing it to avoid spills, but do not set it to 0.

The documentation explains it as:

Amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction. The higher this is, the less working memory may be available to execution and tasks may spill to disk more often. Leaving this at the default value is recommended. For more detail, see this description.

I don't know if you have also tried it, but another way to decrease spills is with the option spark.memory.fraction. The default value is 0.6, and the maximum is 1.0. The documentation describes it as:

Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.

It can also be useful to check the documentation about tuning memory management here.
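
As an illustration only, both options could be set when building the session along these lines (the numbers below are a sketch, not a tuned recommendation for your workload):

import org.apache.spark.sql.SparkSession

// Sketch only: the numbers below are examples, not tuned recommendations.
val spark = SparkSession.builder()
  .appName("memory-fraction-tuning")
  // Fraction of (heap - 300MB) shared by execution and storage; raising it
  // leaves more room before execution has to spill.
  .config("spark.memory.fraction", "0.8")
  // Portion of that region protected from eviction; lowering it lets
  // execution borrow more memory, which can reduce spills.
  .config("spark.memory.storageFraction", "0.3")
  .getOrCreate()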

Upvotes: 1

bazza

Reputation: 8414

I've no experience with apache-spark whatsoever, but one way of ensuring that anything "intermediate" goes to RAM and not network-attached storage is to set up a RAM drive and point any file I/O at that. It might be a bit slower than objects simply being kept in memory, but it will be a whole lot quicker than being stored on an NFS server.

This would also be a way of controlling how much RAM is used, if something is otherwise going to allocate RAM to exhaustion. A RAM drive can have a defined maximum size, and file I/O on it will stop working once it is full. Whereas if an application hasn't got a "leave XX GB RAM spare" setting to limit how much it uses, it might run away allocating all the RAM for itself and bring the machine to a grinding halt.
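
If I understand the Spark docs correctly, in Spark terms this would amount to pointing spark.local.dir (where shuffle and spill files land) at the RAM drive. A minimal sketch, assuming a tmpfs mount already exists on every worker at a hypothetical path /mnt/spark-ramdisk:

import org.apache.spark.sql.SparkSession

// Assumes an admin has already created a RAM-backed mount on each worker,
// e.g. with something like: mount -t tmpfs -o size=32g tmpfs /mnt/spark-ramdisk
// (the path and size are hypothetical).
val spark = SparkSession.builder()
  .appName("ramdisk-shuffle")
  // Send shuffle and spill files to the RAM drive instead of NFS.
  .config("spark.local.dir", "/mnt/spark-ramdisk")
  .getOrCreate()

Note that under a cluster manager (Standalone, YARN) the manager's own local-directory setting, e.g. SPARK_LOCAL_DIRS or the YARN local dirs, overrides spark.local.dir, so the RAM drive path would have to be configured there instead.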

Upvotes: 0
