Reputation: 3507
I need to process some big files (~2 TB) with a small cluster (~10 servers), in order to produce a relatively small report (a few GB).
I only care about the final report, not intermediate results, and the machines have a large amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and touching the disk only when necessary.
Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this feature. The Spark website mentions a memory_and_disk option, but I'd prefer not to ask the company to deploy new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir
as /dev/shm/
in hdfs-default.xml, to trick it to use volatile memory instead of the filesystem to store data, still in this case it would behave badly, I assume, when the RAM gets full and it uses the swap.
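For reference, this is roughly the kind of override I'm describing (a minimal sketch; the subdirectory under /dev/shm is arbitrary, and site overrides normally go into hdfs-site.xml on each datanode rather than hdfs-default.xml):

<!-- hdfs-site.xml on each datanode: point the block storage directory at a tmpfs mount -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- /dev/shm is backed by RAM, so blocks land in memory
       (and can still be pushed to swap under memory pressure) -->
  <value>/dev/shm/hdfs-data</value>
</property>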
Is there a trick to make Hadoop store data blocks in RAM as much as possible and write to disk only when necessary?
Upvotes: 6
Views: 1933
Reputation: 4094
Since the release of Hadoop 2.3 you can use HDFS in-memory caching (centralized cache management) to pin specific HDFS paths into datanode memory.
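As a rough sketch (the pool name and path below are made up, and the datanodes also need dfs.datanode.max.locked.memory set high enough to hold the cached blocks):

# create a cache pool and pin the input directory into datanode memory
hdfs cacheadmin -addPool report-pool
hdfs cacheadmin -addDirective -path /data/input -pool report-pool

# inspect what is actually cached
hdfs cacheadmin -listDirectives -stats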
Upvotes: 2
Reputation: 13941
You can toy around with mapred.job.reduce.input.buffer.percent (it defaults to 0; try something closer to 1.0, see for example this blog post) and also try setting mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation; a sketch of both settings follows below.
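For illustration only, something along these lines in mapred-site.xml (or passed per job with -D name=value); the 0.9 is just a starting point, not a recommendation:

<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <!-- fraction of the reducer heap allowed to retain map outputs during the reduce;
       the default of 0 spills everything to disk before the reduce starts -->
  <value>0.9</value>
</property>
<property>
  <name>mapred.inmem.merge.threshold</name>
  <!-- 0 disables the file-count trigger for in-memory merges, leaving the
       merge governed by memory usage (mapred.job.shuffle.merge.percent) -->
  <value>0</value>
</property>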
Upvotes: 1