Reputation: 3507
I need to process some big files (~2 TB) with a small cluster (~10 servers), in order to produce a relatively small report (a few GB).
I only care about the final report, not intermediate results, and the machines have a large amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and touching the disk only when necessary.
Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this feature. The Spark website mentions a memory_and_disk option, but I'd prefer not to ask the company to deploy new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir
as /dev/shm/
in hdfs-default.xml, to trick it to use volatile memory instead of the filesystem to store data, still in this case it would behave badly, I assume, when the RAM gets full and it uses the swap.
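For reference, this is roughly the kind of override I'm describing (a minimal sketch; the subdirectory under /dev/shm is arbitrary, and site overrides normally go into hdfs-site.xml on each datanode rather than hdfs-default.xml):

<!-- hdfs-site.xml on each datanode: point the block storage directory at a tmpfs mount -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- /dev/shm is backed by RAM, so blocks land in memory
       (and can still be pushed to swap under memory pressure) -->
  <value>/dev/shm/hdfs-data</value>
</property>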
Is there a trick to make Hadoop store data blocks in RAM as much as possible and write to disk only when necessary?
Upvotes: 6
Views: 1933
Reputation: 4094
Since the release of Hadoop 2.3 you can use HDFS in-memory caching (centralized cache management) to pin specific HDFS paths into datanode memory.
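As a rough sketch (the pool name and path below are made up, and the datanodes also need dfs.datanode.max.locked.memory set high enough to hold the cached blocks):

# create a cache pool and pin the input directory into datanode memory
hdfs cacheadmin -addPool report-pool
hdfs cacheadmin -addDirective -path /data/input -pool report-pool

# inspect what is actually cached
hdfs cacheadmin -listDirectives -stats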
Upvotes: 2
Reputation: 13941
You can toy around with mapred.job.reduce.input.buffer.percent (it defaults to 0; try something closer to 1.0, see for example this blog post) and also try setting mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation; a sketch of both settings follows below.
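For illustration only, something along these lines in mapred-site.xml (or passed per job with -D name=value); the 0.9 is just a starting point, not a recommendation:

<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <!-- fraction of the reducer heap allowed to retain map outputs during the reduce;
       the default of 0 spills everything to disk before the reduce starts -->
  <value>0.9</value>
</property>
<property>
  <name>mapred.inmem.merge.threshold</name>
  <!-- 0 disables the file-count trigger for in-memory merges, leaving the
       merge governed by memory usage (mapred.job.shuffle.merge.percent) -->
  <value>0</value>
</property>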
Upvotes: 1