NPE

Reputation: 1421

Hadoop - data block caching techniques

I am running some experiments to benchmark the time map-reduce takes to read and process data stored on HDFS with varying parameters. I use Pig scripts to launch the map-reduce jobs. Since I work with the same set of files repeatedly, my results may be affected by file/block caching.

I want to understand the various caching techniques employed in a map-reduce environment.

Let's say that a file foo (containing some data to be processed) is stored on HDFS, occupies 1 HDFS block, and that block is stored on machine STORE. During a map-reduce task, machine COMPUTE reads that block over the network and processes it. Caching can happen at two levels:

  1. Cached in memory of machine STORE (in-memory file cache)
  2. Cached in memory/disk of machine COMPUTE.

I am pretty sure that #1 happens. I want to know whether something like #2 happens as well. From the post here, it looks like there is no client-level caching going on, since it is very unlikely that the block cached by COMPUTE will be needed again on the same machine before the cache is flushed.

Also, is the Hadoop distributed cache used only to distribute application-specific files (not task-specific input data files) to all task tracker nodes? Or is task-specific input data (like the foo file block) also cached in the distributed cache? I assume local.cache.size and related parameters only control the distributed cache.

Please clarify.

Upvotes: 1

Views: 1757

Answers (1)

Thomas Jungblut

Reputation: 20969

The only caching that is ever applied within HDFS is the OS caching to minimize disk accesses. So if you read a block from a datanode, it is likely to still be in that node's OS cache on the next read, as long as nothing else running there has evicted it in the meantime.

On your client side, this depends on what you do with the block. If you directly write it to disk, it is also very likely that your client OS caches it.
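To get a feel for how much the OS cache can skew a benchmark, here is a minimal sketch in Python that times two back-to-back sequential reads of the same local file. It is not Hadoop code and does not touch HDFS; the helper names (`timed_read`, `compare_reads`, `make_sample_file`) are made up for illustration, and the sample file is a throwaway temp file rather than a real block.

```python
import os
import tempfile
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_reads(path):
    """Time two back-to-back reads of the same file (hypothetical helper)."""
    first = timed_read(path)
    second = timed_read(path)
    return first, second


def make_sample_file(size_bytes=8 * 1024 * 1024):
    """Create a temp file of random bytes to read back (illustration only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_sample_file()
    try:
        (n1, t1), (n2, t2) = compare_reads(path)
        print(f"first read:  {n1} bytes in {t1:.4f}s")
        print(f"second read: {n2} bytes in {t2:.4f}s")
    finally:
        os.remove(path)
```

Note that a file you have just written is usually already in the cache, so both reads above will probably be "warm". To see a cold read you would point `compare_reads` at a large file that has not been touched recently, or, on Linux, drop the cache as root first (`sync; echo 3 > /proc/sys/vm/drop_caches`). Doing the same on the datanodes between benchmark runs is one way to keep the cache from influencing your numbers.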

The distributed cache is just for jars and files that need to be distributed across the cluster where your job launches tasks. The name is thus a bit misleading, as it "caches" nothing.

Upvotes: 3
