shujaat

Reputation: 389

Distributed Cache Concept in Hadoop

My question is about the concept of the distributed cache in Hadoop, specifically whether it should really be called a distributed cache. A conventional definition of a distributed cache is: "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity."

This is not true in Hadoop: the distributed cache distributes the same file, the one specified in the driver code, to all the nodes that run the job's tasks.

Shouldn't this be called a replicated cache? By the conventional definition, the intersection of the cached data across all nodes should be empty (or close to it). But in Hadoop the result of that intersection is the same file, which is present on every node.

Is my understanding correct, or am I missing something? Please guide.

Thanks

Upvotes: 1

Views: 452

Answers (2)

Ravindra babu

Reputation: 38950

I too agree that it's not really a "distributed cache". But I am convinced by YoungHobbit's comments on the efficiency of not hitting the disk for IO operations.

The only merit I have seen in this mechanism, as per the Apache documentation:

The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

Please note that DistributedCache has been deprecated since the 2.6.0 release. You have to use the new APIs in the Job class to achieve the same functionality.
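For example, a minimal driver sketch using the new API might look like this. This assumes Hadoop 2.6+ with the org.apache.hadoop.mapreduce API; the HDFS path "/apps/lookup.txt" and job name are illustrative placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example");

        // Replaces the deprecated DistributedCache.addCacheFile(...):
        // the framework copies this HDFS file to every node that runs
        // a task for this job, once per job. The "#lookup" fragment
        // names the symlink created in each task's working directory.
        job.addCacheFile(new URI("/apps/lookup.txt#lookup"));

        // ... set mapper/reducer classes and input/output paths,
        // then submit the job as usual ...
    }
}
```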

Upvotes: 1

YoungHobbit

Reputation: 13402

The general understanding and concept of any cache is to make data available in memory and avoid hitting the disk to read it, because reading data from disk is a costlier operation than reading it from memory.

Now let's take the same analogy to the Hadoop ecosystem. Here the disk is your HDFS and the memory is the local file system where the actual tasks run. During the life cycle of an application, multiple tasks may run on the same node. When the first task is launched on a node, it fetches the data from HDFS and puts it on the local file system. The subsequent tasks on the same node then do not fetch the same data again, which avoids the cost of repeatedly reading from HDFS by reading from the local file system instead. This is the concept of the Distributed Cache in the MapReduce framework, as shown in the sketch below.

The size of the data is usually small enough that it can be loaded into the mapper's memory, typically a few MBs.
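A hypothetical mapper sketch of that pattern: the cached file is read once in setup(), from the task's local working directory (the "lookup" symlink from the driver snippet in the other answer), not from HDFS. The tab-separated key/value file format is an assumption for illustration:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Local file system read: the framework already localized the
        // file to this node before the first task here started, so no
        // HDFS round trip is needed.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Enrich each input record from the in-memory lookup table.
        String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```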

Upvotes: 1
