Reputation: 389
My question is about the concept of a distributed cache, specifically in Hadoop, and whether it should really be called a distributed cache. A conventional definition of a distributed cache is: "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity".
This is not true in Hadoop, where the Distributed Cache distributes the same file, i.e. the one specified in the driver code, to all the nodes that run the tasks.
Shouldn't this be called a replicative cache? Going by the conventional definition, the intersection of the caches across the nodes should be empty (or close to it). But in Hadoop, the result of that intersection is the same file, present on every node.
Is my understanding correct, or am I missing something? Please guide.
Thanks
Upvotes: 1
Views: 452
Reputation: 38950
I too agree that it's not really a "distributed cache". But I am convinced by YoungHobbit's comments on the efficiency of not hitting the disk for IO operations.
The only merit I have seen in this mechanism, as per the Apache documentation, is:
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
Please note that DistributedCache has been deprecated since the 2.6.0 release. You have to use the new APIs in the Job class to achieve the same functionality, as sketched below.
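A minimal sketch of how the Job API is used instead of the deprecated DistributedCache class (the HDFS paths and symlink names here are hypothetical examples, not anything from your job):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example");

        // Register a file already stored in HDFS; the framework copies it to
        // each node once per job. The "#lookup" fragment creates a symlink
        // named "lookup" in the task's working directory.
        job.addCacheFile(new URI("/user/hadoop/lookup.txt#lookup"));

        // Archives (zip/tar/jar) are also supported and are un-archived
        // automatically on the node.
        job.addCacheArchive(new URI("/user/hadoop/dict.zip#dict"));
    }
}
```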
Upvotes: 1
Reputation: 13402
The general understanding and concept of any cache is to make data available in memory and avoid hitting the disk to read it, because reading data from disk is a costlier operation than reading it from memory.
Now let's take the same analogy to the Hadoop ecosystem. Here the disk is your HDFS and the memory is the local file system of the node where the actual tasks run. During the life cycle of an application, multiple tasks may run on the same node. When the first task is launched on a node, it fetches the data from HDFS and puts it on the local file system. The subsequent tasks on the same node do not fetch the same data again, which saves the cost of getting the data from HDFS versus getting it from the local file system. This is the concept of the Distributed Cache in the MapReduce framework.
The size of the data is usually small enough that it can be loaded into the Mapper's memory, typically a few MB.
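To illustrate, here is a rough sketch of a Mapper that reads the cached copy from the node's local file system once, in setup(), and keeps it in memory for all records. The file name "lookup" is a hypothetical symlink alias assumed to have been registered in the driver; the tab-separated key/value format is likewise just an assumption for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file is read from the node's local file system, not HDFS.
        // "lookup" is the symlink alias assumed to be set in the driver.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the in-memory table; no HDFS read is needed per record.
        String match = lookup.getOrDefault(value.toString(), "");
        context.write(new Text(value), new Text(match));
    }
}
```

Every task attempt on the same node reuses the local copy, which is exactly the saving described above.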
Upvotes: 1