Reputation: 875
Data locality as defined by many Hadoop tutorial sites (i.e. https://techvidvan.com/tutorials/data-locality-in-hadoop-mapreduce/) states that: "Data locality in Hadoop is the process of moving the computation close to where the actual data resides instead of moving large data to computation. This minimizes overall network congestion."
I can understand having the node where the data resides process the computation for those data, instead of moving data around, would be efficient. However, what does it mean by "moving the computation close to where the actual data resides"? Does this mean that if the data sits in a server in Germany, it is better to use the server in France to do the computation on those data instead of using the server in Singapore to do the computation since France is closer to Germany than Singapore?
Upvotes: 3
Views: 2130
Reputation: 1
I think there is another way of explaining this. I found a YouTube video which explains this aspect in different perspective and I found it to be useful. Feel free to correct the content I share.
Here's the link for the video.
https://www.youtube.com/watch?v=LgKrqqTfU7E
Upvotes: 0
Reputation: 9425
I +1 Dennis Jaheruddin's answer, and just wanted to add -- you can actually see different locality levels in MR when you check job counters, in Job History UI for example.
HDFS and YARN are rack-aware so its not just binary same-or-other node: in the above screen, Data-local
means the task was running local to the machine that contained actual data; Rack-local
-- that the data wasn't local to the node running the task and needed to be copied, but was still on the same rack; and finally the Other local
case -- where the data wasn't available local, nor on the same rack, so it had to be copied over two switches to the node that run the computation.
Upvotes: 2
Reputation: 21561
Typically people talk about this on a quite different scale, especially within a Hadoop context.
Suppose you have a cluster of 5 nodes, you store a file there and need to do a calculation on it.
With data locality you try to make the calculation happen on the node(s) where the data is stored (rather than for example the first node that has compute resources available).
This reduces network load.
It is good to realize that in many new infrastructures the network is not the bottleneck, so you will keep hearing more about the decoupling of compute and storage.
Upvotes: 3