Reputation: 6550
I have been reading some literature on Hadoop Map/Reduce, and a general theme seems to be: Hadoop jobs are I/O intensive (example: sorting with Map/Reduce).
What makes these jobs I/O intensive, given that Hadoop pushes computation to the data? Example: why is sorting in Hadoop I/O intensive?
My intuition: it seems that after the map phase, the intermediate pairs are sent to the reducers. Is this what causes the huge I/O?
Upvotes: 3
Views: 3951
Reputation: 1
The performance of Hadoop MapReduce is limited because the file system generates a lot of input and output files, which slows the computation down. Keeping the intermediate MapReduce results in memory instead would speed up the computation.
Upvotes: 0
Reputation: 10919
In Hadoop's MapReduce framework, the results of each MapReduce operation are written back to disk, which can be highly I/O intensive. Apache Spark improves on this by caching intermediate results in memory when they are small enough, and by optimizing chains of transformations as a DAG executed on RDDs. See How is Apache Spark different from the Hadoop approach?
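To illustrate the difference, here is a minimal sketch using Spark's Java API (not taken from this answer): an intermediate RDD is persisted in memory so that a second action reuses it instead of recomputing it or writing it to disk between jobs, as MapReduce would. The input path and the filter conditions are just placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-example").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path -- replace with your own data set.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        // An intermediate result that we intend to reuse several times.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // Keep the intermediate RDD in memory instead of recomputing it
        // (or materializing it to disk between jobs, as MapReduce does).
        errors.persist(StorageLevel.MEMORY_ONLY());

        long total = errors.count();                                          // first action: computes and caches
        long fatal = errors.filter(line -> line.contains("FATAL")).count();   // second action: reuses the cache

        System.out.println("errors=" + total + ", fatal=" + fatal);
        sc.close();
    }
}
```

The second count never goes back to the input files: it reads the cached partitions from memory, which is exactly the saving over a chain of MapReduce jobs that would have to round-trip through HDFS.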
Upvotes: 1
Reputation: 5018
Hadoop is used for performing computations over large amounts of data. Your jobs might be bound by IO (I/O intensive, as you call it), CPU, or network resources. In the classic Hadoop use case you perform local computations over a huge amount of input data while returning a relatively small result set, which makes your task more IO intensive than CPU or network intensive, but it depends hugely on the job itself. Here are some examples:
You can refer to this guide for the initial tuning of the cluster.

So why is sorting IO intensive? First, you read the data from the disks. Next, in sorting the amount of data produced by the mappers is the same as the amount that was read, which means it most likely won't fit in memory and has to be spilled to the disks. It is then transferred to the reducers and spilled to the disks again. Then it is processed by the reducers and flushed to the disks once again. Meanwhile the CPU needed for sorting is relatively small, especially if the sort key is a number and can easily be parsed from the input data. A sketch of such a sort job is given below.
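As a concrete (hypothetical) illustration of that data flow, here is a minimal Hadoop MapReduce sort job sketch: the mapper only parses a numeric key and the reducer is an identity pass, so nearly all of the work happens in the framework's map-side spill, the shuffle transfer, and the reduce-side merge. The input format, key layout, and paths are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortJob {

    // Map: read a line, emit (numeric key, original line).
    // The framework sorts by key while spilling map output to local disk.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumption: each input line starts with an integer sort key, tab-separated.
            String[] parts = line.toString().split("\t", 2);
            context.write(new LongWritable(Long.parseLong(parts[0])), line);
        }
    }

    // Reduce: identity pass. Keys arrive already sorted after the shuffle,
    // so the reducer just writes everything back out to HDFS (more disk I/O).
    public static class IdentityReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sort");
        job.setJarByClass(SortJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Hypothetical paths, for illustration only.
        FileInputFormat.addInputPath(job, new Path("/data/unsorted"));
        FileOutputFormat.setOutputPath(job, new Path("/data/sorted"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Every stage touches storage: the map output is spilled and merged on local disk, the shuffle copies it across the network to the reducers, the reducers merge and spill again, and the final result is written back to HDFS. The only CPU work in the job itself is parsing the key and comparing longs, which is why a job like this ends up IO bound rather than CPU bound.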
Upvotes: 7