Nitish Upreti

Reputation: 6550

Why is Hadoop considered to be I/O intensive?

I have been reading some literature on Hadoop Map/Reduce, and a recurring theme is that Hadoop jobs are I/O intensive (example: sorting with Map/Reduce).

What makes these jobs I/O intensive, given that Hadoop pushes computation to the data? For example, why is sorting in Hadoop I/O intensive?

My intuition: after the map phase, the intermediate pairs are sent to the reducers. Is this what causes the heavy I/O?

Upvotes: 3

Views: 3951

Answers (3)

The performance of Hadoop and MapReduce is limited because the framework writes a lot of intermediate input and output files to the file system, which slows computation down. If the results of each MapReduce stage could be kept in memory instead, the computation would be much faster.

Upvotes: 0

qwr

Reputation: 10919

In Hadoop's MapReduce framework, the results of each MapReduce operation are written back to disk, which can be highly I/O intensive. Apache Spark improves on this by caching intermediate results in memory when they are small enough, and by optimizing chains of transformations as a DAG over RDDs. See: How is Apache Spark different from the Hadoop approach?
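A minimal sketch of the difference, using Spark's Java API (the input path and filter predicate here are hypothetical, just to show the pattern): the cached intermediate RDD is reused by a second action without being re-read from HDFS or re-materialized on disk, whereas a second MapReduce job would have to read the first job's output back from HDFS.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path, for illustration only
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");

        // Intermediate result kept in memory instead of being written to disk
        JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();

        long total = errors.count();               // first action materializes and caches the RDD
        long distinct = errors.distinct().count(); // reuses the in-memory RDD, no re-read from HDFS

        System.out.println(total + " total, " + distinct + " distinct");
        sc.stop();
    }
}
```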

Upvotes: 1

0x0FFF

Reputation: 5018

Hadoop is used for performing computations over large amounts of data. Your jobs may be bound by I/O (I/O intensive, as you call it), by CPU, or by the network. In the classic Hadoop use case you perform local computations over a huge amount of input data while returning a relatively small result set, which makes the task more I/O intensive than CPU or network intensive, but it depends heavily on the job itself. Some examples:

  1. I/O intensive job. You read a lot of data on the map side, but the result of the map task is not that big. Examples: counting the rows in an input text file, summing a column of an RCFile, or running a Hive query over a single table with a GROUP BY on a column of relatively low cardinality. In these jobs the work is mostly reading the data and doing some simple processing on it.
  2. CPU intensive job. You perform complex computations on the map or reduce side, for instance NLP (natural language processing) tasks such as tokenization, part-of-speech tagging, or stemming. Also, if you store the data in a format with a high compression ratio, decompression may become the bottleneck (here's an example from Facebook, where they looked for a balance between CPU and I/O).
  3. Network intensive job. High network utilization on the cluster usually means someone has missed the point and implemented a job that moves a lot of data over the network. Take word count as an example: imagine processing 1PB of input with only a mapper and a reducer, no combiner. The data moved between the map and reduce tasks would be even bigger than the input data set, and all of it would be sent over the network. It can also mean that intermediate data compression is disabled (mapred.compress.map.output and mapred.map.output.compression.codec) and the raw map output is sent over the network; see the sketch after this list.
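For item 3, a hedged sketch of turning on map-output compression through the two properties named above (the job name and codec choice are illustrative; newer Hadoop releases spell these properties mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

// Compress intermediate (map output) data so less raw map output
// crosses the network during the shuffle.
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              SnappyCodec.class, CompressionCodec.class); // codec choice is illustrative
Job job = Job.getInstance(conf, "compressed-shuffle");    // hypothetical job name
```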

You can refer to this guide for the initial tuning of the cluster.

So why is sorting I/O intensive? First, you read the data from the disks. Next, since the amount of data produced by the mappers in a sort is the same as the amount read, it most likely won't fit in memory and has to be spilled to the local disks. It is then transferred to the reducers and spilled to disk again, and finally processed by the reducers and flushed to disk once more. The CPU needed for sorting is comparatively small, especially if the sort key is a number that can be parsed cheaply from the input data.
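As a rough back-of-envelope estimate (my numbers, purely for illustration): sorting a 1TB input means reading roughly 1TB in the mappers, spilling roughly 1TB of sorted map output to local disk, shuffling roughly 1TB over to the reducers (spilled again if it does not fit in memory), and writing roughly 1TB of final output to HDFS. That is on the order of 4TB of disk and network traffic for a job whose CPU work is little more than key comparisons, which is why sorting is the canonical I/O-bound Hadoop workload.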

Upvotes: 7
