Hadoop MapReduce WordCount example flaw?

Question

With reference to the basic WordCount example: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html I know that HDFS divides files into blocks, and that each map task works on a single block. So there is no guarantee that the block processed by a map task does not end partway through a word that continues in the next block, which would cause a miscount (one word counted twice). I know this is only an example and is always shown with small files, but wouldn't this be a problem in real-world scenarios?

Stefan Papp · Accepted Answer

In Hadoop, mappers work on input splits, not on blocks. An input split is a logical unit made up of complete records, so a record is never cut in half the way a block can be. You want to avoid the case where the data for one mapper's split spans two blocks, because that costs performance and creates network traffic.

In a text file, let's say block1 ends with "I am a Ha" and block2 continues with "doop developer". The sentence is still processed as one record, but this creates network traffic: a mapper always works on a full input split, so the part of the record stored in the other block has to be transferred from the node that holds it.
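To make the boundary handling concrete, here is a minimal Python sketch of the rule that Hadoop's line-based record reader follows for text input: a reader whose split does not start at offset 0 discards its first (possibly partial) line, and every reader finishes the last line it started, even if that line ends in the next block. This is a simulation only, not Hadoop code. The names `make_splits`, `read_split` and `word_count` are hypothetical helpers invented for this illustration, and the 9-byte split size is chosen just so the first boundary falls after "I am a Ha".

```python
from collections import Counter


def make_splits(length, split_size):
    """Hypothetical helper: cut [0, length) into fixed-size (start, end) ranges."""
    return [(s, min(s + split_size, length)) for s in range(0, length, split_size)]


def read_split(data, start, end):
    """Hypothetical helper: return the complete lines owned by one split.

    Simplified model of a line-based record reader, not the real Hadoop class.
    """
    pos = start
    if start != 0:
        # Discard the first line: the reader of the previous split
        # is responsible for finishing it.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Read every line that starts at or before the split end,
    # running past `end` if needed to reach the newline.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def word_count(data, split_size):
    """Hypothetical helper: run one 'mapper' per split and sum the counts."""
    counts = Counter()
    for start, end in make_splits(len(data), split_size):
        for line in read_split(data, start, end):
            counts.update(line.decode().split())
    return counts


text = b"I am a Hadoop developer\nHadoop splits lines safely\n"
counts = word_count(text, split_size=9)  # first boundary falls after "I am a Ha"
print(counts)
```

In this model the word "Hadoop" that straddles the first boundary is counted once, as a whole word, by the reader that owns the start of its line. The readers of the following splits skip the tail of that line, so nothing is counted twice.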
