Reputation: 34407
I have installed Hadoop on a Windows platform with 2 other worker nodes (3 nodes in total). For demo purposes, I am processing a single file of about 1 megabyte.
How would the worker nodes divide this file for processing? Would each worker node process 341 KB (1024/3), or would a single worker node process the whole file?
And if I process 100 such files, would the worker nodes divide the number of files to be processed among themselves?
And what if I process a single file of about 100 MB?
Upvotes: 3
Views: 896
Reputation: 5063
Probable answers:
How would the worker nodes divide this file for processing? Would each worker node process 341 KB (1024/3), or would a single worker node process the whole file?
The Hadoop Distributed File System (HDFS) usually has a much larger block size than a normal file system such as NTFS or FAT on Windows. Blocks in HDFS are scattered across multiple nodes (machines) with replication, and if a rack topology script is provided, replication is handled better and blocks are placed more strategically to prevent data loss (e.g. if Hadoop unknowingly places both copies of a block with replication factor 2 in the same rack and the whole rack fails, oops! A better strategy is to store one copy in one rack and the replicated copy in a different rack). By default, the size of one block is 64 MB, so a 1 MB file will probably reside inside a single block and, of course, will be replicated across different nodes.

Usually a single map works on something called a split, which can be composed of one or more blocks, and there can be many splits handled by different maps. TextInputFormat handles text files with the end of line as a delimiter, and a map is fired for each split, which is roughly the size of a single block. To respect line boundaries, the split size may be slightly greater or smaller than a block. Bottom line: your 1 MB file, which resides in a single 64 MB block under normal conditions, will be processed by a single map task.
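To make that concrete, here is a minimal sketch of a job driver built around TextInputFormat (my own illustration, written against the newer org.apache.hadoop.mapreduce API; class names and paths are made up). With a 1 MB input file and the default 64 MB block size, FileInputFormat produces one split, so only one map task runs:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SingleFileDemo {

        // One map() call per line of the split; a 1 MB file => one split => one map task.
        public static class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(offset, line); // identity mapper, just to show the flow
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "single-file-demo");
            job.setJarByClass(SingleFileDemo.class);

            job.setInputFormatClass(TextInputFormat.class);   // line-oriented splits
            job.setOutputFormatClass(TextOutputFormat.class);

            job.setMapperClass(LineMapper.class);
            job.setNumReduceTasks(0);                          // map-only job for the demo

            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the 1 MB file
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }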
And if I process 100 such files, would the worker nodes divide the files to be processed among themselves?
If there are 100 such separate files, there is a good chance that 100 map tasks will be invoked, unless you use something like CombineFileInputFormat, which can combine several blocks (even from different files) into a single split for one map.
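For illustration only (not part of the original answer), the usual way to do this with CombineTextInputFormat, a concrete CombineFileInputFormat subclass that ships with newer Hadoop releases, looks roughly like the following; the 128 MB cap is an arbitrary choice, and the mapper is the identity mapper from the sketch above (assumed to be in the same package):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ManySmallFilesDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "many-small-files-demo");
            job.setJarByClass(ManySmallFilesDemo.class);

            // Pack many small files into fewer splits so that 100 tiny files
            // do not automatically become 100 map tasks.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // cap each split at ~128 MB

            job.setMapperClass(SingleFileDemo.LineMapper.class); // reuse the identity mapper above
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            CombineTextInputFormat.addInputPath(job, new Path(args[0])); // directory holding the 100 files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }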
Another option, if possible, is to combine those 100 files into a single file and process that.
And what if I process a single file of about 100 MB?
Again, assuming your block size is 64 MB, a 100 MB file with TextInputFormat should roughly be processed by 2 map tasks. As I said, with a different InputFormat things could be processed differently!
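As a back-of-the-envelope check (my own sketch, mirroring the max(minSize, min(maxSize, blockSize)) rule FileInputFormat uses to pick a split size), the map-task count is roughly the file size divided by the split size:

    public class SplitCountEstimate {
        // Roughly how FileInputFormat picks a split size: max(minSize, min(maxSize, blockSize)).
        static long splitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // default block size assumed in the answer
            long fileSize  = 100L * 1024 * 1024;  // the 100 MB file from the question

            long split = splitSize(blockSize, 1L, Long.MAX_VALUE);
            long maps  = (long) Math.ceil((double) fileSize / split);

            System.out.println(maps + " map tasks"); // prints "2 map tasks"
        }
    }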
Note (excerpt from here):
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs.
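Speculative execution is on by default, but it can be toggled per job if needed. A small sketch of my own (not from the excerpt), using the newer Job API; older releases expose the same knobs via the mapred.*.tasks.speculative.execution properties:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "speculation-demo");

            // Launch backup copies of straggling map tasks (the behaviour described above)...
            job.setMapSpeculativeExecution(true);
            // ...but keep reducers non-speculative, e.g. if they write to an external store.
            job.setReduceSpeculativeExecution(false);

            // Remaining job setup (mapper, input/output paths, ...) omitted for brevity.
        }
    }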
Upvotes: 2