Suresh Babu D.V

Reputation: 11

Output of reducer is sent to HDFS, whereas map output is stored on the data node's local disk?

I am a bit confused about HDFS storage and data node storage. Below are my doubts.

  1. Map function output is saved to the data node's local disk, and reducer output is sent to HDFS. As we all know, data blocks are stored on the data nodes' local disks; is there any other disk space available for HDFS on a data node?

  2. What is the physical storage location of the reducer output file (part-nnnnn-r-00001)? Will it be stored on the name node's hard disk?

My assumption is that the data node is part of HDFS, so I assume the data node's local disk is also part of HDFS.

Regards, Suresh

Upvotes: 1

Views: 3262

Answers (2)

orNehPraka

Reputation: 443

To answer your question:

  1. First of all, we need to understand that the map and reduce tasks of a job run on data nodes chosen by the MapReduce framework; all of these nodes are part of HDFS itself.

    So when we say that "map function output will be saved to the data node's local disk", it means that after performing the map phase, that particular data node keeps the intermediate data on its local disk, in the local file system (e.g., a Unix file system) rather than in HDFS. It waits for a reducer to read it and perform the reduce phase. The mapper's data node keeps this data until the job is completed.

    Then a reducer (running on some data node chosen by the framework) performs the reduce phase.

  2. As per my understanding, when writing a MapReduce job we specify an output path; the output files (part-r-00000, part-r-00001, and so on) and logs reside under that path itself (see the sketch below).
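For reference, here is a minimal driver sketch (the job name and paths are hypothetical) showing where that output path is specified. No Mapper/Reducer classes are set, so Hadoop falls back to its identity classes; the point is only where the data lands: intermediate map output is spilled to each node's local disk (under the directories configured by mapreduce.cluster.local.dir), while only the final part-r-* files are written to the HDFS output path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputPathDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output path demo");
        job.setJarByClass(OutputPathDemo.class);
        // No Mapper/Reducer set: Hadoop's identity classes are used, which
        // is enough to observe where the data is stored.

        // Input is read from HDFS. Intermediate map output does NOT go to
        // HDFS; it is spilled to each node's local disk, under the
        // directories configured by mapreduce.cluster.local.dir.
        FileInputFormat.addInputPath(job, new Path("/user/suresh/input"));

        // Only the final reducer output is written to HDFS, under this
        // directory, as files named part-r-00000, part-r-00001, ...
        FileOutputFormat.setOutputPath(job, new Path("/user/suresh/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```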

Upvotes: 0

Mouna

Reputation: 3359

You must know the difference between the virtual concept and the actual storage. HDFS (Hadoop Distributed File System) just specifies how data is stored across DataNodes. When you say you store a file in HDFS, it means that it is virtually considered an HDFS file but is actually stored on the disks of DataNodes.

Let's see in detail how this works:

  • HDFS is a block-structured file system: it breaks individual files into blocks of a fixed size (64 MB by default). These blocks are stored across a cluster of machines composed of one NameNode and several DataNodes.

  • The NameNode handles the metadata structures (e.g., the names of files and directories) and regulates access to files; it also executes operations like open/close/rename. To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, which keeps its overhead to a minimum.

  • DataNodes are responsible for serving read/write requests and for block creation, deletion, and replication. So every block in HDFS is physically stored on a DataNode (see the sketch after this list).
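To illustrate this split, here is a small sketch (the HDFS path is hypothetical) using the standard FileSystem Java API. The getFileBlockLocations call is answered by the NameNode from its metadata, and the hosts it returns are the DataNodes that physically store each block's replicas:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical reducer output file; substitute any HDFS path.
        Path file = new Path("/user/suresh/output/part-r-00000");
        FileStatus status = fs.getFileStatus(file);

        // Metadata query: served by the NameNode, which stores no file data.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block lists the DataNodes where its replicas physically live.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

A file larger than one block prints several lines, one per block, each listing the hosts that hold a replica.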

Upvotes: 4
