Reputation: 11
I am a bit confused about HDFS storage and DataNode storage. Below are my doubts.
The map function's output is saved to the DataNode's local disk, and the reducer's output is sent to HDFS. As we all know, data blocks are stored on the DataNodes' local disks — so is there any other disk space available for HDFS on a DataNode?
What is the physical storage location of the reducer output file (part-nnnnn-r-00001)? Will it be stored on the NameNode's hard disk?
My assumption is that the DataNode is part of HDFS, so I assume the DataNode's local disk is also part of HDFS.
Regards Suresh
Upvotes: 1
Views: 3262
Reputation: 443
To answer your question:
First of all, we need to understand that map and reduce tasks run on DataNodes chosen by the framework. All of these nodes are part of HDFS itself.
So when we say "map function output will be saved to the data node's local disk", it means that after performing the map phase, that particular DataNode keeps the intermediate data on its local disk, in the ordinary local file system (e.g. on Unix), not in HDFS. It waits for the reducers to read it and perform the reduce phase. The mapper's DataNode keeps this data until the job is completed.
Then a reducer (again, a DataNode chosen by the framework) performs the reduce phase.
As per my understanding, when writing a MapReduce job we specify an output path; under that path the part-nnnnn-r-00001..1000 files and logs reside.
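For example, a typical job output directory might look like this (hypothetical path; the exact part-file numbering depends on the number of reducers, and the part files themselves are stored as HDFS blocks on DataNode disks, not on the NameNode):

```
/user/suresh/wordcount/output/
├── _SUCCESS          <- marker file written when the job completes successfully
├── _logs/            <- job logs (older Hadoop versions)
├── part-r-00000      <- output of reducer 0
└── part-r-00001      <- output of reducer 1
```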
Upvotes: 0
Reputation: 3359
You must know the difference between the virtual concept and the actual storage. HDFS (Hadoop Distributed File System) just specifies how data is stored across DataNodes. When you say you store a file in HDFS, it means it is virtually considered an HDFS file but is actually stored on the disks of one or more DataNodes.
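Concretely, where each daemon keeps its data on the local disk is set by configuration. A minimal sketch of the relevant `hdfs-site.xml` properties, assuming Hadoop 2.x property names and illustrative paths (adjust for your cluster; older releases used `dfs.data.dir` and `dfs.name.dir`):

```xml
<!-- hdfs-site.xml: illustrative values only -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- local directories where the DataNode stores HDFS block files -->
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- local directory where the NameNode stores metadata (fsimage, edit log) -->
  <value>/data/1/dfs/nn</value>
</property>
```

So the DataNode's "HDFS space" is just these configured local directories — there is no separate disk for HDFS, which answers the first doubt above.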
Let's see in detail how it works:
HDFS is a block-structured file system: it breaks individual files into blocks of a fixed size (64 MB by default). These blocks are stored across a cluster of machines consisting of one NameNode and several DataNodes.
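The splitting rule above is just ceiling division of the file size by the block size. A small self-contained sketch, assuming the classic 64 MB default (the real value is configurable via `dfs.blocksize`, and the class and file size here are hypothetical):

```java
// Sketch: how HDFS splits a file into fixed-size blocks.
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default

    // Number of HDFS blocks needed for a file of the given size in bytes.
    static long numBlocks(long fileSize) {
        if (fileSize == 0) return 0;
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a hypothetical 200 MB file
        // Three full 64 MB blocks plus one final 8 MB block.
        System.out.println(numBlocks(fileSize)); // prints 4
    }
}
```

Note the last block is usually smaller than the block size; HDFS only occupies the actual bytes on disk, not a full 64 MB per partial block.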
The NameNode handles the metadata structures (e.g., the names of files and directories) and regulates access to files; it also executes operations like open/close/rename. To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
Upvotes: 4