Reputation: 451
The HDFS Client is outside the HDFS Cluster. When the HDFS Client write the file to hadoop the HDFS clients split the files into blocks and then it will write the block to datanode.
The question here is how the HDFS Client knows the Blocksize ? Block size is configured in the Name node and the HDFS Client has no idea about the block size then how it will split the file into blocks ?
Upvotes: 5
Views: 1678
Reputation: 186
HDFS is designed in a way where the block size for a particular file is part of the MetaData.
Let's just check what does this mean?
The client can tell the NameNode that it will put data to HDFS with a particular block size. The client has its own hdfs-site.xml that can contain this value, and can specify it on a per-request basis as well using the -Ddfs.blocksize parameter.
If the client configuration does not define this parameter, then it defaults to the org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_DEFAULT value which is 128MB.
NameNode can throw an error for the client if it specifies a blocksize that is smaller then dfs.namenode.fs-limits.min-block-size (1MB by default).
There is nothing magical in this, NameNode does know nothing about the data and let the client to decide the optimal splitting, as well as to define the replication factor for blocks of a file.
Upvotes: 3
Reputation: 534
Some more details below (from the Hadoop Definitive Guide 4th edition)
"The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode. As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue."
Adding more info in response to comment on this post:
Here is a sample client program to copy a file to HDFS (Source-Hadoop Definitive Guide)
public class FileCopyWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
}
If you look at create() method implementation in FileSystem class, it has getDefaultBlockSize() as one of its arguments, which inturn fetches the values from configuration which is turn is provided by the namenode. This is how client gets to know the block size configured on hadoop cluster.
Hope this helps
Upvotes: 0
Reputation: 3374
In simple words, When you do client URI deploy, it will place server URI into Client or you download and manually replace in client. So whenever client request for info, it will go to the NameNode and fetch the required info or place new info on DataNodes.
P.S: Client = EdgeNode
Upvotes: 0