Reputation: 317
Why can't HDFS client send directly to DataNode?
What's the advantage of HDFS client cache?
Upvotes: 2
Views: 969
Reputation: 9844
It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out-of-date and no longer an accurate description of current HDFS behavior.
Instead, the client immediately issues a `create` RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then the client starts writing data to the file; as it writes, the data is streamed over a socket connection to a DataNode. If the written data grows large enough to cross a block size boundary, the client interacts with the NameNode again, issuing an `addBlock` RPC to allocate a new block in NameNode metadata and get a new set of candidate DataNode locations. There is no point at which the client writes to a local temporary file.
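To make the sequence above concrete, here is a minimal, self-contained sketch of the client-side logic, with the NameNode stubbed out. This is not the real `DFSOutputStream` implementation; the `MockNameNode` class and the tiny 128-byte "block size" are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the HDFS client write path: create RPC, then addBlock RPC
// each time a block boundary is crossed. All names here are hypothetical
// stand-ins, not the real HDFS client API.
public class HdfsWriteSketch {
    static final long BLOCK_SIZE = 128; // illustrative only; the real default is much larger

    static class MockNameNode {
        int blockCounter = 0;
        // Stand-in for the "create" RPC: register the file, return the first block's DataNodes.
        List<String> create(String path) { return allocateBlock(); }
        // Stand-in for the "addBlock" RPC: allocate a new block, return new DataNodes.
        List<String> addBlock() { return allocateBlock(); }
        private List<String> allocateBlock() {
            blockCounter++;
            return List.of("dn-" + blockCounter + "-a", "dn-" + blockCounter + "-b");
        }
    }

    final MockNameNode nameNode = new MockNameNode();
    long bytesInCurrentBlock = 0;
    List<String> currentPipeline;
    final List<String> log = new ArrayList<>();

    void open(String path) {
        // create RPC first -- no local temporary file is involved
        currentPipeline = nameNode.create(path);
        log.add("create -> " + currentPipeline);
    }

    void write(byte[] data) {
        for (byte b : data) {
            if (bytesInCurrentBlock == BLOCK_SIZE) {
                // Crossed a block boundary: ask the NameNode for a new block.
                currentPipeline = nameNode.addBlock();
                log.add("addBlock -> " + currentPipeline);
                bytesInCurrentBlock = 0;
            }
            bytesInCurrentBlock++; // in reality, streamed to the DataNode pipeline
        }
    }

    public static void main(String[] args) {
        HdfsWriteSketch client = new HdfsWriteSketch();
        client.open("/user/demo/file.txt");
        client.write(new byte[300]); // 300 bytes at a 128-byte block size -> 3 blocks
        client.log.forEach(System.out::println);
    }
}
```

Writing 300 bytes against the 128-byte block size triggers one `create` and two `addBlock` calls, mirroring how a large write spans multiple blocks.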
Note, however, that alternative file systems, such as `S3AFileSystem`, which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation on integration with Amazon Web Services if you're interested in more details.)
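For example, assuming a reasonably recent Hadoop release, the S3A connector exposes a buffering option in `core-site.xml`; the `disk` value below buffers pending uploads to local disk (check your version's S3A documentation for the exact property names and supported values):

```xml
<!-- Illustrative S3A setting: buffer pending block uploads to local disk -->
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
```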
I have filed Apache JIRA HDFS-11995 to track correcting the documentation.
Upvotes: 4