Nireekshan

Reputation: 317

Why does the HDFS client cache the file data in a temporary local file?

Why can't the HDFS client send the data directly to the DataNode?

What's the advantage of the HDFS client cache?

  1. An application request to create a file does not reach the NameNode immediately.
  2. In fact, initially the HDFS client caches the file data into a temporary local file.
  3. Application writes are transparently redirected to this temporary local file.
  4. When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode to create a file.
  5. The NameNode then proceeds as described in the section on Create. The client flushes the block of data from the local temporary file to the specified DataNodes.
  6. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
  7. The client then tells the NameNode that the file is closed.
  8. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

Upvotes: 2

Views: 969

Answers (1)

Chris Nauroth

Reputation: 9844

It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out of date and no longer an accurate description of current HDFS behavior.

Instead, the client immediately issues a create RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then, the client starts writing data to the file. As the client writes data, it is writing on a socket connection to a DataNode. If the written data becomes large enough to cross a block size boundary, then the client will interact with the NameNode again for an addBlock RPC to allocate a new block in NameNode metadata and get a new set of candidate DataNode locations. There is no point at which the client is writing to a local temporary file.
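For reference, here is a minimal sketch of what this looks like from application code using the public FileSystem API (the path and cluster address are placeholders). The create and addBlock RPCs described above happen inside the client library; the application just writes to the returned stream:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at your HDFS cluster, e.g. hdfs://namenode:8020.
            FileSystem fs = FileSystem.get(conf);

            // create() issues the create RPC to the NameNode, which records the new
            // file in its metadata and supplies candidate DataNode locations.
            Path path = new Path("/tmp/example.txt");
            try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
                // Writes go over a socket connection to the DataNode pipeline; when a
                // block boundary is crossed, the client transparently calls addBlock
                // on the NameNode to allocate the next block. Nothing is staged in a
                // local temporary file.
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            } // close() finishes the last block and the client reports the file complete.
        }
    }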

Note, however, that alternative file systems, such as S3AFileSystem, which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation for Integration with Amazon Web Services if you're interested in more details on this.)
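As a sketch of that option (assuming a recent Hadoop release with the S3A connector; the bucket name is a placeholder and credentials are assumed to be configured elsewhere), the fs.s3a.fast.upload.buffer property selects where pending uploads are buffered:

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ADiskBufferExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Buffer data for pending S3 uploads on local disk rather than in memory
            // ("array" and "bytebuffer" are the in-memory alternatives).
            conf.set("fs.s3a.fast.upload.buffer", "disk");

            // "example-bucket" is a placeholder bucket name.
            FileSystem s3a = FileSystem.get(new URI("s3a://example-bucket/"), conf);
            try (FSDataOutputStream out = s3a.create(new Path("/staged.txt"), true)) {
                out.write("buffered locally, then uploaded".getBytes(StandardCharsets.UTF_8));
            }
        }
    }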

I have filed Apache JIRA HDFS-11995 to track correcting the documentation.

Upvotes: 4
