Reputation: 4106
I'm trying to understand how is data writing managed in HDFS by reading hadoop-2.4.1 documentation.
According to the following schema :
whenever a client writes something to HDFS, he has no contact with the namenode and is in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).
However, I don't understand how this is managed. Indeed, according to the same documentation :
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Is the schema presented above correct ? If so,
is the namenode only informed of new files when it receives a Blockreport (which can take time, I suppose) ?
why does the client write to multiple nodes ?
If this schema is not correct, how is file creation working with HDFs ?
Upvotes: 5
Views: 864
Reputation: 3359
As you said DataNodes are responsible for serving read/write requests and block creation/deletion/replication.
Then they send on a regular basis “HeartBeats” ( state of health report) and “BlockReport”( list of blocks on the DataNode) to the NameNode.
According to this article:
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, ... Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.
So block reports are done every 30 seconds, I don't think that this may affect Hadoop jobs because in general they are independent jobs.
For your question:
why does the client write to multiple nodes ?
I'll say that actually, the client writes to just one datanode and tell him to send data to other datanodes(see this link picture: CLIENT START WRITING DATA ), but this is transparent. That's why your schema considers that the client is the one who is writing to multiple nodes
Upvotes: 2