merours

Reputation: 4106

How is data written to HDFS?

I'm trying to understand how data writing is managed in HDFS by reading the hadoop-2.4.1 documentation.

According to the following schema:

HDFS architecture

whenever a client writes something to HDFS, it has no contact with the NameNode and is itself in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).

However, I don't understand how this is managed. Indeed, according to the same documentation:

The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Is the schema presented above correct? If so, why does the client write to multiple nodes?
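My mental model of that client-side chunking, as a plain-Python sketch (no Hadoop involved, block size shrunk for illustration; HDFS 2.x defaults to 128 MB blocks):

```python
# Sketch only: how a writer might split a byte stream into fixed-size
# blocks, the way the HDFS client chunks a file into blocks before
# streaming each one out.

BLOCK_SIZE = 8  # illustration only; real HDFS 2.x default: 128 * 1024 * 1024

def chunk(data, block_size=BLOCK_SIZE):
    # Slice the payload into consecutive block-sized pieces; the last
    # piece may be shorter than a full block.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = chunk(b"an example payload!")
assert blocks == [b"an examp", b"le paylo", b"ad!"]
```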

Upvotes: 5

Views: 864

Answers (1)

Mouna

Reputation: 3359

As you said, DataNodes are responsible for serving read/write requests and for block creation, deletion, and replication.

They also regularly send "Heartbeats" (state-of-health reports) and "BlockReports" (the list of blocks stored on the DataNode) to the NameNode.

According to this article:

Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, ... Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.

So block reports are sent every 30 seconds. I don't think this affects Hadoop jobs, because in general they run independently of this background reporting.
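Both intervals are configurable in hdfs-site.xml. As a sketch, the relevant Hadoop 2.x property names with their stock default values (note the stock block-report default is 6 hours, not 30 seconds, so the article's figure may describe a tuned cluster):

```xml
<!-- hdfs-site.xml fragment: stock Hadoop 2.x defaults -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value> <!-- heartbeat every 3 seconds -->
  </property>
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value> <!-- full block report every 6 hours -->
  </property>
</configuration>
```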

For your question:

why does the client write to multiple nodes ?

Actually, the client writes to just one DataNode and tells it to forward the data to the other DataNodes (see the picture at this link: CLIENT START WRITING DATA), but this is transparent to the client. That's why your schema shows the client as the one writing to multiple nodes.
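That pipeline forwarding can be sketched in plain Python (no Hadoop; the class and names are hypothetical) to show why it looks like the client writes to all the nodes even though it only talks to the first one:

```python
# Sketch only: a write pipeline where each node stores a packet and
# forwards it downstream, the way DataNodes replicate an HDFS block.

class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next DataNode in the pipeline, or None
        self.blocks = []

    def write(self, packet):
        # Store the packet locally, then forward it down the pipeline.
        self.blocks.append(packet)
        if self.downstream is not None:
            self.downstream.write(packet)

# Build a 3-node pipeline (replication factor 3).
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

# The client only ever writes to the first DataNode...
dn1.write("block-0:packet-0")

# ...yet all three replicas end up with the same data.
assert dn1.blocks == dn2.blocks == dn3.blocks == ["block-0:packet-0"]
```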

Upvotes: 2

Related Questions