030

Reputation: 11679

Write Path HDFS

Introduction

Follow-up question to this question.

A file has been written to HDFS and subsequently replicated to three DataNodes. If the same file is written again, HDFS reports that the file already exists.

Based on this answer, a file is split into blocks of 64 MB (depending on the configuration settings). A mapping of the filename to its blocks is created in the NameNode, so the NameNode knows on which DataNodes the blocks of a given file reside. If the same file is provided again, the NameNode knows that blocks of this file exist on HDFS and indicates that the file already exists.
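To make the block arithmetic concrete, here is a minimal sketch in Python. It assumes the 64 MB block size mentioned above; the function name `block_lengths` and the 200 MB example size are made up for illustration and are not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size assumed in the text


def block_lengths(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block for a file of file_size bytes.

    Hypothetical helper for illustration only: every block is full except
    possibly the last one, which holds the remainder.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    mb = 1024 * 1024
    # A hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
    print([n // mb for n in block_lengths(200 * mb)])
```

Under these assumptions the last block is simply shorter than the others; it is not padded to the full block size.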

If the content of a file is changed and the file is provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular to the filename? Which process is responsible for this?

Which process is responsible for splitting a file into blocks?

Example Write path:

According to this documentation the Write Path of HBase is as follows:

HBase Write Path

Possible Write Path HDFS:

  1. File provided to HDFS, e.g. hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso /
  2. File name checked against the FSImage to see whether it already exists. If it does, the message "file already exists" is displayed
  3. File split into blocks of 64 MB (depending on the configuration setting). Question: what is the name of the process responsible for block splitting?
  4. Blocks replicated to DataNodes (the replication factor can be configured)
  5. Mapping of file name to blocks (metadata) stored in the EditLog located in the NameNode

Question

What does the HDFS write path look like?

Upvotes: 2

Views: 1472

Answers (1)

cabad

Reputation: 4575

If the content of a file is changed and the file is provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular to the filename?

No, it does not update the file. The NameNode only checks whether the path (file name) already exists; it does not compare file contents.
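A toy model of that path-only check, as a hedged sketch in Python. The class `ToyNamespace` and its `create` method are hypothetical and only mimic the behaviour described here; they are not the NameNode's real implementation or API.

```python
class ToyNamespace:
    """Hypothetical stand-in for the NameNode's namespace (path -> block ids)."""

    def __init__(self):
        self.files = {}

    def create(self, path, overwrite=False):
        # The decision depends only on the path. File content is never
        # passed in, so a changed file under the same name is rejected too.
        if path in self.files and not overwrite:
            raise FileExistsError(f"{path}: File exists")
        self.files[path] = []  # start with an empty block list


if __name__ == "__main__":
    ns = ToyNamespace()
    ns.create("/ubuntu-14.04-desktop-amd64.iso")
    try:
        # Second upload under the same path, whatever the bytes are.
        ns.create("/ubuntu-14.04-desktop-amd64.iso")
    except FileExistsError as err:
        print("rejected:", err)
    # An explicit overwrite replaces the entry instead of updating it.
    ns.create("/ubuntu-14.04-desktop-amd64.iso", overwrite=True)
```

In practice, replacing a file means deleting it first or asking for an overwrite explicitly. As far as I know, recent Hadoop versions accept a -f option on hadoop fs -copyFromLocal for that purpose; check the documentation for your version.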

What does the HDFS write path look like?

This is explained in detail in this paper: "The Hadoop Distributed File System" by Shvachko et al. In particular, read Section 2.C (and check Figure 1):

"When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1."
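Note that in this description it is the client that cuts the byte stream into blocks, not a separate process on the NameNode. The quoted steps can be sketched as a small simulation in Python. Everything here is a simplification under stated assumptions: `ToyNameNode`, `ToyDataNode` and `write_file` are hypothetical names, DataNodes are chosen round-robin rather than by HDFS's real placement policy, and there is no network, acknowledgement or failure handling.

```python
import itertools


class ToyDataNode:
    """Hypothetical DataNode: stores blocks and forwards them down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def receive(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            # Pass the block on to the next node in the pipeline.
            downstream[0].receive(block_id, data, downstream[1:])


class ToyNameNode:
    """Hypothetical NameNode: keeps metadata only, never file bytes."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("not enough DataNodes for the replication factor")
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # path -> list of (block id, [DataNode names])
        self._ids = itertools.count()
        self._next = 0

    def create(self, path):
        if path in self.block_map:
            raise FileExistsError(f"{path}: File exists")
        self.block_map[path] = []

    def add_block(self, path):
        # Assumption: simple round-robin choice of targets for each block.
        n = len(self.datanodes)
        targets = [self.datanodes[(self._next + i) % n]
                   for i in range(self.replication)]
        self._next = (self._next + 1) % n
        block_id = next(self._ids)
        self.block_map[path].append((block_id, [d.name for d in targets]))
        return block_id, targets


def write_file(namenode, path, data, block_size):
    """Client side: split data into blocks and push each through its own pipeline."""
    namenode.create(path)
    for offset in range(0, len(data), block_size):
        block_id, pipeline = namenode.add_block(path)
        chunk = data[offset:offset + block_size]
        pipeline[0].receive(block_id, chunk, pipeline[1:])


if __name__ == "__main__":
    nodes = [ToyDataNode(f"dn{i}") for i in range(4)]
    nn = ToyNameNode(nodes)
    write_file(nn, "/demo.bin", b"abcdefghij", block_size=4)
    for block_id, targets in nn.block_map["/demo.bin"]:
        print(block_id, targets)
```

In this sketch each block gets a fresh pipeline, so different blocks of the same file usually end up on different sets of DataNodes, which matches the remark in the quote that the choice of DataNodes for each block is likely to differ.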

NOTE: A book chapter based on this paper is also available online, and a direct link to the corresponding figure (Fig. 1 in the paper, Fig. 8.1 in the book) is here.

Upvotes: 2
