dmdevito

Reputation: 51

Hadoop block checksum: stored in namenode too?

The checksum of an HDFS block is stored in a local file, alongside the raw content of the block, on each of the datanodes that holds a replica.

I am wondering: is the checksum of a block also stored in the namenode, as part of the metadata of a file?

Upvotes: 1

Views: 1149

Answers (2)

Abdelrahman Maharek

Reputation: 872

The Short Answer: Checksums are stored on datanodes

Explanation:

  1. HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1% (see the sketch after this list).
  2. Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication.
  3. A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum.
    • If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).
  4. When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
  5. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
  6. In addition to block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media.
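To make the client-side behaviour above concrete, here is a minimal sketch (not from the book; the path is hypothetical) showing how the checksum granularity and read-time verification can be controlled through Hadoop's public FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One checksum is kept per this many bytes of data (default 512).
        conf.setInt("dfs.bytes-per-checksum", 512);

        FileSystem fs = FileSystem.get(conf);

        // Reads verify checksums by default; passing false disables
        // verification, e.g. when trying to salvage a corrupt file.
        fs.setVerifyChecksum(true);

        // Hypothetical path; on HDFS this returns an end-to-end checksum
        // computed from the per-block CRCs stored on the datanodes.
        Path file = new Path("/tmp/example.txt");
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(checksum);
    }
}
```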

see "hadoop the definitive guide 4th edition page 98"

Upvotes: 0

Rahul

Reputation: 2384

No. The checksum is stored only alongside the blocks on the slave nodes (also called DataNodes).

From the Apache documentation for HDFS:

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software.

It works in the following manner.

  1. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
  2. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
  3. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
  4. If the checksum of the replica on another DataNode matches the checksum stored in the hidden file, the system serves that block (see the sketch after this list).
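The hidden-checksum-file mechanism described above is easy to observe with Hadoop's LocalFileSystem, which is a ChecksumFileSystem and keeps a hidden .crc sidecar file next to each data file. A minimal sketch, assuming a writable /tmp (the path and file name are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class HiddenCrcDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LocalFileSystem is a ChecksumFileSystem: checksums live in a
        // hidden sidecar file next to the data, not in central metadata.
        LocalFileSystem fs = FileSystem.getLocal(conf);

        Path file = new Path("/tmp/demo.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs checksums");
        }

        // The checksum is stored in a hidden ".demo.txt.crc" file
        // in the same directory as the data file.
        Path crc = fs.getChecksumFile(file);
        System.out.println(crc + " exists? " + fs.exists(crc));
    }
}
```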

Upvotes: 2
