Reputation: 143
According to Hadoop: The Definitive Guide, Second Edition
A. Datanodes are responsible for verifying the data they receive before storing the data and its checksum.
Do they verify the data by verifying the checksum?
B. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum.
So does every datanode verify the checksum (as stated in A), or does only the last datanode in the pipeline verify it (as stated in B)?
Upvotes: 1
Views: 2201
Reputation: 7794
It depends on which version of Hadoop you are running. The latest version verifies the checksum only on the last datanode in the pipeline, because there was no real reason to do it on every node. This is explained in the JIRA issue: https://issues.apache.org/jira/browse/HADOOP-3328
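To make the "last node only" behaviour concrete, here is a toy sketch in Python. It is not Hadoop code: the names `chunk_checksums`, `write_through_pipeline` and `ChecksumError`, the dict-based "datanodes" and the 512-byte chunk size are all assumptions made up for illustration. It only shows the idea that every node stores the data together with its checksums, while the comparison itself happens once, at the end of the pipeline.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # illustrative chunk size, an assumption of this sketch


class ChecksumError(Exception):
    """Raised when received data does not match the checksums sent with it."""


def chunk_checksums(data: bytes, chunk: int = BYTES_PER_CHECKSUM) -> list:
    """Return one CRC-32 value per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def write_through_pipeline(data: bytes, sums: list, pipeline: list) -> None:
    """Store data and checksums on every node; only the last node verifies."""
    for position, node in enumerate(pipeline):
        is_last = position == len(pipeline) - 1
        # Intermediate nodes skip verification; the last node recomputes
        # the checksums and compares them with the ones the client sent.
        if is_last and chunk_checksums(data) != sums:
            raise ChecksumError("last datanode detected corrupt data")
        node["data"] = data
        node["sums"] = sums


# Hypothetical three-node pipeline, each "datanode" modelled as a dict.
payload = b"a" * 1300
nodes = [{}, {}, {}]
write_through_pipeline(payload, chunk_checksums(payload), nodes)
```

If the data had been corrupted anywhere along the way, the check on the last node would fail and the client would learn about it, which is why repeating the check on every hop adds little.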
It's also worth noting that a client reading blocks back verifies the checksum of each block it reads. If a block does not match its checksum, the client requests the same block from another datanode that holds a replica of it.
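The read-side fallback can be sketched the same way. Again, this is a toy model and not the HDFS client: `read_block`, `ChecksumMismatch` and the `(name, data)` replica tuples are hypothetical, and a single CRC-32 per block is a simplification chosen for brevity.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when no replica matches the expected checksum."""


def read_block(replicas: list, expected_crc: int) -> tuple:
    """Return (datanode_name, data) from the first replica that verifies.

    replicas is a list of (datanode_name, data) pairs, tried in order.
    """
    for name, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return name, data
        # Checksum mismatch: fall through and try the next replica.
    raise ChecksumMismatch("no replica matched the expected checksum")


# Hypothetical replicas: the copy on dn1 is corrupt, the one on dn2 is intact.
good = b"hello"
replicas = [("dn1", b"hellp"), ("dn2", good)]
source, data = read_block(replicas, zlib.crc32(good))
print(source)  # prints "dn2"
```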
Upvotes: 2