myloginid

Reputation: 1473

Compare HDFS Checksum to Local File System Checksum

I am trying to write a simple script to compare HDFS and local file system checksums.

On HDFS I get:

[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt  MD5-of-0MD5-of-512CRC32C        000002000000000000000000755ca25bd89d1a2d64990a68dedb5514

On the local file system, I get:

[m@x01tbipapp3a ~]$ cksum file.txt
3802590149 26276247 file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
c1aae0db584d72402d5bcf5cbc29134c  file.txt

Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matches the cksum value, but it does not.

Is there a way to compare the two checksums using any algorithm?

thanks

Upvotes: 7

Views: 10768

Answers (5)

z11i

Reputation: 1296

Starting from Hadoop 3.1, checksums that are comparable with local files can be produced by HDFS. However, the comparison depends on how you put the file into HDFS in the first place. By default, HDFS uses an MD5-of-MD5-of-CRC32C scheme: CRC32C checksums are computed for individual chunks, an MD5 is taken over each block's chunk checksums, and then an MD5 is taken over those MD5s.

This means that you can't easily compare that checksum with one computed on a local copy. You can instead write the file with a CRC32 checksum type in the first place:

hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp

Then, to get the checksum:

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile

For the local copy:

crc32 myFile

If you didn't upload the file with a CRC32 checksum, or don't want to upload it again with one, you can also just upload the local copy you want to compare against, letting it get the default CRC32C checksum:

hdfs dfs -put myFile /tmp

And compare the two files on HDFS with:

hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile
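The two output lines can then be compared by extracting the final field (the hex digest). A minimal sketch, with illustrative paths and digests standing in for real hdfs dfs -checksum output:

```shell
#!/bin/sh
# Compare two `hdfs dfs -checksum` output lines by their final field
# (the hex digest). The lines below are illustrative stand-ins for the
# real output of `hdfs dfs -checksum /tmp/myFile` and the like.
checksums_match() {
    digest_a=$(printf '%s\n' "$1" | awk '{print $NF}')
    digest_b=$(printf '%s\n' "$2" | awk '{print $NF}')
    [ "$digest_a" = "$digest_b" ]
}

line_a="/tmp/myFile       COMPOSITE-CRC32C        b9c8f6a1"
line_b="/tmp/myOtherFile  COMPOSITE-CRC32C        b9c8f6a1"

if checksums_match "$line_a" "$line_b"; then
    echo "checksums match"
else
    echo "checksums differ"
fi
```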



Upvotes: 2

user9074332

Reputation: 2636

Piping the output of hadoop fs -cat on an HDFS file to md5sum worked for me:

$ hadoop fs -cat /path/to/hdfs/file.dat | md5sum
cb131cdba628676ce6942ee7dbeb9c0f  -

$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f  /path/to/localFilesystem/file.txt

This is not recommended for massive files, since the whole file has to be streamed through the client.
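The same idea can be wrapped in a small helper. A sketch of the comparison logic, where a plain local cat stands in for the hadoop fs -cat stream (with HDFS you would pipe hadoop fs -cat /path/to/hdfs/file.dat into the function instead):

```shell
#!/bin/sh
# Compare the md5 of whatever arrives on stdin against the md5 of a
# local file. With HDFS, stdin would be fed by `hadoop fs -cat`;
# here a plain `cat` of a local copy demonstrates the comparison.
stream_matches_file() {
    stream_md5=$(md5sum | awk '{print $1}')     # md5 of stdin
    file_md5=$(md5sum "$1" | awk '{print $1}')  # md5 of the local file
    [ "$stream_md5" = "$file_md5" ]
}

printf 'some test data\n' > /tmp/local_copy.txt

# Real use: hadoop fs -cat /path/to/hdfs/file.dat | stream_matches_file /tmp/local_copy.txt
if cat /tmp/local_copy.txt | stream_matches_file /tmp/local_copy.txt; then
    echo "checksums match"
fi
```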

Upvotes: 1

r2d2

Reputation: 21

I was also confused because the MD5 was not matching. It turned out the Hadoop checksum is not a simple MD5; it's an MD5 of MD5s of CRC32C checksums. :-)

See this:

http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg@mail.gmail.com%3E

and this:

http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%[email protected]%3E

Upvotes: 1

Hasan S Syed

Reputation: 21

I used a workaround for this: a simple script that compares checksums of the local and HDFS file systems using md5sum. I mounted my HDFS file system locally at /hdfs.

md5sum /hdfs/md5test/* | awk '{print $1}' > /root/hdfsfile.txt
md5sum /test/* | awk '{print $1}' > /root/localfile.txt
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1;
then
    /bin/mail -s "checksum difference between local and hdfs files" [email protected] < /dev/null
fi

Upvotes: 0

jeetendra rawal

Reputation: 19

This is not a solution but a workaround which can be used.

Local file checksum:

cksum test.txt

HDFS checksum (copy the file out of HDFS first):

hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt

You can compare them.
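A sketch of that workaround end to end (the hadoop fs -cat step is shown as a comment, since it needs a Hadoop client; a second local write stands in for the downloaded copy so the comparison logic can run anywhere):

```shell
#!/bin/sh
# cksum-based comparison of a local file against a copy pulled out of
# HDFS. The `hadoop fs -cat` line is commented out; the write below it
# is a stand-in for the file it would produce.
printf 'test content\n' > /tmp/test.txt

# hadoop fs -cat /user/test/test.txt > /tmp/hdfs_copy.txt
printf 'test content\n' > /tmp/hdfs_copy.txt   # stand-in for the line above

# cksum prints "<crc> <size> <name>"; compare the first two fields only
local_sum=$(cksum /tmp/test.txt | awk '{print $1, $2}')
hdfs_sum=$(cksum /tmp/hdfs_copy.txt | awk '{print $1, $2}')

if [ "$local_sum" = "$hdfs_sum" ]; then
    echo "files match"
else
    echo "files differ"
fi
```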

Hope it helps.

Upvotes: 1
