Reputation: 1473
I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS I get -
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C **000002000000000000000000755ca25bd89d1a2d64990a68dedb5514**
On the Local File System, I get -
[m@x01tbipapp3a ~]$ cksum file.txt
**3802590149 26276247** file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
**c1aae0db584d72402d5bcf5cbc29134c** file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matched the cksum output, but it does not...
Is there a way to compare the two checksums using any algorithm?
thanks
Upvotes: 7
Views: 10768
Reputation: 1296
Starting from Hadoop 3.1, HDFS checksums can be compared with local ones. However, the comparison depends on how you put
the file into HDFS in the first place. By default, HDFS uses CRC32C chunk checksums, and hadoop fs -checksum reports an MD5 of the MD5s of those per-chunk CRCs.
This means you can't easily compare that checksum with one of a local copy. You can instead write the file with CRC32 checksums in the first place:
hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp
Then, to get the checksum:
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile
For the local copy:
crc32 myFile
If you didn't upload the file with CRC32 checksums, or don't want to upload it again, you can instead upload the local copy you want to compare against (with the default CRC32C checksums):
hdfs dfs -put myFile /tmp
and compare the two files on HDFS with:
hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile
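The final comparison boils down to extracting the checksum field from each command's output and testing equality. A minimal sketch of that step, using hypothetical output lines as stand-ins for real command output (the capture commands you'd actually run against a cluster are shown in the comments):

```shell
# Hypothetical stand-ins for real output; on a cluster you would capture them with:
#   hdfs_out=$(hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile)
#   local_out=$(crc32 myFile)
hdfs_out="/tmp/myFile	COMPOSITE-CRC32	b2e1a9f0"
local_out="b2e1a9f0	myFile"

hdfs_crc=$(printf '%s\n' "$hdfs_out" | awk '{print $NF}')   # checksum is the last field
local_crc=$(printf '%s\n' "$local_out" | awk '{print $1}')  # crc32 prints the sum first

if [ "$hdfs_crc" = "$local_crc" ]; then
    echo "checksums match"
fi
```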
Upvotes: 2
Reputation: 2636
Piping the results of a cat'd hdfs file to md5sum worked for me:
$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f -
$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f /path/to/localFilesystem/file.txt
This would not be recommended for massive files.
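The same pipe can be wrapped in a small comparison script. A sketch, simulated locally (the plain cat of so_remote.dat stands in for "hadoop fs -cat /path/to/hdfs/file.dat" on a real cluster; note that md5sum on stdin prints "hash -", so only the first field is compared):

```shell
# Create two identical files; so_remote.dat plays the role of the HDFS copy.
printf 'sample payload\n' > /tmp/so_local.dat
printf 'sample payload\n' > /tmp/so_remote.dat

# On a real cluster: remote_md5=$(hadoop fs -cat /path/to/hdfs/file.dat | md5sum | awk '{print $1}')
remote_md5=$(cat /tmp/so_remote.dat | md5sum | awk '{print $1}')
local_md5=$(md5sum /tmp/so_local.dat | awk '{print $1}')

[ "$remote_md5" = "$local_md5" ] && echo "md5 match"
```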
Upvotes: 1
Reputation: 21
I was also confused because the MD5s were not matching; it turned out the Hadoop checksum is not a plain MD5, it's an MD5 of MD5s of CRC32C chunk checksums :-)
See this mailing-list thread:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%[email protected]%3E
Upvotes: 1
Reputation: 21
I used a workaround for this: a simple script that compares checksums of the local and HDFS file systems using md5sum. I have mounted my HDFS file system locally at /hdfs.
md5sum /hdfs/md5test/* | awk '{print $1}' > /root/hdfsfile.txt
md5sum /test/* | awk '{print $1}' > /root/localfile.txt
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1
then
    /bin/mail -s "checksum difference between local and hdfs files" [email protected] < /dev/null
fi
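A variation on the same idea, simulated locally (/tmp/so_hdfs and /tmp/so_local stand in for the mounted /hdfs/md5test and /test directories, which are assumptions for the sketch): keeping the filenames in the lists and sorting by name makes diff report which file differs, not just that something does.

```shell
# Build two small directories with matching contents.
mkdir -p /tmp/so_hdfs /tmp/so_local
printf 'one\n' > /tmp/so_hdfs/f1.txt; printf 'one\n' > /tmp/so_local/f1.txt
printf 'two\n' > /tmp/so_hdfs/f2.txt; printf 'two\n' > /tmp/so_local/f2.txt

# Hash inside each directory so the paths in the lists are comparable,
# then sort by filename (field 2) for a stable, diffable order.
( cd /tmp/so_hdfs  && md5sum * | sort -k2 ) > /tmp/so_hdfs.md5
( cd /tmp/so_local && md5sum * | sort -k2 ) > /tmp/so_local.md5

if diff /tmp/so_local.md5 /tmp/so_hdfs.md5 > /dev/null 2>&1; then
    echo "all files match"
else
    echo "difference found"
fi
```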
Upvotes: 0
Reputation: 19
This is not a solution but a workaround which can be used.
Local file checksum:
cksum test.txt
HDFS checksum:
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can then compare the two.
Hope it helps.
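A local simulation of this workaround (the cp stands in for "hadoop fs -cat /user/test/test.txt > tmp.txt"). One nice property: cksum prints both a CRC and a byte count, so comparing its full output also catches truncated copies.

```shell
# test.txt plays the local file; tmp.txt plays the downloaded HDFS copy.
printf 'some data\n' > test.txt
cp test.txt tmp.txt

# Keep only the CRC and byte count (fields 1 and 2), dropping the filename.
local_ck=$(cksum test.txt | awk '{print $1, $2}')
hdfs_ck=$(cksum tmp.txt | awk '{print $1, $2}')

[ "$local_ck" = "$hdfs_ck" ] && echo "cksum match"
```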
Upvotes: 1