Reputation: 4741
I am trying to check the consistency of a file after copying it to HDFS, using the Hadoop API DFSClient.getFileChecksum().
I am getting the following output from the code below:
Null
HDFS : null
Local : null
Can anyone point out the error or mistake? Here is the code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");
        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");
        // System.out.println("HDFS PATH : "+hdfsPath.getName());
        // System.out.println("Local PATH : "+localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

        if (null != hdfsChecksum || null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
Upvotes: 8
Views: 13633
Reputation: 76
Try this. Here I calculate the MD5 of both the local and the HDFS file and then compare the two to check that the files are equal. Hope this helps.
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import java.security.MessageDigest;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public static void compareChecksumForLocalAndHdfsFile(String sourceHdfsFilePath, String sourceLocalFilepath,
        Map<String, String> hdfsConfigMap) throws Exception {
    // Constants.USERNAME is the author's own constant (not shown here).
    System.setProperty("HADOOP_USER_NAME", hdfsConfigMap.get(Constants.USERNAME));
    System.setProperty("hadoop.home.dir", "/tmp");

    Configuration hdfsConfig = new Configuration();
    hdfsConfig.set(Constants.USERNAME, hdfsConfigMap.get(Constants.USERNAME));
    hdfsConfig.set("fsURI", hdfsConfigMap.get("fsURI"));

    FileSystem hdfs = FileSystem.get(new URI(hdfsConfigMap.get("fsURI")), hdfsConfig);
    Path inputPath = new Path(hdfsConfigMap.get("fsURI") + "/" + sourceHdfsFilePath);

    // Stream both files through the same MD5 routine.
    InputStream is = hdfs.open(inputPath);
    String localChecksum = getMD5Checksum(new FileInputStream(sourceLocalFilepath));
    String hdfsChecksum = getMD5Checksum(is);

    if (null != hdfsChecksum && null != localChecksum) {
        System.out.println("HDFS Checksum : " + hdfsChecksum + "\t" + hdfsChecksum.length());
        System.out.println("Local Checksum : " + localChecksum + "\t" + localChecksum.length());
        if (hdfsChecksum.equals(localChecksum)) {
            System.out.println("Equal");
        } else {
            System.out.println("UnEqual");
        }
    } else {
        System.out.println("Null");
        System.out.println("HDFS : " + hdfsChecksum);
        System.out.println("Local : " + localChecksum);
    }
}

// Reads the stream in 1 KB chunks and feeds it to an MD5 digest.
public static byte[] createChecksum(InputStream fis) throws Exception {
    byte[] buffer = new byte[1024];
    MessageDigest complete = MessageDigest.getInstance("MD5");
    int numRead;
    do {
        numRead = fis.read(buffer);
        if (numRead > 0) {
            complete.update(buffer, 0, numRead);
        }
    } while (numRead != -1);
    fis.close();
    return complete.digest();
}

// See this How-to for a faster way to convert a byte array to a HEX string.
public static String getMD5Checksum(InputStream in) throws Exception {
    byte[] b = createChecksum(in);
    String result = "";
    for (int i = 0; i < b.length; i++) {
        result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
    }
    return result;
}
Output:
HDFS Checksum : d99513cc4f1d9c51679a125702bd27b0 32
Local Checksum : d99513cc4f1d9c51679a125702bd27b0 32
Equal
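A hypothetical invocation (the paths, namenode URI, and username below are placeholders, not taken from the original post) might look like:
Map<String, String> hdfsConfigMap = new HashMap<String, String>();
hdfsConfigMap.put(Constants.USERNAME, "ubuntu");             // placeholder user
hdfsConfigMap.put("fsURI", "hdfs://namenode-host:8020");     // placeholder namenode URI

compareChecksumForLocalAndHdfsFile("user/ubuntu/derby.log",  // HDFS path, relative to fsURI
        "/home/ubuntu/derby.log",                            // local path
        hdfsConfigMap);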
Upvotes: 0
Reputation: 370
Since you aren't setting a remote address on the conf and are essentially using the same configuration, both hadoopFS and localFS are pointing to an instance of LocalFileSystem.
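For illustration, here is a minimal sketch of pointing the Configuration at the cluster so that FileSystem.get(conf) actually returns a DistributedFileSystem (the namenode host and port are placeholders, not from the original post; alternatively, put the cluster's core-site.xml on the classpath):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; without this (or core-site.xml on the classpath),
        // fs.defaultFS stays file:/// and FileSystem.get(conf) is a LocalFileSystem.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem hadoopFS = FileSystem.get(conf);       // DistributedFileSystem
        FileSystem localFS = FileSystem.getLocal(conf);   // LocalFileSystem

        System.out.println("hadoopFS : " + hadoopFS.getClass().getName());
        System.out.println("localFS  : " + localFS.getClass().getName());
    }
}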
getFileChecksum isn't implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, which produces an MD5 of MD5 of CRC32 checksums computed over chunks of size bytes.per.checksum. This value depends on the block size and the cluster-wide config bytes.per.checksum. That's why these two params are also encoded in the return value of the distributed checksum as the name of the algorithm: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
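For illustration, a short sketch (the cluster URI and path are placeholders, not from the original post) that prints the algorithm name, in which those two parameters are visible:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintDfsChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster URI

        FileSystem fs = FileSystem.get(conf);                    // DistributedFileSystem on a real cluster
        FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));
        if (checksum != null) {
            // e.g. "MD5-of-262144MD5-of-512CRC32" for a 128 MB block size and
            // bytes.per.checksum = 512 (the numbers depend on the cluster config)
            System.out.println(checksum.getAlgorithmName());
            System.out.println(checksum.getLength() + " bytes");
        }
    }
}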
getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft map-reduce jobs that calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks that happen when a file gets written to or read from Hadoop.
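As a sketch of that last point (same placeholder URI and path as above): simply reading the file back through the FileSystem API forces the DFS client to verify the stored CRC32 checksums, and corruption surfaces as an IOException (ChecksumException), so a successful full read is itself an integrity check:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadVerify {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster URI

        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/derby.log"))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // discard the data; we only care that every block's checksum verifies
            }
        }
        System.out.println("Read completed; block checksums verified by the DFS client");
    }
}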
Upvotes: 11