Reputation: 721
I want to compute the MD5 of many different files. I am following this answer to do that, but the main problem is that computing the MD5 of the files (possibly hundreds of them) takes a lot of time.
Is there any way to find the MD5 of a file without consuming much time?
Note: The files may be large (up to 300 MB).
This is the code I am using:
    import java.io.*;
    import java.security.MessageDigest;

    public class MD5Checksum {

        public static byte[] createChecksum(String filename) throws Exception {
            InputStream fis = new FileInputStream(filename);
            byte[] buffer = new byte[1024];
            MessageDigest complete = MessageDigest.getInstance("MD5");
            int numRead;
            do {
                numRead = fis.read(buffer);
                if (numRead > 0) {
                    complete.update(buffer, 0, numRead);
                }
            } while (numRead != -1);
            fis.close();
            return complete.digest();
        }

        // see this How-to for a faster way to convert
        // a byte array to a HEX string
        public static String getMD5Checksum(String filename) throws Exception {
            byte[] b = createChecksum(filename);
            String result = "";
            for (int i = 0; i < b.length; i++) {
                result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
            }
            return result;
        }

        public static void main(String args[]) {
            try {
                System.out.println(getMD5Checksum("apache-tomcat-5.5.17.exe"));
                // output :
                // 0bb2827c5eacf570b6064e24e0e6653b
                // ref :
                // http://www.apache.org/dist/
                // tomcat/tomcat-5/v5.5.17/bin
                // /apache-tomcat-5.5.17.exe.MD5
                // 0bb2827c5eacf570b6064e24e0e6653b *apache-tomcat-5.5.17.exe
            }
            catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
Upvotes: 2
Views: 2144
Reputation: 1407
You cannot use hashes to determine any similarity of content.
For instance, generating the MD5 of hellostackoverflow1 and hellostackoverflow2 yields two hashes whose string representations have no characters in common (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated from the binary data of the file, so two different formats of the same thing, e.g. a .txt and a .docx of the same text, have different hashes.
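To try this yourself, here is a minimal sketch that hashes the two example strings and prints both digests for comparison. The class name HashAvalancheDemo and the helper md5Hex are made up for illustration, and the strings are hashed as UTF-8 bytes, which is an assumption about how the digests quoted above were produced.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of the question's code.
class HashAvalancheDemo {

    // Returns the MD5 digest of the UTF-8 bytes of text as a lowercase hex string.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The inputs differ in one character; compare the two printed digests.
        System.out.println(md5Hex("hellostackoverflow1"));
        System.out.println(md5Hex("hellostackoverflow2"));
    }
}
```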
But as already noted, some speed might be gained by using native code, i.e. the NDK. Additionally, if you still want to compare files for exact matches, first compare their sizes in bytes, and only then use a hashing algorithm that is fast enough and has a low risk of collisions. As stated, CRC32 is fine.
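The size-first idea could look roughly like the sketch below. It uses java.util.zip.CRC32 from the standard library. The class name QuickFileCompare, the method names, and the 64 KB buffer are assumptions for illustration. Keep in mind that equal CRC32 values only make a match likely, so a byte-by-byte comparison is still needed if you must be certain.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper class, not part of the question's code.
class QuickFileCompare {

    // Streams the whole file through CRC32 and returns the checksum value.
    static long crc32Of(File file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024]; // assumed size, tune as needed
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }

    // Cheapest check first: files of different length can never match exactly,
    // so the files are only read when the sizes are equal.
    static boolean probablySameContent(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        return crc32Of(a) == crc32Of(b);
    }
}
```

With hundreds of files, the length check alone usually rules out most pairs without reading any file content.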
Upvotes: 2
Reputation: 42585
Hash/CRC calculation takes some time as the file has to be read completely.
The createChecksum code you presented is nearly optimal. The only part that can be tweaked is the read buffer size (I would use a buffer size of 2048 bytes or larger). However, this may get you a speed improvement of 1-2% at most.
If this is still too slow, the only option left is to implement the hashing in C/C++ and use it as a native method. Besides that, there is nothing you can do.
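As a sketch, this is the same read loop with a larger buffer. The 8 KB value is just an assumption to experiment with, and BufferedMd5 is a made-up class name. The try-with-resources block is an addition that closes the stream even when reading fails; it is not expected to affect speed.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical variant of the question's createChecksum.
class BufferedMd5 {

    // Assumed buffer size; measure on the target device before settling on a value.
    static final int BUFFER_SIZE = 8 * 1024;

    static byte[] createChecksum(String filename)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[BUFFER_SIZE];
        // The stream is closed automatically, even if read() throws.
        try (InputStream in = new FileInputStream(filename)) {
            int numRead;
            while ((numRead = in.read(buffer)) != -1) {
                md.update(buffer, 0, numRead);
            }
        }
        return md.digest();
    }
}
```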
Upvotes: 0