SnakeDoc

Reputation: 14361

Get File Hash Performance/Optimization

I'm trying to get the hash of a file as fast as possible. I have a program that hashes large data sets (100GB+) made up of files of widely varying size (from a few KB up to 5GB+ each). A data set can contain anywhere from a handful of files to several hundred thousand.

The program must support all of the hash algorithms Java supports (MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512).

Currently I use:

/**
 * Gets Hash of file.
 * 
 * @param file String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HashTypeException If no supported or valid hash algorithm was found.
 */
public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
    StringBuffer hexString = null;
    try {
        MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
        FileInputStream fis = new FileInputStream(file);

        byte[] dataBytes = new byte[1024];

        int nread = 0;
        while ((nread = fis.read(dataBytes)) != -1) {
            md.update(dataBytes, 0, nread);
        }
        fis.close();
        byte[] mdbytes = md.digest();

        hexString = new StringBuffer();
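        for (int i = 0; i < mdbytes.length; i++) {
            int b = 0xFF & mdbytes[i];
            // Integer.toHexString drops the leading zero for values below 0x10,
            // so pad manually to keep two hex digits per byte.
            if (b < 0x10) {
                hexString.append('0');
            }
            hexString.append(Integer.toHexString(b));
        }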

        return hexString.toString();

    } catch (NoSuchAlgorithmException | HashTypeException e) {
        throw new HashTypeException("Unsupported Hash Algorithm.", e);
    }
}

Is there a more optimized way to get a file's hash? I'm looking for extreme performance and am not sure I have gone about this the best way.

Upvotes: 4

Views: 1091

Answers (2)

tgkprog

Reputation: 4598

In addition to Ernest's answer: I think the result of MessageDigest.getInstance(validateHashType(hashAlgo)) can be cached in a thread-local HashMap, with validateHashType(hashAlgo) as the key. Creating a MessageDigest takes time, but you can reuse instances by calling the reset() method at the start, right after fetching the instance from the map.

See the javadoc of java.lang.ThreadLocal.
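A minimal sketch of that thread-local cache, in case it helps. The class name DigestCache and its get method are hypothetical (they are not part of the question's program), and the sketch takes a plain algorithm name rather than calling the asker's validateHashType. Whether this gives a measurable gain next to the I/O cost is something to verify by profiling.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class DigestCache {
    // One map per thread, because MessageDigest instances are not thread-safe.
    private static final ThreadLocal<Map<String, MessageDigest>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    private DigestCache() {}

    // Returns this thread's cached digest for the algorithm, reset and ready to use.
    static MessageDigest get(String algorithm) {
        MessageDigest md = CACHE.get().computeIfAbsent(algorithm, name -> {
            try {
                return MessageDigest.getInstance(name);
            } catch (NoSuchAlgorithmException e) {
                // A lambda cannot throw the checked exception, so wrap it.
                throw new IllegalArgumentException("Unsupported hash algorithm: " + name, e);
            }
        });
        md.reset(); // discard any state left over from a previous, unfinished use
        return md;
    }
}
```

In getHash, the call MessageDigest.getInstance(...) would then become something like DigestCache.get(validateHashType(hashAlgo)).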

Upvotes: 1

Ernest Friedman-Hill

Reputation: 81684

I see a number of potential performance improvements. One is to use StringBuilder instead of StringBuffer; it's source-compatible but more performant because it's unsynchronized. A second (much more important) would be to use FileChannel and the java.nio API instead of FileInputStream -- or at least, wrap the FileInputStream in a BufferedInputStream to optimize the I/O.
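As a rough sketch of the FileChannel suggestion combined with StringBuilder: the class name NioFileHasher, the 64 KB buffer size, and the use of a direct buffer are all assumptions made for illustration, not measured choices. Buffer size in particular is worth benchmarking on the real data.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only.
final class NioFileHasher {
    // Assumed buffer size; much larger than the original 1024 bytes, tune by measurement.
    private static final int BUFFER_SIZE = 64 * 1024;
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    private NioFileHasher() {}

    static String hash(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        ByteBuffer buffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        // try-with-resources closes the channel even if a read fails.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();     // switch the buffer from filling to draining
                md.update(buffer); // consumes all remaining bytes in the buffer
                buffer.clear();    // make the buffer ready for the next read
            }
        }
        return toHex(md.digest());
    }

    // Unsynchronized StringBuilder, always two hex digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }
}
```

A call would look like NioFileHasher.hash(Paths.get(file), "SHA-256"). With several hundred thousand files, hashing multiple files in parallel may matter more than the per-file read loop, but that depends on the storage and should be measured.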

Upvotes: 5
