Jens Schauder
Jens Schauder

Reputation: 81907

Fast Incremental Hash in Java

I'm looking for a hash function to hash Strings. For my purposes (identifying changed objects during an import) it should have the following properties:

  1. fast

  2. can be used incremental, i.e. I can use it like this:

    Hasher h = new Hasher();
    h.add("somestring");
    h.add("another part");
    h.add("eveno more");
    Long hash = h.create();
    

    without compromising the other properties or keeping the strings in memory during the complete process.

  3. Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be neglectable.

It does not have to be secure against malicious attempts to create collisions.

What algorithm can I use? An algorithm with an existent free implementation in Java is preferred.

Clarification

  1. The hash doesn't have to be a long. A String for example would be just fine.

  2. The data to be hashed will come from a file or a db, with many 10MB or up to a few GB of data, that will get distributed into different Hashes. So keeping the complete Strings in memory is not really an option.

Upvotes: 2

Views: 2161

Answers (2)

Durandal
Durandal

Reputation: 20059

Not intended as an answer, just to demonstrate that hash collisions are much more likely than human intuition tends to assume.

The following tiny program generates 2^31 distinct strings and checks if any of their hashes collide. It does this by keeping a tracking bit per possible hash value (so you need >512MB heap to run it), to mark each hash value as "used" as they are encountered. It takes several minutes to complete.

public class TestStringHashCollisions {

    public static void main(String[] argv) {
        long collisions = 0;
        long testcount = 0;
        StringBuilder b = new StringBuilder(64);
        for (int i=0; i>=0; ++i) {
            // construct distinct string
            b.setLength(0);
            b.append("www.");
            b.append(Integer.toBinaryString(i));
            b.append(".com");

            // check for hash collision
            String s = b.toString();
            ++testcount;
            if (isColliding(s.hashCode()))
                ++collisions;

            // progress printing
            if ((i & 0xFFFFFF) == 0) {
                System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
            }
        }
        System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
        System.out.println("Collision ratio: " + (collisions / (double) testcount));
    }

    // storage for 2^32 bits in 2^27 ints
    static int[] bitSet = new int[1 << 27];

    // test if hash code has appeared before, mark hash as "used"
    static boolean isColliding(int hash) {
        int index = hash >>> 5;
        int bitMask = 1 << (hash & 31);
        if ((bitSet[index] & bitMask) != 0)
            return true;
        bitSet[index] |= bitMask;
        return false;
    }

}

You can adjust the string generation part easily to test different patterns.

Upvotes: 1

TheConstructor
TheConstructor

Reputation: 4465

Hashs are a sensible topic and it is hard to recommend any such hash based upon your question. You might want to ask this question on https://security.stackexchange.com/ to get expert opinions on the usability of hashs in certain usecases.

What I understood so far is that most hashs are implemented incrementally in the very core; the execution-timing on the other hand is not that easy to predict.

I present you two Hasher implementations which rely on "an existent free implementation in Java". Both implementations are constructed in a way that you can arbitrarily split your Strings before calling add() and get the same result as long as you do not change the order of the characters in them:

import java.math.BigInteger;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/**
 * Created for https://stackoverflow.com/q/26928529/1266906.
 */
public class Hashs {

    public static class JavaHasher {
        private int hashCode;

        public JavaHasher() {
            hashCode = 0;
        }

        public void add(String value) {
            hashCode = 31 * hashCode + value.hashCode();
        }

        public int create() {
            return hashCode;
        }
    }

    public static class ShaHasher {
        public static final Charset UTF_8 = Charset.forName("UTF-8");
        private final MessageDigest messageDigest;

        public ShaHasher() throws NoSuchAlgorithmException {
            messageDigest = MessageDigest.getInstance("SHA-256");
        }

        public void add(String value) {
            messageDigest.update(value.getBytes(UTF_8));
        }

        public byte[] create() {
            return messageDigest.digest();
        }
    }

    public static void main(String[] args) {
        javaHash();

        try {
            shaHash();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();  // TODO: implement catch
        }
    }

    private static void javaHash() {
        JavaHasher h = new JavaHasher();
        h.add("somestring");
        h.add("another part");
        h.add("eveno more");
        int hash = h.create();
        System.out.println(hash);
    }

    private static void shaHash() throws NoSuchAlgorithmException {
        ShaHasher h = new ShaHasher();
        h.add("somestring");
        h.add("another part");
        h.add("eveno more");
        byte[] hash = h.create();
        System.out.println(Arrays.toString(hash));
        System.out.println(new BigInteger(1, hash));
    }
}

Here obviously "SHA-256" could be replaced with other common hash-algorithms; Java ships quite a few of them.

Now you called out for a Long as return-value which would imply you are looking for a 64bit-Hash. If this really was on purpose have a look at the answers to What is a good 64bit hash function in Java for textual strings?. The accepted answer is a slight variant of the JavaHasher as String.hashCode() does basically the same calculation, but with lower overflow-boundary:

    public static class Java64Hasher {
        private long hashCode;

        public Java64Hasher() {
            hashCode = 1125899906842597L;
        }

        public void add(CharSequence value) {
            final int len = value.length();

            for(int i = 0; i < len; i++) {
                hashCode = 31*hashCode + value.charAt(i);
            }
        }

        public long create() {
            return hashCode;
        }
    }

Unto your points:

  1. fast

    With SHA-256 being slower than the other two I still would call all three presented approaches fast.

  2. can be used incremental without compromising the other properties or keeping the strings in memory during the complete process.

    I can not guarantee that property for the ShaHasher as I understand it is block-based and I lack the source code.Still I would suggest that at most one block, the hash and some internal states are kept. The other two obviously only store the partial hash between calls to add()

  3. Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be neglectable.

    For every hash there are collisions. Given a good distribution the bit-size of the hash is the main factor on how often a collision happens. The JavaHasher is used in e.g. HashMaps and seems to be "collision-free" enough to distribute similar keys far apart each other. As for any deeper analysis: do your own tests or ask your local security engineer - sorry.

I hope this gives a good starting point, details are probably mainly opinion-based.

Upvotes: 3

Related Questions