smartcaveman

Reputation: 42246

Which hash algorithms support binary input of arbitrary bit length?

Background

In theory, hash functions produce a binary number having bounded (often fixed) length from binary data of arbitrary length.

In practice, specifications and implementations constrain the input length to be a multiple of 8. This is reasonable because the binary data that clients hash is almost always represented as bytes rather than bits.

Suppose that we want to use hash algorithms in a context that cannot assume the input binary data is chunked into bytes.

For example, we want to compute a hash for a 13-bit data stream 0101111101111.

Further, assume that the value intended by any binary fragment whose length is not divisible by 8 is unambiguously distinct from any padded, byte-divisible representation of it.

For example, our system requires that the 13-bit 0101111101111 means something fundamentally different from any 16-bit value (such as the left-padded 0000101111101111 or right-padded 0101111101111000). Consequently, their hashes should be distinct.
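To make the ambiguity concrete, here is a short Python sketch (the helper name naive_pad_to_bytes is purely illustrative) showing that naively right-zero-padding the 13-bit stream produces exactly the same bytes as the 16-bit right-padded value, so any byte-oriented hash collides on them:

    import hashlib

    # Illustrative helper: naive right-padding of a bit string to whole bytes.
    def naive_pad_to_bytes(bits: str) -> bytes:
        padded = bits + "0" * (-len(bits) % 8)
        return int(padded, 2).to_bytes(len(padded) // 8, "big")

    thirteen_bits = "0101111101111"      # 13-bit value
    sixteen_bits = "0101111101111000"    # 16-bit right-padded value

    # Both encode to the same two bytes, so a byte-oriented hash cannot tell them apart.
    assert naive_pad_to_bytes(thirteen_bits) == naive_pad_to_bytes(sixteen_bits)
    print(hashlib.sha256(naive_pad_to_bytes(thirteen_bits)).hexdigest())
    print(hashlib.sha256(naive_pad_to_bytes(sixteen_bits)).hexdigest())  # identical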

Question

Which hash algorithms support binary input of arbitrary bit length?

Note

I recognize that the distinction between arbitrary-bit-length binary data and padded byte chunks could be retrofitted onto byte-divisible input by tagging the padded data with a byte-divisible representation of its original bit length. If a less redundant approach is available, it would seem preferable.
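A sketch of that length-tagging idea, assuming SHA-256 as the underlying byte-oriented hash (the function name and the fixed 64-bit length field are illustrative choices, not part of any standard):

    import hashlib

    # Sketch: pad the bits to whole bytes, then append the original bit length
    # as a fixed-size field so distinct bit strings map to distinct byte strings.
    def hash_bits_with_length_tag(bits: str) -> str:
        padded = bits + "0" * (-len(bits) % 8)
        data = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
        tag = len(bits).to_bytes(8, "big")  # 64-bit length field (an assumption)
        return hashlib.sha256(data + tag).hexdigest()

    print(hash_bits_with_length_tag("0101111101111"))     # 13-bit input
    print(hash_bits_with_length_tag("0101111101111000"))  # same padded bytes, different tag, different hash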

Upvotes: 1

Views: 67

Answers (1)

btilly

Reputation: 46389

Do as @MattTimmermans suggested: add a 1 bit at the end, then pad with 0 bits to fill out the byte.
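A minimal Python sketch of that padding scheme, assuming SHA-256 as the byte-oriented hash (the helper name pad_bits is illustrative). Because every encoded input ends with exactly one 1 followed only by padding 0s, the mapping from bit strings to byte strings is injective, so distinct bit strings get distinct hashes:

    import hashlib

    # Append a 1 bit, then pad with 0 bits to the next byte boundary.
    def pad_bits(bits: str) -> bytes:
        padded = bits + "1"
        padded += "0" * (-len(padded) % 8)
        return int(padded, 2).to_bytes(len(padded) // 8, "big")

    print(hashlib.sha256(pad_bits("0101111101111")).hexdigest())     # 13-bit input
    print(hashlib.sha256(pad_bits("0101111101111000")).hexdigest())  # 16-bit input, different hash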

Then if you feel that this is wasteful, you can simply truncate the last few bits off of the hash value that you get. You'll increase the odds of random collisions slightly, but it will still be a usable hash.
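For example (a sketch: b"\x5f\x7c" is the padded encoding of the 13-bit example above, and keeping 64 bits is an arbitrary choice):

    import hashlib

    # Truncate the digest: keep only the first k bits; the random-collision
    # probability rises from roughly 2**-256 toward 2**-k.
    full = hashlib.sha256(b"\x5f\x7c").digest()
    truncated_64_bits = full[:8]  # first 64 bits of the digest
    print(truncated_64_bits.hex())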

Specifying the actual computation in terms of bits doesn't make much sense. CPUs have highly optimized operations for common data types of specific lengths, and a good hashing algorithm takes advantage of them. Trying to roll your own with bit-specific operations is a lot of extra work that forgoes the optimizations the chip already provides.

Upvotes: -1
