Reputation: 42246
In theory, hash functions produce a binary number having bounded (often fixed) length from binary data of arbitrary length.
In practice, specifications and implementations constrain the input length to be a multiple of 8. This is reasonable because, in practice, the binary data that clients hash is represented in chunks of bytes (not bits).
Suppose that we want to use hash algorithms in a context that cannot assume the input binary data is chunked into bytes.
For example, we want to compute a hash for a 13-bit data stream
0101111101111
.
Further, assume that the value intended by any binary fragment of a length indivisible by 8 is unambiguously distinct from any padded byte-divisible representation.
For example, our system requires that the 13-bit
0101111101111
means something fundamentally different from any 16-bit value (such as the left-padded0000101111101111
or right-padded0101111101111000
). Consequently, their hashes should be distinct.
What established hash algorithms are compatible with input binary data of arbitrary bit length?
What prior work and implementations exist that directly address this limitation of many common algorithms and libraries?
I recognize that the distinction between truncated arbitrary-length binary data and padded byte-chunks could be retrofitted into byte-divisible input by tagging the padded input data with a byte-divisible representation of its truncated length. If a less redundant approach is available, it would seem preferable.
Upvotes: 1
Views: 67
Reputation: 46389
Do as @MattTimmermans suggested. Add a 1
at the end, then pad to fill out the byte.
Then if you feel that this is wasteful, you can simply truncate the last few bits off of the hash value that you get. You'll increase the odds of random collisions slightly, but it will still be a usable hash.
Specifying the actual computation in terms of bits doesn't make much sense. CPUs have highly optimized operations corresponding to common data types of specific lengths. A good hashing algorithm will take advantage of this. Trying to roll one with bit specific optimizations is a lot of extra work that won't take advantage of the hardcoded optimizations that the chip already has.
Upvotes: -1