user130076

Is there a hash function for binary data which produces closer hashes when the data is more similar?

I'm looking for something like a hash function, but whose output is closer the closer two different inputs are.

Something like:

f(1010101) = 0 #original hash

f(1010111) = 1 #very close to the original hash as they differ by one bit
f(0101010) = 9999 #not very close to the original hash, as all the bits are different

(example outputs for demonstration purposes only)

All of the input data will be of the same length.

I want to compare a file against lots of other files and determine which of them has the fewest differences from it.

Upvotes: 3

Views: 2589

Answers (7)

jwilkins

Reputation: 401

What you're looking for is a file fingerprint of sorts. For plain text, something like Nilsimsa (http://ixazon.dynip.com/~cmeclax/nilsimsa.html) works reasonably well.

There are a variety of names for this type of technique: fuzzy hashing, locality-sensitive hashing, distance-based hashing, dimensionality reduction, and a few others. Tools may generate fixed-length or variable-length output, but the outputs are generally comparable (e.g. by Levenshtein distance), and similar inputs yield similar outputs.

The nilsimsa link above gives two similar spam messages; here are the example outputs:

773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025 spam1
47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645 spam2
 *  * ** *** * ** ** ** **     *  *******  **** **     *    *  *

Spamsum and sdhash are more useful for arbitrary binary data. There are also algorithms specifically for images that will work regardless of whether the image is a JPG or a PNG. Identical images in different formats wouldn't be noticed by, e.g., spamsum.
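
For illustration, here is a minimal Python sketch (my own helper, not part of nilsimsa) that compares two equal-length hex digests character by character and reproduces a marker row like the one above. Note that a real nilsimsa comparison counts matching bits, not hex characters:

    def digest_similarity(hex_a, hex_b):
        # Mark each position where the two equal-length digests agree,
        # mirroring the asterisk row in the nilsimsa example above.
        marks = "".join("*" if a == b else " " for a, b in zip(hex_a, hex_b))
        return marks, marks.count("*")

    spam1 = "773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025"
    spam2 = "47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645"
    marks, matches = digest_similarity(spam1, spam2)
    print(marks)
    print(matches, "of", len(spam1), "hex characters match")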

Upvotes: 0

ffriend

Reputation: 28492

You can represent your data as a binary vector of features and then use dimensionality reduction either with SVD or with random indexing.
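
As a rough illustration of the random-projection idea, here is a toy Python sketch (function and parameter names are my own, not any particular library's API):

    import random

    def random_projection(bits, n_out=16, seed=42):
        # Project a binary feature vector onto n_out fixed random +/-1
        # hyperplanes; the resulting sign pattern is a much shorter code
        # in which similar inputs tend to agree in most positions.
        rng = random.Random(seed)  # fixed seed: every file uses the same planes
        planes = [[rng.choice((-1, 1)) for _ in bits] for _ in range(n_out)]
        return [1 if sum(p for p, b in zip(plane, bits) if b) > 0 else 0
                for plane in planes]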

Upvotes: 0

user555045

Reputation: 64904

You could calculate the population count of the XOR of the two files, which is exactly the number of bits that are not the same between the two files. So it just does precisely what you asked for, no approximations.
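
A minimal Python sketch of that idea (the file paths are placeholders; assumes Python 3, where iterating over bytes yields integers):

    def bit_difference(path_a, path_b):
        # XOR the files byte by byte, then count the set bits (the
        # population count): exactly the number of differing bits.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

To find the most similar file, compute this against every candidate and take the minimum.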

Upvotes: 0

phs

Reputation: 11051

You might be interested in either simhashing or shingling.

If you are only trying to detect similarity between documents, there are other techniques that may suit you better (like TF-IDF). The second link is part of a good book whose other chapters delve into general information-retrieval topics, including these other techniques.
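
To show the shape of the simhash technique, here is a toy Python version over byte shingles (the shingle length and fingerprint width are arbitrary choices of mine):

    import hashlib

    def simhash(data, n_bits=64, shingle_len=4):
        # Each overlapping byte shingle is hashed and votes +1/-1 on
        # every output bit; the fingerprint keeps the majority sign.
        counts = [0] * n_bits
        for i in range(len(data) - shingle_len + 1):
            shingle = data[i:i + shingle_len]
            h = int.from_bytes(hashlib.md5(shingle).digest()[:8], "big")
            for bit in range(n_bits):
                counts[bit] += 1 if (h >> bit) & 1 else -1
        return sum(1 << bit for bit in range(n_bits) if counts[bit] > 0)

Similar inputs then yield fingerprints with a small Hamming distance, e.g. bin(simhash(a) ^ simhash(b)).count("1").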

Upvotes: 1

Trott

Reputation: 70065

You might want to look at the source code of Unix utilities like cmp, or the filecmp module in Python, and use that to try to determine a reasonable algorithm.

In my uninformed opinion, calculating a hash is not likely to work well. First, it can be expensive to calculate a hash. Second, what you're trying to do sounds more like a job for encoding than a hash; once you start thinking of it that way, it's not clear that it's even worth transforming the file that way.

If you have some constraints, specifying them might be useful. For example, if all the files are exactly the same length, that may simplify things. Also consider whether you are only interested in differences between bits in the same position, or also in similarities that only show up when you compare bits in different positions (e.g., two files are identical except that one has everything shifted by three bits; should those be considered similar or not?).

Upvotes: 0

Michael

Reputation: 53

You could try the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance

Since it works on strings only, you could convert your binary data to strings, for example: byte 0 -> "00000000", byte 1 -> "00000001".
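
A compact dynamic-programming implementation of that distance in Python, with the suggested bytes-to-bit-string conversion:

    def levenshtein(a, b):
        # Classic two-row edit-distance DP.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def to_bits(data):
        # Convert bytes to a bit string, as suggested above.
        return "".join(format(byte, "08b") for byte in data)

    print(levenshtein(to_bits(b"\x00"), to_bits(b"\x01")))  # -> 1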

Upvotes: 1

bokan

Reputation: 3692

You should not use a hash for this.

You should instead compute signatures containing several characteristic values, like:

  • file name
  • file size
  • Is binary / Is ascii only
  • date (if needed)

and some more complex ones, like:

  • variance of the values of bytes
  • average value of bytes
  • average length of runs of identical bits (in compressed files there are no long runs of identical bits)
  • ...

Then you can compare signatures (a sketch follows below).

But the most important thing is to know what kind of data is in these files. If it is images, the size and main colour are more important. If it is sound, you could analyse only certain frequencies...
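
A minimal sketch of such a signature in Python (the fields are just a few examples from the lists above):

    from statistics import mean, pvariance

    def signature(path):
        # A few cheap characteristic values per file; two files can
        # then be compared field by field instead of byte by byte.
        with open(path, "rb") as f:
            data = f.read()
        return {
            "size": len(data),
            "is_ascii": all(byte < 128 for byte in data),
            "mean_byte": mean(data) if data else 0.0,
            "byte_variance": pvariance(data) if len(data) > 1 else 0.0,
        }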

Upvotes: 0
