Reputation: 41
I have an archive of about 100 million binary files. New files get added regularly. The file sizes range from about 0.1 MB to about 800 MB.
I can easily determine if files are probably completely identical by comparing their sizes and if the sizes match, by comparing the hashes of the files.
I want to find files that have partly similar content. With that I mean that I believe they have some parts that are identical and some parts that can be different.
What is the best, or any realistic way to find which files are similar to which other files, and if possible get some measure of how similar they are?
Edit: The files are mostly executables. They are similar if, say, somewhere between 10% and 100% of their contents are the same as the contents of another file. The lower limit could also be set to 50%. The exact lower limit is not important. I guess some form of hashing would be needed for this comparison to be doable over such an archive.
Upvotes: 4
Views: 199
Reputation: 810
Not an easy problem. The first step is to map each file into a set of hashes, i.e., integers. Ideally you want to do that by computing the hashes of a set of substrings in each file such that the substrings are uniformly distributed throughout the file but also the likelihood that a substring occurs in dissimilar files is rare. For example, if the files were English text you could choose to split the file into substrings at all the most common English words (the, to, be, of, and, ...). To do that with the executables I would first compute what the most common byte pairs or triples of all the files are and choose the top N to split the files that hopefully generate substrings that are "not too long." Just what "not to long" is with executables is something don't have a good idea of.
Once you hash those substrings you have the problem of finding similar sets, which is called the set similarity joins problem in computer science. See my post here for methods/code to solve that problem. Good luck!
Upvotes: 1
Reputation: 850
It depends on how you will be determining similarity, if for example you could determine similarity by comparing just the first 100 bytes of each file then I guess this would be achievable but to find a particular string comparison in 100 million files that can be 800MB large would be quite infeasible.
Upvotes: 1