Litovchenko Ivan

Reputation: 1

How to first check files on equality before doing a byte by byte comparison?

I am writing a program that compares a lot of files.

I first group files by file size. Then I compare the files within each group byte by byte. What parameters or properties can I check before the byte-by-byte comparison to minimize how often I need it?

Update: To get a checksum I need to read the entire file. I am looking for some property that can filter out unequal files. I forgot to say that I need the files to be 100% equal, and hash functions have collisions.

Upvotes: 0

Views: 277

Answers (2)

Tyler Durden

Reputation: 11562

If the operating system reports the files as being the same size, then there is no way to know whether they differ other than by checking their bytes.

For a group of files, once two files are known to be the same, then the comparison only needs to be done for one of the two. It would be wise to sort the files in a group by date for this reason, on the theory that files with similar dates are more likely to be identical. Thus, you should maintain lists of identical files. When a new comparison is done it need only be compared to the head of the list.
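The list-head idea above can be sketched in Python as follows. This is only an illustration: the function name group_identical is made up here, and filecmp.cmp with shallow=False is used as a stand-in for whatever byte comparison you already have.

```python
import filecmp

def group_identical(paths):
    """Partition same-size files into lists of byte-identical files.

    Hypothetical helper: each inner list holds identical files, and its
    first element is the "head" that new files are compared against.
    """
    groups = []
    for path in paths:
        for group in groups:
            # shallow=False forces a content comparison rather than
            # trusting the os.stat() signature alone.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing head matched, so this file starts a new list.
            groups.append([path])
    return groups
```

With k distinct contents among n files, each file is compared against at most k heads instead of against every other file.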

You should allocate as much memory as possible up front and keep the list heads in memory.

When the comparison is being done you should not actually compare bytes, but words. For example, on a 32-bit machine you would read data in 512-byte blocks from the hard drive and then compare each block 4 bytes at a time. x86 processors also have vector (SIMD) instructions, originally MMX and now SSE2 and AVX2, which compare 16 or more bytes per instruction. You want to be sure you are using those.
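In a high-level language you would not write the word-level loop yourself. A minimal Python sketch of the block-wise comparison, assuming regular files on disk, might look like this. The name files_equal and the 64 KiB block size are arbitrary choices for illustration; in CPython the comparison of two bytes objects is carried out in native code, so the per-word work is left to the runtime.

```python
def files_equal(path_a, path_b, block_size=1 << 16):
    """Return True if both files have identical content.

    Reads both files block by block and stops at the first differing
    block, so unequal files are usually rejected without a full read.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False  # mismatch, or one file ended early
            if not block_a:
                return True   # both files reached EOF together
```

The early return is what makes this cheap for files that differ near the start.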

If you are writing in C for an Intel box, use Intel's compiler, not Microsoft's. Double check the assembly to make sure the compiler is not doing something stupid.

You can also increase the speed of the work by parallelizing it. This is done by creating threads. For example, if the code is running on a quad core machine you create 4 threads and divide the work among the 4 threads.
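As a rough sketch of dividing the work among threads, the Python standard library's ThreadPoolExecutor can run the comparisons concurrently. The function compare_pairs and the default of 4 workers are assumptions made for this example, and filecmp.cmp again stands in for the real comparison routine.

```python
import filecmp
from concurrent.futures import ThreadPoolExecutor

def compare_pairs(pairs, workers=4):
    """Compare (path_a, path_b) pairs using a pool of worker threads.

    Returns a list of booleans in the same order as the input pairs.
    """
    def same(pair):
        return filecmp.cmp(pair[0], pair[1], shallow=False)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands pairs to idle workers and preserves input order.
        return list(pool.map(same, pairs))
```

Whether this helps depends on the storage: on a single spinning disk the threads may mostly compete for the same drive head.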

Upvotes: 2

Jakub M.

Reputation: 33867

Check the file's checksum. It was meant for this task.

For Python you can use hashlib. For C you can use, for example, MD5 from OpenSSL. There are similar functions for PHP, MySQL, and probably every other programming language.
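A minimal hashlib sketch of this approach is below. The names file_digest and group_by_digest are invented for the example, and it defaults to SHA-256 rather than MD5 (pass "md5" to match the suggestion above, where your Python build allows it). Since every file is read once in full, and since distinct files can in principle share a digest, equal digests can be followed by a byte comparison when 100% certainty is required.

```python
import hashlib
from collections import defaultdict

def file_digest(path, algorithm="sha256", block_size=1 << 16):
    """Return the hex digest of a file, reading it in fixed-size blocks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def group_by_digest(paths, algorithm="sha256"):
    """Return lists of paths that share a digest (candidate duplicates)."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path, algorithm)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

With n files of the same size this needs n reads instead of up to n*(n-1)/2 pairwise comparisons.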

Alternatively, you can use the md5sum utility that ships with Linux.

Upvotes: 0
