Reputation: 1
I am writing a program that compares a lot of files.
I first group the files by file size, then compare the files within each group byte by byte. What parameters or properties can I check before the byte-by-byte comparison to minimize how often I need it?
Update: To compute a checksum I need to read the entire file. I am looking for a property that can filter out unequal files. I forgot to say that I need the files to be 100% equal, and hash functions have collisions.
Upvotes: 0
Views: 277
Reputation: 11562
If the files are recorded as being the same size by the operating system then there is no way to know if they are different other than checking bytes.
For a group of files, once two files are known to be the same, then the comparison only needs to be done for one of the two. It would be wise to sort the files in a group by date for this reason, on the theory that files with similar dates are more likely to be identical. Thus, you should maintain lists of identical files. When a new comparison is done it need only be compared to the head of the list.
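The list-of-identical-files idea can be sketched roughly as follows in Python. This is only an illustration, not a tuned implementation: the function name `group_identical` is made up here, and the standard library's `filecmp.cmp(..., shallow=False)` is assumed as the content comparison.

```python
import filecmp
from collections import defaultdict
from pathlib import Path


def group_identical(paths):
    """Partition files into lists whose members are byte-identical."""
    # First pass: files of different sizes can never be equal.
    by_size = defaultdict(list)
    for p in paths:
        by_size[Path(p).stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        # Heuristic from the answer: order by modification time.
        same_size.sort(key=lambda p: Path(p).stat().st_mtime)
        classes = []  # each entry is a list; element 0 is the list head
        for p in same_size:
            # Try the most recently created list first, since files
            # with nearby dates are assumed more likely to match.
            for cls in reversed(classes):
                # Compare against the head only, not every member.
                if filecmp.cmp(cls[0], p, shallow=False):
                    cls.append(p)
                    break
            else:
                classes.append([p])  # no match: p starts a new list
        groups.extend(classes)
    return groups
```

Each new file is compared with at most one file per list, so a group of many identical files costs one comparison per file rather than one per pair.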
You should allocate as much memory as possible up front and keep the list heads in memory.
When the comparison is being done you should not actually compare bytes, but words. For example, on a 32-bit machine you would read data in 512-byte blocks from the hard drive and then compare each block 4 bytes at a time. Newer x86 processors also have vector (SIMD) instructions, such as MMX and its successors SSE and AVX. You want to be sure you are using those.
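In a higher-level language the same idea becomes "compare whole blocks, not single bytes". A minimal Python sketch, assuming a hypothetical helper `files_equal` and an arbitrary 64 KiB block size (in CPython the comparison of two `bytes` objects runs in native code, so there is no per-byte loop in Python):

```python
def files_equal(path_a, path_b, block_size=1 << 16):
    """Return True only if both files have identical contents."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False  # stop at the first differing block
            if not block_a:
                return True   # both files ended at the same point
```

The early return matters in practice: files that differ near the start are rejected after reading a single block from each.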
If you are writing in C for an Intel box, use Intel's compiler, not Microsoft's. Double check the assembly to make sure the compiler is not doing something stupid.
You can also increase the speed of the work by parallelizing it. This is done by creating threads. For example, if the code is running on a quad core machine you create 4 threads and divide the work among the 4 threads.
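As a rough illustration of dividing the work among threads, the sketch below uses Python's `concurrent.futures.ThreadPoolExecutor`. The function name `compare_pairs` and the default of 4 workers are assumptions for the example; whether threads actually help depends on whether the disk or the CPU is the bottleneck.

```python
import filecmp
from concurrent.futures import ThreadPoolExecutor


def compare_pairs(pairs, workers=4):
    """Compare (path_a, path_b) pairs concurrently.

    Returns a list of booleans in the same order as `pairs`.
    """
    def same(pair):
        # shallow=False forces a comparison of the file contents.
        return filecmp.cmp(pair[0], pair[1], shallow=False)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(same, pairs))
```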
Upvotes: 2
Reputation: 33867
Check the files' checksums. They were meant for this task.
For Python you can use hashlib. For C you can use, for example, MD5 from OpenSSL. There are similar functions for PHP, MySQL, and probably every other programming language.
Alternatively, you can use the md5sum command-line utility on Linux.
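A minimal sketch of the hashlib approach is shown below. The helper name `file_digest` and the SHA-256 default are choices made for this example; passing `"md5"` selects the algorithm named above where it is available. Note that, as the question points out, computing a digest still reads the whole file, and equal digests do not strictly prove equal contents. Different digests do prove the files differ.

```python
import hashlib


def file_digest(path, algorithm="sha256", block_size=1 << 16):
    """Return the hex digest of a file, read in fixed-size blocks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```

Where 100% certainty is required, files with matching digests can still be confirmed with a byte-by-byte comparison, while the digest avoids repeated pairwise reads in large groups.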
Upvotes: 0