Reputation: 1507
I have a function that compares two files to see whether they are identical. It reads both files byte by byte and compares each pair of bytes.
The problem is that for large files this function takes a long time.
Is there a better, faster way to check whether two files are the same?
Upvotes: 4
Views: 5770
Reputation: 116
If you are not familiar with hashing, search for the "MD5" or "SHA" algorithms. Hashing is an efficient way to check whether files are identical. All you need is an implementation of one of these algorithms (md5() below stands for whichever implementation you pick), and then you compare the resulting hashes; for example:
if (md5(file1Path) == md5(file2Path))
    cout << "Files are equal" << endl;
else
    cout << "Files are not equal" << endl;
Upvotes: -2
Reputation: 104698
If you really want a brute-force comparison of two files, memory-mapping them (mmap) may help.
If you know the structure of the files you are reading, compare the distinctive sections first so that you can tell the files apart quickly (e.g. a header and relevant chunks/sections). Of course, you will want to compare their basic attributes, such as size, before comparing contents.
Generate and store hashes (or some other fingerprint) if you compare the same files multiple times.
Upvotes: 2
Reputation: 12317
While there are many examples that use cryptographic hash functions such as SHA or MD5, for file comparison it's better to use a non-cryptographic hash, as it will be faster:
https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions
The FNV hash is generally considered fast enough for this purpose:
https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
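To make this concrete, here is a minimal sketch of the 64-bit FNV-1a variant applied to a file in chunks. The function names (fnv1a_64, fnv1a_64_file) and the 64 KiB buffer size are choices made for this example; it needs C++17 for std::optional. Keep in mind that equal hashes only suggest equal files, since different files can collide, so a byte comparison is still needed if you must be certain.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <ios>
#include <optional>
#include <string>
#include <vector>

// Standard 64-bit FNV parameters.
constexpr std::uint64_t kFnvOffsetBasis = 0xcbf29ce484222325ULL;
constexpr std::uint64_t kFnvPrime = 0x100000001b3ULL;

// FNV-1a over a buffer. Passing the previous result as `hash` lets the
// caller feed a long input in several chunks.
std::uint64_t fnv1a_64(const char* data, std::size_t len,
                       std::uint64_t hash = kFnvOffsetBasis)
{
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= static_cast<unsigned char>(data[i]);  // XOR the byte in first...
        hash *= kFnvPrime;                            // ...then multiply (FNV-1a order)
    }
    return hash;
}

// Hypothetical helper: hashes a whole file, or returns std::nullopt if the
// file cannot be opened or a read error occurs.
std::optional<std::uint64_t> fnv1a_64_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return std::nullopt;

    std::vector<char> buf(1 << 16);  // 64 KiB, an arbitrary choice
    std::uint64_t hash = kFnvOffsetBasis;
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        hash = fnv1a_64(buf.data(), static_cast<std::size_t>(in.gcount()), hash);
    }
    if (in.bad()) return std::nullopt;
    return hash;
}
```

Two files would then be treated as probably equal when both calls succeed and return the same value. Storing the hash alongside each file avoids re-reading files that are compared repeatedly.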
Upvotes: 0
Reputation: 39807
When your files are not the same, are they likely to be of the same size? If not, you can determine the file sizes right away (fseek to the end, ftell to determine the position), and if they're different then you know they're not the same without comparing the data. If the size is the same, remember to fseek back to the beginning.
If you read your files into large memory buffers and compare each buffer using memcmp(), you will improve performance. You don't have to read the entire file at once; just pick a large buffer size and, on each iteration of your loop, read one block of that size from each file and compare the two blocks. A typical memcmp implementation compares word-sized values (32 bits or more) at a time, rather than individual 8-bit bytes.
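The two steps above (size check first, then block-wise memcmp) could be sketched roughly as follows. This version uses std::filesystem::file_size from C++17 in place of the fseek/ftell approach, and the function name files_equal and the 1 MiB default buffer are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <ios>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: returns true if both files have identical contents.
// Returns false if either file cannot be opened or its size cannot be read.
bool files_equal(const std::string& path_a, const std::string& path_b,
                 std::size_t buffer_size = 1 << 20)  // 1 MiB, an arbitrary choice
{
    if (buffer_size == 0) buffer_size = 1;  // avoid a loop that never reads anything

    // Cheap check first: files of different sizes cannot be equal.
    std::error_code ec_a, ec_b;
    const auto size_a = std::filesystem::file_size(path_a, ec_a);
    const auto size_b = std::filesystem::file_size(path_b, ec_b);
    if (ec_a || ec_b || size_a != size_b) return false;

    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) return false;

    std::vector<char> buf_a(buffer_size), buf_b(buffer_size);
    while (true) {
        a.read(buf_a.data(), static_cast<std::streamsize>(buf_a.size()));
        b.read(buf_b.data(), static_cast<std::streamsize>(buf_b.size()));
        const std::streamsize got_a = a.gcount();
        const std::streamsize got_b = b.gcount();
        if (got_a != got_b) return false;  // one file ended early or a read failed
        if (got_a == 0) return true;       // both files exhausted with no mismatch
        if (std::memcmp(buf_a.data(), buf_b.data(),
                        static_cast<std::size_t>(got_a)) != 0)
            return false;                  // first differing block: stop immediately
    }
}
```

Because the loop returns at the first mismatching block, files that differ early are rejected after reading only a small part of each. The best buffer size depends on the disk and the system, so it is worth measuring a few values.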
Upvotes: 7
Reputation: 2026
Read the files in chunks of size X, where X is somewhere between 1 and 50 megabytes, and use memcmp() on those chunks.
Upvotes: 0