neoaggelos

Reputation: 640

Fast way to check if a large set of file pairs are identical

I'm building a small sync utility in C++, for personal use mostly.

Imagine we have two directories 'A' and 'B' that will be synced. At some point, the new files from A will have to be copied to B. The logic I used so far was:

browse directory 'A'
for each 'A/afile'
    copy A/afile to B/afile
endfor
for each 'A/adirectory'
    recurse into 'A/adirectory'
endfor

This worked well, until I noticed that with the method above, ALL of the files from A are copied EVERY TIME to B. So, I would only like to perform the copy operation if A/afile and B/afile are different.

So, my question is, how can I compare them in a fast and cross-platform (hopefully) manner? Will something like computing MD5 checksums for each file be fast?

The point is, since the file comparison will probably be done for a large number of file pairs, I want something that is both reliable and fast. And by fast I mean that the 'heavy and time-consuming' task should be the actual copy operation, not the file checking.

PS. I also tried finding 'tricks', like comparing file size and modification times, with no success.


EDIT

After taking into consideration the answers below, what I will finally go with to check if the two files are the same is:

if optimize_speed then
      if A/afile is newer then no (because A/afile is the 'source' file)
      if B/afile is newer then compare byte-to-byte and decide 
else
      compare byte-to-byte and decide
end
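With C++17's std::filesystem this decision might be sketched as follows (a rough sketch; `same_contents` and `needs_copy` are hypothetical names, and a missing B/afile is treated as "copy"):

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Byte-to-byte comparison (sizes are checked first as a shortcut).
bool same_contents(const fs::path& a, const fs::path& b) {
    if (fs::file_size(a) != fs::file_size(b)) return false;
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    char ba[8192], bb[8192];
    while (fa && fb) {
        fa.read(ba, sizeof ba);
        fb.read(bb, sizeof bb);
        if (fa.gcount() != fb.gcount() ||
            !std::equal(ba, ba + fa.gcount(), bb))
            return false;
    }
    return true;
}

// The decision above: a missing B/afile always means "copy"; when
// optimizing for speed, a newer source file is assumed changed;
// otherwise fall back to the full byte-to-byte comparison.
bool needs_copy(const fs::path& src, const fs::path& dst, bool optimize_speed) {
    if (!fs::exists(dst)) return true;
    if (optimize_speed && fs::last_write_time(src) > fs::last_write_time(dst))
        return true;
    return !same_contents(src, dst);
}
```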

Upvotes: 2

Views: 1726

Answers (4)

user1812457

Reputation:

Since there is a bit of misinformation in each of the answers so far, here is my take on it:

  1. If you really want to know that two files are identical, you have to compare them byte by byte. The chance that a checksum is the same but the files are different is very, very, very small, with a good checksum. However, computing a checksum is almost certainly slower than comparing file contents directly when both files are local. (The reason why rsync does not compare file contents is that it is meant for syncing remote files.)

  2. If there is very little probability that you touch a file or otherwise change its time stamp without changing its contents, then go ahead and only compare time stamps. In some rare cases you will copy a file that has not changed, but you won't have to compare file contents unnecessarily.

  3. Comparing sizes is not a good idea, especially since a file's contents can change without its size changing.

  4. Yes, comparing file contents means reading both files. There are ways to make this more efficient, but it will still take time roughly linear in the size of the smaller file. If you really want to do it, consider using an existing command-line tool like cmp.

Here is one way to invoke cmp:

cmp --silent file1 file2

This will tell you if two files are identical in contents (exit status 0) or different (exit status 1) or if something is amiss, like one of the two files does not exist (exit status 2). A bash script that takes two arguments and copies the first to the second if they are different:

if cmp --silent "$1" "$2"
then
    :
else
    cp "$1" "$2"
fi

Take home message: figure out what your use case is before implementing a solution.

Upvotes: 1

rbaleksandar

Reputation: 9701

Comparing timestamps is only part of the solution, and far from a complete one. I strongly disagree with @KemyLand.

Imagine the following situation:

You have two files F1 and F2, both with the same content and, say, a modification timestamp of 14:00:00. You then change F1 and save it, so its modification timestamp becomes 14:02:00. Before the next sync event, however, you reverse the change in F1 (for example, you had deleted a string but decided to add it back), and its modification timestamp changes again, to 14:06:00. Here we have a problem: even though F1's modification timestamp is 14:06:00 and F2's is 14:00:00, the content itself is the same!

A timestamp difference can, however, be a trigger to initiate a more in-depth check of the files to be synced. The content of the files still has to be checked, and hashing is the best way to do that.

Calculating hashes for a large set of file pairs can be a pain. What you can do, however, is scale your tool and optimize the use of the resources at your disposal. For example, if you have 2 cores, use both; if you have 8 cores, use all 8, and so on.

If you use timestamps as your only check, you might end up needlessly copying files. Computing a hash for a pair of files is much better than blindly copying heaps of files just because a timestamp has changed while their contents have remained the same.

Use a combination of triggering events for more in-depth checks (timestamp changes, file size changes, etc.), but always hash (or at least compare using something like cmp) to make sure you don't fall into one of the many scenarios (the one above is just the tip of the iceberg) where you waste reads and writes on your data. Calculating hashes for a large set of files is much more efficient than copying the data, whether your storage is an HDD or an SSD. The more files you have to move around, the worse the read/write cost gets, and the size of each file matters too.
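The multi-core suggestion can be sketched with std::async, one hashing task per file. The FNV-1a hash below is only an illustrative stand-in, not a cryptographic hash; a real tool would use MD5/SHA from a library such as OpenSSL (`file_hash` and `hash_all` are made-up names):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <future>
#include <vector>

namespace fs = std::filesystem;

// Illustrative FNV-1a hash of a whole file. NOT cryptographic --
// use MD5/SHA from a real library in a production sync tool.
std::uint64_t file_hash(const fs::path& p) {
    std::uint64_t h = 1469598103934665603ull;
    std::ifstream in(p, std::ios::binary);
    char buf[8192];
    while (in.read(buf, sizeof buf), in.gcount() > 0)
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            h = (h ^ static_cast<unsigned char>(buf[i])) * 1099511628211ull;
    return h;
}

// Hash many files concurrently, one task per file; std::async lets
// the runtime spread the work across the available cores.
std::vector<std::uint64_t> hash_all(const std::vector<fs::path>& files) {
    std::vector<std::future<std::uint64_t>> tasks;
    for (const auto& f : files)
        tasks.push_back(std::async(std::launch::async, file_hash, f));
    std::vector<std::uint64_t> out;
    for (auto& t : tasks) out.push_back(t.get());
    return out;
}
```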

Upvotes: 0

3442

Reputation: 8576

Given any pair of synchronizable files A and B, synchronization is required whenever the modification timestamps of the two files differ.

The problem is, ahem... timestamps are not part of the C++ standard, so you'll need something like Boost or Qt for cross-platform purposes.

The other way is, of course, to ignore portability and take the POSIX solution (PS: remember to check return values!):

#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

struct stat statOfA;
struct stat statOfB;
if(stat(pathOfA, &statOfA) != 0 || stat(pathOfB, &statOfB) != 0) {
    // Handle the error: one of the files may not exist.
}

if(statOfA.st_mtime > statOfB.st_mtime) {
    // Sync! Then stamp both files with the same time:
    struct timeval now;
    gettimeofday(&now, NULL);    // nullptr is preferred in C++11...

    struct timeval copys[] = { now, now };    // { access time, modification time }
    utimes(pathOfA, copys);
    utimes(pathOfB, copys);
}

Edit: You may see GetSystemTime(), SystemTimeToFileTime(), GetFileTime(), and SetFileTime() if you need to use the Windows API.

Upvotes: 2

jbm

Reputation: 3163

It's going to be a tradeoff between speed and reliability. You want to try the fastest method first, then fall back to something more precise. Here is the algorithm followed by fdupes:

  • compare file sizes
  • => if different, then action (in your case, copy)
  • compare MD5 signatures
  • => if different, copy
  • compare byte-by-byte
  • => if different, copy
  • else do nothing

Preparing this answer, I just learned that fdupes now adds an intermediate step with a partial MD5:

http://en.wikipedia.org/wiki/Fdupes
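The tiered check above can be sketched in C++17. MD5 needs an external library, so the hash step is replaced here by a comparison of the first 4 KiB, which plays the same role of rejecting most differing pairs cheaply (`identical` is a made-up name):

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// fdupes-style tiered check: cheap tests first, full compare last.
// The partial-MD5 step is stood in for by comparing the first 4 KiB.
bool identical(const fs::path& a, const fs::path& b) {
    if (fs::file_size(a) != fs::file_size(b)) return false;  // step 1: sizes
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    char ba[4096], bb[4096];
    fa.read(ba, sizeof ba); fb.read(bb, sizeof bb);          // step 2: first block
    if (fa.gcount() != fb.gcount() ||
        !std::equal(ba, ba + fa.gcount(), bb)) return false;
    while (fa && fb) {                                       // step 3: rest, byte by byte
        fa.read(ba, sizeof ba); fb.read(bb, sizeof bb);
        if (fa.gcount() != fb.gcount() ||
            !std::equal(ba, ba + fa.gcount(), bb)) return false;
    }
    return true;
}
```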

Upvotes: 2
