Reputation: 639
There are a bunch of threads about rsync checksums, but none seems to address this need, which (at least in my case) would be the most effective and fastest way to sync:
I noticed that the option --checksum
can make mirroring a folder take a really long time if there are a lot of files. Using this option alone runs a checksum on every single file, which is very safe but very slow. Besides, it adds read overhead just to compute the checksums.
The option --ignore-times
is not what I want: if time and size both match, the chance that the files are different is insignificant, and I'm willing to take the risk of not transferring them.
The option --size-only
is incomplete, as there is a good chance that files with the same size but different times are actually different files (e.g. changing one character in a file may not affect its size, just its modification time).
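For reference, the invocations I'm comparing look roughly like this (src/ and dst/ being placeholder paths):

    rsync -a --checksum     src/ dst/   # checksums every file: safe but slow
    rsync -a --ignore-times src/ dst/   # treats every file as changed, even when size and time match
    rsync -a --size-only    src/ dst/   # skips same-size files even when their times differ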
Is there a way to perform the mirroring with the combination described above, either with rsync (did I miss something in the manpages?) or with any other Linux tool?
Thanks.
Upvotes: 30
Views: 17185
Reputation: 251
When determining whether to transfer files (or, with --dry-run, whether to list them), rsync will always transfer files that differ in file size. However, when files are the same size, rsync has several options:
--size-only: never transfer files
--ignore-times: always transfer files
--checksum: calculate checksums and transfer files if they differ

The behavior that you want would be a combination of the last two: "if timestamps differ, calculate checksums and transfer files if the checksums differ as well". This is not currently an option in rsync.
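Spelled out, the wanted per-file rule would look something like this sketch (a rough illustration using GNU coreutils stat and md5sum, not how rsync itself is implemented; paths are placeholders):

    # Returns success (0) when the file should be transferred.
    needs_transfer() {
      local src=$1 dst=$2
      [ -e "$dst" ] || return 0                                          # missing on destination -> transfer
      [ "$(stat -c %s "$src")" != "$(stat -c %s "$dst")" ] && return 0   # size differs -> transfer
      [ "$(stat -c %Y "$src")" = "$(stat -c %Y "$dst")" ] && return 1    # same mtime -> skip
      # same size, different mtime: only now pay for a checksum
      [ "$(md5sum < "$src")" != "$(md5sum < "$dst")" ]
    }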
Unfortunately, looking at the rsync source code, it appears it would be non-trivial to add this functionality. Currently, if checksums are used, the remote rsync gathers size, timestamp and checksum information and sends them all together. The desired behavior would require that the remote rsync first send over the size and timestamp, and, when the local rsync determines that a checksum is needed, return to the file to get the checksum. But the whole "remote rsync returns to the file" aspect is not present in the current code, and would first need to be written.
When you run an actual transfer, the second step can effectively be done during the transfer process: transferring files that do not differ is very efficient, so the default behaviour of rsync would suffice. When using --dry-run, the best approach would probably be to run rsync with the default behaviour first, gather the --dry-run output, and then run rsync again with --checksum on the files found in the first run.
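A sketch of that two-pass idea, assuming a push from src/ to dst/ (both placeholders) and an rsync new enough to support --out-format and --files-from:

    # Pass 1: default quick check (size + mtime); only list what would be sent:
    rsync -a --dry-run --out-format='%n' src/ dst/ > candidates.txt

    # Pass 2: re-examine only those candidates with full checksums
    # (drop --dry-run to actually transfer; directory names may show up in the list too):
    rsync -a --dry-run --checksum --files-from=candidates.txt src/ dst/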
Upvotes: 25
Reputation: 583
The short answer... it does.
same time and same size ► skip file (no transfer, no checksum)
Good and fast, but not exact; rsync offers that by default. A file could be modified while its time and size stay the same (times can be reset). You can use -c if paranoid.
different sizes ► transfer file (no checksum)
Simplistic... what if it's a 2 GB file and the only difference is one line at the end? The checksum can figure that out and spare the network traffic. You can use -c if you trust the time/size comparison.
different times and same size ► perform checksum ► transfer only if checksums differ
Of course.
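If in doubt, a dry run with --itemize-changes shows why each file would be sent (src/ and dst/ are placeholders):

    # 's' in the output means the size differs, 't' means the modification time differs:
    rsync -a --dry-run --itemize-changes src/ dst/
    # Paranoid variant: let full checksums decide instead of time/size:
    rsync -a --dry-run --itemize-changes --checksum src/ dst/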
I don't see it mentioned, but I remember rsync used to have an issue when there were over ... I think it was around 130,000 files. Maybe that issue has been fixed.
If you do have that many files in one directory you probably have bigger problems... spread them out over different directories and run multiple rsyncs over those directories.
Lots of small files cause a lot of internal fragmentation on most filesystems, and you might be better off archiving the files and rsyncing the archive... you need an archiver that allows updating the archive rather than re-creating it every time.
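A rough sketch of that with GNU tar's update mode, which only works on uncompressed archives (names and the remote are placeholders):

    # Append only files newer than their copies already in the archive
    # (note: -u appends, so the archive grows until you re-create it):
    tar -uf bundle.tar smallfiles/
    # Sync the single archive; rsync's delta transfer keeps the update cheap:
    rsync -a bundle.tar user@host:backups/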
Maybe, if not many of these files are updated... find the ones changed after a given date (find -newer file) and then rsync just those files (if you trust the times).
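Something along these lines, for example (paths and the remote are placeholders; assumes a marker file and rsync's --files-from):

    # Collect files modified since the last sync marker, sync just those,
    # then bump the marker (create .last-sync with touch before the first run):
    cd /path/to/src &&
      find . -type f -newer .last-sync -print > /tmp/changed.list &&
      rsync -a --files-from=/tmp/changed.list . user@host:/path/to/dst/ &&
      touch .last-sync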
Why was this question ignored so long?
Upvotes: 5