Reputation: 639
There are a bunch of threads about rsync checksums, but none seems to address this need, which (at least in my case) would be the most effective and fastest way to sync:
I noticed that the option --checksum
can make mirroring a folder take a really long time if there are a lot of files. Using this option alone runs a checksum on every single file, which is very safe but very slow. Besides, it adds read overhead just to compute the checksums.
The option --ignore-times
is not what I want: if time and size both match, the chance that the files are different is insignificant, and I'm willing to take the risk of not transferring them.
The option --size-only
is incomplete, as there is a good chance that files with the same size but different times are actually different files (e.g. changing one character in a file may not affect its size, just its modification time).
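For reference, the invocations I'm comparing look roughly like this (src/ and dst/ being placeholder paths):

    rsync -a --checksum     src/ dst/   # checksums every file: safe but slow
    rsync -a --ignore-times src/ dst/   # treats every file as changed, even when size and time match
    rsync -a --size-only    src/ dst/   # skips same-size files even when their times differ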
Is there a way to perform the mirroring with the combination described above, either with rsync (did I miss something in the manpages?) or with any other Linux tool?
Thanks.
Upvotes: 30
Views: 17185
Reputation: 251
When determining whether to transfer files (or, with --dry-run, whether to list them), rsync will always transfer files that differ in file size. However, when files are the same size, rsync has several options:
--size-only: never transfer files
--ignore-times: always transfer files
--checksum: calculate checksums and transfer files if they differ

The behavior that you want would be a combination of the last two: "if timestamps differ, calculate checksums and transfer files if the checksums differ as well". This is not currently an option in rsync.
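Spelled out, the wanted per-file rule would look something like this sketch (a rough illustration using GNU coreutils stat and md5sum, not how rsync itself is implemented; paths are placeholders):

    # Returns success (0) when the file should be transferred.
    needs_transfer() {
      local src=$1 dst=$2
      [ -e "$dst" ] || return 0                                          # missing on destination -> transfer
      [ "$(stat -c %s "$src")" != "$(stat -c %s "$dst")" ] && return 0   # size differs -> transfer
      [ "$(stat -c %Y "$src")" = "$(stat -c %Y "$dst")" ] && return 1    # same mtime -> skip
      # same size, different mtime: only now pay for a checksum
      [ "$(md5sum < "$src")" != "$(md5sum < "$dst")" ]
    }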
Unfortunately, looking at the rsync source code, it appears it would be non-trivial to add this functionality. Currently, if checksums are used, the remote rsync gathers size, timestamp and checksum information and sends them all together. The desired behavior would require that the remote rsync first send over the size and timestamp, and, when the local rsync determines that a checksum is needed, return to the file to get the checksum. But the whole "remote rsync returns to the file" aspect is not present in the current code, and would first need to be written.
When you run an actual transfer, the second step can effectively be done during the transfer process: transferring files that do not differ is very efficient, so the default behaviour of rsync would suffice. When using --dry-run, the best approach would probably be to run rsync with the default behaviour first, gather the --dry-run output, and then run rsync again with --checksum on the files found in the first run.
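A sketch of that two-pass idea, assuming a push from src/ to dst/ (both placeholders) and an rsync new enough to support --out-format and --files-from:

    # Pass 1: default quick check (size + mtime); only list what would be sent:
    rsync -a --dry-run --out-format='%n' src/ dst/ > candidates.txt

    # Pass 2: re-examine only those candidates with full checksums
    # (drop --dry-run to actually transfer; directory names may show up in the list too):
    rsync -a --dry-run --checksum --files-from=candidates.txt src/ dst/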
Upvotes: 25
Reputation: 583
The short answer... it does.
same time and same size ► skip file (no transfer, no checksum)
Good and fast, but not exact; rsync offers that by default. A file could be modified while its time and size stay the same (times can be reset). You can use -c if paranoid.
different sizes ► transfer file (no checksum)
Simplistic... what if it's a 2 GB file and the only difference is one line at the end? The checksum can figure that out and spare the network traffic. You can use -c if you trust the time/size comparison.
different times and same size ► perform checksum ► transfer only if checksums differ
Of course.
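If in doubt, a dry run with --itemize-changes shows why each file would be sent (src/ and dst/ are placeholders):

    # 's' in the output means the size differs, 't' means the modification time differs:
    rsync -a --dry-run --itemize-changes src/ dst/
    # Paranoid variant: let full checksums decide instead of time/size:
    rsync -a --dry-run --itemize-changes --checksum src/ dst/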
I don't see it mentioned, but I remember rsync used to have an issue when there were over ... I think it was around 130,000 files. Maybe that issue has been fixed.
If you do have that many files in one directory you probably have bigger problems... spread them out over different directories and run multiple rsyncs over those directories.
Lots of small files cause a lot of internal fragmentation on most filesystems, and you might be better off archiving the files and rsyncing the archive... you need an archiver that allows updating the archive rather than re-creating it every time.
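A rough sketch of that with GNU tar's update mode, which only works on uncompressed archives (names and the remote are placeholders):

    # Append only files newer than their copies already in the archive
    # (note: -u appends, so the archive grows until you re-create it):
    tar -uf bundle.tar smallfiles/
    # Sync the single archive; rsync's delta transfer keeps the update cheap:
    rsync -a bundle.tar user@host:backups/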
Maybe, if not many of these files are updated... find the ones changed after a given date (find -newer file) and then rsync just those files (if you trust the times).
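Something along these lines, for example (paths and the remote are placeholders; assumes a marker file and rsync's --files-from):

    # Collect files modified since the last sync marker, sync just those,
    # then bump the marker (create .last-sync with touch before the first run):
    cd /path/to/src &&
      find . -type f -newer .last-sync -print > /tmp/changed.list &&
      rsync -a --files-from=/tmp/changed.list . user@host:/path/to/dst/ &&
      touch .last-sync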
Why was this question ignored so long?
Upvotes: 5