user4918296


Force rsync to compare local files byte by byte instead of checksum

I have written a Bash script to backup a folder. At the core of the script is an rsync instruction

rsync -abh --checksum /path/to/source /path/to/target

I am using --checksum because I want to rely on neither file size nor modification time to determine if a file in the source path needs to be backed up. However, most -- if not all -- of the time I run this script locally, i.e., with an external USB drive attached which contains the backup destination folder; no backup over network. Thus, there is no need for a delta transfer since both files will be read and processed entirely by the same machine. Calculating the checksums even introduces a slowdown in this case. It would be better if rsync would just diff the files if they are both stored locally.

After reading the manpage I stumbled upon the --whole-file option which seems to avoid the costly checksum calculation. The manpage also states that this is the default if source and destination are local paths.

So I am thinking of changing my rsync statement to

rsync -abh /path/to/source /path/to/target

Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up? I definitely do not want to rely on file size or modification times to decide if a backup should take place.

UPDATE

Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced. So blindly rsync'ing all files in the source folder, e.g., by supplying --ignore-times as suggested in the comments, is not an option. It would create too many duplicate files and waste storage space. Also keep in mind that I am trying to reduce backup time and workload on a local machine. Just backing up everything would defeat that purpose.

So my question could be rephrased as, is rsync capable of doing a file comparison on a byte by byte basis?

Upvotes: 7

Views: 9523

Answers (2)

Erki Aring

Reputation: 2096

There is no way to make rsync compare files byte by byte instead of by checksum in the way you are expecting.

rsync works by creating two processes, a sender and a receiver, that build a list of files and their metadata and decide between them which files need to be updated. This happens even for local files, but in that case the processes communicate over a pipe rather than a network socket. Once the list of changed files is decided, the changes are sent as a delta or as whole files.

Theoretically, one side could send whole files along with the file list so the other side could diff them, but in practice this would be rather inefficient in many cases. The receiver would need to keep these files in memory in case it detects that a file needs updating, or otherwise the changed files would have to be sent again. None of the possible solutions here sounds very efficient.
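
For illustration only, a byte-by-byte comparison can be done outside rsync with cmp. The sketch below is not an rsync feature: changed_files is a hypothetical helper name, it assumes the destination mirrors the source tree, and it does not handle file names that contain newlines.

```shell
# Sketch only: print files under SRC that are missing from DST or differ
# from their DST counterpart byte for byte.
# changed_files is a hypothetical helper, not part of rsync.
# Limitation: file names containing newlines are not handled.
changed_files() (
    src=$1
    dst=$2
    # List regular files relative to the source root.
    (cd "$src" && find . -type f) |
    while IFS= read -r rel; do
        rel=${rel#./}
        # cmp -s is silent and exits 0 only if both files are byte-identical.
        if ! cmp -s "$src/$rel" "$dst/$rel" 2>/dev/null; then
            printf '%s\n' "$rel"
        fi
    done
)
```

The resulting list could then be handed to rsync with --files-from together with --ignore-times, so that only files that really differ are transferred and backed up. Note that this reads every file in full on both sides, so it saves checksum computation but not I/O.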

There is a good overview about (theoretical) mechanics of rsync: https://rsync.samba.org/how-rsync-works.html

Upvotes: 1

hmedia1

Reputation: 6200

Question: is rsync capable of doing a file comparison on a byte by byte basis?

Strictly speaking, Yes:

  • It's a block by block comparison, but you can change the block size.
  • You could use --block-size=1 (but it would be unreasonably inefficient and inappropriate for basically every use case).

The block based rolling checksum is the default behavior over a network.

Use the --no-whole-file option to force this behavior locally. (see below)

Statement 1. Calculating the checksums even introduces a speed down in this case.

This is why it's off by default for local transfers.

Using the --checksum option forces a read of the entire file on both sides, as opposed to the default block-by-block delta-transfer checksum checking.

Statement 2. Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up?

By default it will use size & modification time.

You can use a combination of --size-only, --(no-)ignore-times, --ignore-existing and
--checksum to modify this behavior.
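
To make the default quick check concrete, here is a rough emulation of it in shell. This is a sketch, not rsync's actual code: quick_check_same is a hypothetical helper, it compares modification times to whole seconds, and it assumes GNU stat (the -c option).

```shell
# Sketch of the default quick check: two files count as unchanged when
# their size and modification time match.
# quick_check_same is a hypothetical helper; `stat -c` assumes GNU
# coreutils (BSD/macOS stat uses a different syntax).
quick_check_same() {
    # %s = size in bytes, %Y = modification time in seconds since the epoch.
    [ "$(stat -c '%s %Y' "$1")" = "$(stat -c '%s %Y' "$2")" ]
}
```

Two files with different content but the same size and timestamp pass this check, which is exactly the case --checksum or --ignore-times is meant to catch.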

Statement 3. I definitely do not want to rely on file size or modification times to decide if a backup should take place.

Then you need to use --ignore-times and/or --checksum

Statement 4. supplying --ignore-times as suggested in the comments, is not an option

Perhaps using --no-whole-file and --ignore-times is what you want, then? This forces the use of the delta-transfer algorithm for every file, regardless of timestamp or size.

In my opinion, you would only ever use this combination of options if it were critical to avoid meaningless writes and you had reason to believe that files with identical modification stamps and byte sizes could indeed differ. The goal has to be avoiding the meaningless writes themselves, not making the system more efficient, since a delta-transfer is not actually more efficient for local files.

I fail to see how modification stamp and size in bytes is anything but a logical first step in identifying changed files.

If you compared the following two files:

  • File 1 (local) : File.bin - 79776451 bytes and modified on the 15 May 07:51
  • File 2 (remote): File.bin - 79776451 bytes and modified on the 15 May 07:51

The default behaviour is to skip these files. If you're not satisfied that the files should be skipped and want them compared, you can force a block-by-block comparison and differential update of these files using --no-whole-file and --ignore-times.

So the summary on this point is:

  1. Use the default method for the most efficient backup and archive
  2. Use --ignore-times and --no-whole-file to force delta-change (block by block checksum, transferring only differential data) if for some reason this is necessary
  3. Use --checksum and --ignore-times to be completely paranoid and wasteful.
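
As shell, the three variants above might look like the following. The function names are hypothetical wrappers made up for this sketch; the rsync options are the ones named in the list, combined with the -abh from the question.

```shell
# Hypothetical wrapper names; the rsync flags are the ones named in the list.

# 1. Default: skip files whose size and modification time match.
backup_default() { rsync -abh "$1" "$2"; }

# 2. Visit every file and use the delta-transfer algorithm, even locally.
backup_delta() { rsync -abh --ignore-times --no-whole-file "$1" "$2"; }

# 3. Compare whole-file checksums for every file.
backup_paranoid() { rsync -abh --checksum --ignore-times "$1" "$2"; }
```

Adding --dry-run and --itemize-changes to any of these is a cheap way to see what would be updated before anything is written.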

Statement 5. Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced

Yes, but this can work however you want it to. It doesn't necessarily mean a full backup every time a file is updated, and it certainly doesn't mean that a full transfer will take place at all.

You can configure rsync to:

  • Keep 1 or more versions of a file
  • Configure it with a --backup-dir to be a full incremental backup system.

Doing it this way doesn't waste space other than what is required to retain differential data. I can verify that in practice, as there would not be nearly enough space on my backup drives for all of my previous versions to be full copies.
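
A minimal sketch of the --backup-dir idea, assuming one history directory per day. The function name, the third argument and the date-based layout are made up for illustration; --backup and --backup-dir are the real rsync options.

```shell
# Sketch: move replaced files into a dated directory instead of leaving
# suffixed copies next to the originals.
# backup_with_history and the directory layout are assumptions.
backup_with_history() (
    src=$1
    dst=$2
    # One history directory per day, e.g. <history_root>/2024-05-15
    history_dir=$3/$(date +%Y-%m-%d)
    rsync -ah --backup --backup-dir="$history_dir" "$src" "$dst"
)
```

Called as backup_with_history /path/to/source /path/to/target /path/to/history, only files that rsync actually replaces end up in the dated directory.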


Some Supplementary Information


Why is Delta-transfer not more efficient than copying the whole file locally?

Because you're not tracking the changes to each of your files. If you actually had a delta file, you could merge just the changed bytes, but you need to know what those changed bytes are first. The only way to know this is by reading the entire file.

For example:

  • I modify the first byte of a 10MB file.
  • I use rsync with delta-transfer to sync this file
  • rsync immediately sees that the first byte (or a byte within the first block) has changed, and proceeds to change just that block (in the destination file itself with --inplace; by default it writes a new copy and moves it into place)
  • However, rsync doesn't know it was only the first byte that's changed. It will keep checksumming until the whole file is read
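
The same point can be seen with cmp, outside rsync. In this sketch (count_diff_bytes is a hypothetical helper name), finding out that only one byte differs still means reading both files to the end:

```shell
# Sketch: count how many byte positions differ between two same-sized files.
# cmp -l prints one line per differing byte, and it has to read both files
# to the end to do so. count_diff_bytes is a hypothetical helper name.
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}
```

For a 10MB file with a single modified byte this reports one difference, but only after both copies have been read in full.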

For all intents and purposes:

  • Consider rsync a tool that conditionally performs a --checksum based on whether or not the file timestamp or size has changed. Overriding this to --checksum is essentially equivalent to --no-whole-file and --ignore-times, since both will:
    • Operate on every file, regardless of time and size
    • Read every block of the file to determine which blocks to sync.

What's the benefit then?

The whole thing is a tradeoff between transfer bandwidth, and speed / overhead.

  • --checksum is a good way to only ever send differences over a network
  • --checksum while ignoring files with the same timestamp and size is a good way to both only send differences over a network, and also maximize the speed of the entire backup operation

Interestingly, it's probably much more efficient to use --checksum as a blanket option than it would be to force a delta-transfer for every file.

Upvotes: 7
