Stefaan

Reputation: 4916

Rsync with --checksum from local to local?

I'll first try to situate the problem a bit. We have a project that is built into a large tree of files. The build is several hundred MB and contains lots of (smallish) files, only a small fraction of which change between builds. We want to preserve a bit of history of these builds, and to do this efficiently we want to hardlink files that don't change between builds. For this we use rsync (as the more powerful brother of cp), from a local source to a local target, with the --link-dest option for doing the hardlinking magic.

This works fine for incremental builds: most files are not touched and rsync does the hardlink trick correctly. With full recompile builds (which we have to do for reasons that are not relevant here), things don't seem to work as expected. Because of the recompile, all files get a fresh timestamp, but content-wise, most files are still the same as the previous build. But even though we use rsync with the --checksum option (so rsync "syncs"/hardlinks based on content, not filesize+timestamp), nothing gets hardlinked anymore.

Illustration

I tried to isolate/illustrate the problem with this simple (bash) script:

echo "--- Start clean"
rm -fr src build*

echo "--- Set up src"
mkdir src
echo hello world > src/helloworld.txt

echo "--- First copy with src as hardlink reference"
rsync -a --checksum --link-dest="$(pwd)/src" src/ build1/

echo "--- Second copy with first copy as hardlink reference"
rsync -a --checksum --link-dest="$(pwd)/build1" src/ build2/

echo "--- Result (as expected)"
ls -ali src/helloworld.txt build*/helloworld.txt

echo "--- Sleep to have reasonable timestamp differences"
sleep 2

echo "--- 'Remake' src, but with same content"
rm -fr src/helloworld.txt
echo hello world > src/helloworld.txt

echo "--- Third copy with second copy as hardlink reference"
rsync -a --checksum --link-dest="$(pwd)/build2" src/ build3/
# Using --modify-window=10 gives results as expected
# rsync -a --modify-window=10 --link-dest="$(pwd)/build2" src/ build3/

echo "--- Final result (not as expected)"
ls -ali src/helloworld.txt build*/helloworld.txt

The first result is as expected: all three copies are hardlinked (same inode):

30157018 -rw-r--r--  3 stefaan  staff  12 May 10 01:28 build1/helloworld.txt
30157018 -rw-r--r--  3 stefaan  staff  12 May 10 01:28 build2/helloworld.txt
30157018 -rw-r--r--  3 stefaan  staff  12 May 10 01:28 src/helloworld.txt

The final result is not as expected/desired:

30157018 -rw-r--r--  2 stefaan  staff  12 May 10 01:28 build1/helloworld.txt
30157018 -rw-r--r--  2 stefaan  staff  12 May 10 01:28 build2/helloworld.txt
30157026 -rw-r--r--  1 stefaan  staff  12 May 10 01:28 build3/helloworld.txt
30157024 -rw-r--r--  1 stefaan  staff  12 May 10 01:28 src/helloworld.txt

The third copy, build3/helloworld.txt, is not hardlinked to the one from build2, even though the content is the same and the checksum comparison should detect this.

Question

Does anybody have an idea what is wrong here? Is my expectation wrong? Or is rsync ignoring the --checksum option when syncing from local to local, for example because it knows that looking at inode numbers is smarter than spending time on checksums?

Upvotes: 5

Views: 6904

Answers (1)

neonsignal

Reputation: 31

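The issue is that the '-a' flag forces the modification times to be preserved (it implies '-t'). With '--link-dest', rsync only hardlinks a file when it is identical to the reference copy in all preserved attributes, so a file with the same content but a fresh timestamp is copied instead of linked.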

If you use '-rlpgo' instead (or follow the '-a' with '--no-times'), the modification times are no longer preserved, so they no longer have to match and the inode will be shared. You will still have to specify either '--size-only' or '--checksum' (the latter is obviously safer), so that rsync doesn't decide what has changed based on the file times.

The documentation doesn't distinguish clearly between which flags are used to trigger updates, and which are used to control the preservation of attributes.

Upvotes: 3
