Reputation: 193
Rsync is one of the first things we learn when we get into Linux. I've been using it forever to move files around.
At my current job, we manage petabytes of data, and we constantly have to move HUGE amounts of data around on a daily basis.
I was shown a source folder called a/ that has 8.5GB of data, and a destination folder called b/ (a/ is a remote mount, b/ is local on the machine).
My simple command took a little over 2 minutes:
rsync -avr a/ b/
Then, I was shown that the following multi-threaded approach took about 7 seconds (in this example, 10 threads were used):
cd a; ls -1 | xargs -n1 -P10 -I% rsync -ar % b/
Because of the huge time savings, every time we have to copy data from one place to another (which happens almost daily), I'm required to over-engineer a simple rsync into a multi-threaded one, similar to the second example above.
This section explains why I can't just use the example above every time; it can be skipped.
The reason I have to over-engineer it, and the reason I can't just always run cd a; ls -1 | xargs -n1 -P10 -I% rsync -ar % b/
every time, is that there are cases where the folder structure looks like this:
jeff ws123 /tmp $ tree -v
.
└── a
└── b
└── c
├── file1
├── file2
├── file3
├── file4
├── file5
├── file6
├── file7
├── file8
├── file9
├── file10
├── file11
├── file12
├── file13
├── file14
├── file15
├── file16
├── file17
├── file18
├── file19
└── file20
I was told that since a/ has only one thing in it (b/), this wouldn't really use 10 threads but rather just 1, as there is only one file/folder at the top level.
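To illustrate with the tree above: ls -1 at the top of a/ prints only one entry, so xargs only ever has one item to hand out, no matter how many files sit underneath:
cd a && ls -1            # prints just "b", so xargs -P10 runs a single rsync
find . -type f | wc -l   # 20 files, but they are all nested deeper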
It's starting to feel like 40% of my job is racking my brain over case-specific "efficient" rsyncs, and I just feel like I'm doing it all wrong. Ideally, I could just run something like rsync source/ dest/ --threads 10
and let rsync do the hard work.
Am I looking at all this the wrong way? Is there a simple way to copy data with multiple threads in a single line, similar to the hypothetical command above?
Thanks ahead!
Upvotes: 4
Views: 18616
Reputation: 27205
If nearly all of the files are very big, you can try the following to make better use of your extremely fast network:
( cd a/ && find . -type f -print0 | xargs -0 -P10 -I% rsync -avR % ../b/; )
Here we used cd a/ and -R/--relative to preserve the paths. Because of the cd, we had to adapt the relative destination to ../b/.
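For example, for a hypothetical file a/sub/dir/file, -R keeps the path as given on the command line, so the directory structure is recreated under b/:
( cd a/ && rsync -avR ./sub/dir/file ../b/ )    # creates b/sub/dir/file, not just b/file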
However, if there are lots of small files too, this will most likely be slower than a plain rsync -a a/ b/, as we start a new process for each of the small files. Also, this won't work with options like --delete, since find cannot list files that are already deleted.
So in general, I'd recommend transferring the big files first in parallel, and then running one final rsync for the small files and optional things like --delete:
( cd a/ && find . -type f -size +1G -print0 | xargs -0 -P10 -I% rsync -avR % ../b/; )
rsync -av a/ b/
To make this more usable, you can write a function or script. This one also handles the difference between src and src/ and allows adding additional options after dest.
Example usage: prsync src/ dest/ -v --delete
#! /bin/bash

# True if the part before the first slash ends with a colon (rsync remote syntax like host:/path)
isremote() {
    [[ "${1%%/*}" == *: ]]
}

prsync() {
    local src="$1" dest="$2"
    shift 2 || { echo "missing arguments" >&2; return 1; }
    isremote "$src" && { echo "cannot handle remote source" >&2; return 1; }
    (
        # We cd into the source, so turn a relative local destination into an absolute one
        isremote "$dest" || [[ "$dest" == /* ]] || dest="$PWD/$dest"
        # Mimic rsync's trailing-slash rule: "src" (no slash) copies into dest/src
        [[ "$src" == */ ]] || dest="$dest/${src##*/}"
        # Transfer the big files in parallel, preserving their relative paths (-R)
        cd "$src" &&
            find . -type f -size +1G -print0 |
            xargs -0 -P10 -I% rsync -aR "$@" % "$dest"
    ) &&
        rsync -a "$@" "$src" "$dest"   # final pass: small files and options like --delete
}
To speed this up a bit, consider adding --max-size=1G to the last rsync. However, this might be dangerous, as I don't know if find -size +1G and rsync --max-size=1G use the same notion of "size", especially for sparse files and compressed file systems.
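If you do want to try it, the final rsync would become something like this (a sketch; verify that no file right at the 1G boundary is skipped by both passes):
rsync -av --max-size=1G a/ b/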
Upvotes: 5