Reputation: 11669
I am using `scp` to copy files in parallel with GNU parallel in my shell script below, and it is working fine. I am not sure how I can use `rsync` in place of `scp` in this script. I am trying to see whether `rsync` will give better transfer speed than `scp`.
Below is my problem description:

I am copying files from machineB and machineC into machineA, and I am running the shell script below on machineA. If a file is not on machineB, then it will definitely be on machineC, so I try copying each file from machineB first and, if it is not there, copy the same file from machineC.
I am copying the files in parallel using GNU Parallel library and it is working fine. Currently I am copying five files in parallel both for PRIMARY and SECONDARY.
Below is my shell script which I have -
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
el=$1
PRIMSEC=$2
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way of replacing the `scp` command above with `rsync`, while still copying 5 files in parallel for both `PRIMARY` and `SECONDARY` simultaneously?
Upvotes: 5
Views: 5213
Reputation: 1349
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(Note that the change is not only the command name: the destination is now an explicit filename rather than `.`.)
You might get some additional speed because `rsync` uses its delta-transfer algorithm, whereas `scp` blindly copies the whole file. Bear in mind that the delta-transfer only helps when an older version of the file already exists at the destination.
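Applied to the `do_Copy` function from the question, the replacement might look like this (a sketch; `FILERS_LOCATION_1`, `FILERS_LOCATION_2` and `dir3` are assumed to be exported by the surrounding script, exactly as in the question):

```shell
# do_Copy with the scp calls swapped for rsync, as suggested above.
do_Copy() {
  el=$1
  PRIMSEC=$2
  # Try machineB first; fall back to machineC if that fails.
  rsync "david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_${el}_200003_5.data" \
        "$PRIMSEC/new_weekly_2014_${el}_200003_5.data" ||
    rsync "david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_${el}_200003_5.data" \
          "$PRIMSEC/new_weekly_2014_${el}_200003_5.data"
}
export -f do_Copy   # so GNU parallel can call it
```

The rest of the script (the two `parallel --retries 10 -j 5 …` lines) stays unchanged.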
Upvotes: 0
Reputation: 5732
You can download single files with `rsync` just as you would with `scp`. Just make sure not to use the `rsync://` or `hostname::path` forms, which talk to the rsync daemon instead of going over ssh.

Parallelising can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks, or happen to be in cache, parallelising even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
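To make the distinction concrete, here is a small sketch of the address forms; only the first, `user@host:path`, runs over ssh the way `scp` does (the `fetch_one` helper and the module name are made up for illustration):

```shell
# fetch_one HOST REMOTE_PATH DEST -- pull one file over ssh with rsync.
fetch_one() {
  # host:path with a user -> remote-shell (ssh) transport, like scp:
  rsync "david@$1:$2" "$3"
  # The daemon forms below would NOT go over ssh; avoid them here:
  # rsync "rsync://$1/module$2" "$3"
  # rsync "$1::module$2" "$3"
}
```

Usage would be e.g. `fetch_one machineB /testing/snapshot/20140103/file.data /test01/primary/`.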
Upvotes: 0
Reputation: 2820
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh-based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is of lesser importance, but if you first construct a list of files for an rsync process to handle, you can pass it to rsync with the `--files-from` option.
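The batching idea might be sketched like this, assuming the same host and snapshot path as the question; the `copy_batch` helper and the list-file name are hypothetical:

```shell
# copy_batch HOST LISTFILE DEST -- fetch a whole batch of files in one
# rsync run, so the ssh connection is set up once per batch rather than
# once per file.
copy_batch() {
  rsync -av --files-from="$2" "david@$1:/testing/snapshot/20140103/" "$3"
}

# Build a list of the first few PRIMARY file names (paths in the list
# are relative to the remote source directory):
printf 'new_weekly_2014_%s_200003_5.data\n' 550 274 2 > /tmp/batch_machineB.txt
# Then fetch them all in one call:
# copy_batch machineB /tmp/batch_machineB.txt /test01/primary/
```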
You want to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (e.g. database files), or only get appended to (e.g. log files)? Can you safely rely on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?

Is your file content compressible? For example, if you're copying plain text you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, compressing again would only slow you down. rsync is mostly helpful if you have files where just part of the file changes.
Upvotes: 2
Reputation: 59118
`rsync` is designed to efficiently synchronise two hierarchies of folders and files. While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of `rsync` in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using `rsync` or `scp`, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of `rsync` for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.

In short: just run a single instance of `rsync`; any performance increase you're going to get from another approach is unlikely to be worth it.
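If pulling the whole snapshot directory is acceptable (the question only needs selected file numbers, so in practice you would add a filter such as `--include` patterns or `--files-from`), the single-instance approach might be sketched like this; the `sync_all` helper is made up for illustration:

```shell
# One rsync run per remote host instead of one per file. rsync skips
# files that are already up to date at the destination, so re-running
# this is cheap.
sync_all() {
  # Mirror everything available on machineB first:
  rsync -av david@machineB:/testing/snapshot/20140103/ /test01/primary/
  # Then fill in only the files machineB didn't have from machineC:
  rsync -av --ignore-existing david@machineC:/testing/snapshot/20140103/ /test01/primary/
}
```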
Upvotes: 7