Reputation: 675
In order to save space on my backup disk, I want to "mothball" the data files that can be easily regenerated and thus don't need to be backed up.
Currently, I'm using GNU parallel to split a large nested for-loop across many cores, with each process working on a different combination of input arguments.
# PARALLEL COMMAND CALLING mothballer.sh WITH INPUT ARGUMENTS
# (note: --max-procs is an alias for -j, so only one of the two is needed)
time parallel --max-procs 8 "./mothballer.sh {1} {2} {3} {4} {5}" ::: {date1,date2} ::: {exp1,exp2} ::: {2,4,8} ::: {16,32,64} ::: {1,2,3,4,5}
...which expands the argument lists into their Cartesian product (2 × 2 × 3 × 3 × 5 = 180 combinations) and passes each combination to the following script, "mothballer.sh":
#!/bin/bash
# read the command line arguments
date=$1
experiment=$2
parameter1=$3
parameter2=$4
trial=$5
# paths to the original directory and a mirror directory on the backup server
WORK_DIR=/$WORK_MACHINE/${date}/${experiment}/${parameter1}/${parameter2}/${trial}/results
BACKUP_DIR=/$BACKUP_SERVER/${date}/${experiment}/${parameter1}/${parameter2}/${trial}/results
# create the mirror directory on the backup server (quoted in case of odd characters)
mkdir -p "$BACKUP_DIR"
# do the backup ("rsync" is similar to "cp")
rsync -avP "$WORK_DIR"/*.csv "$BACKUP_DIR"
# TODO: run rsync again to verify it worked; "rm" the old files.
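A sketch of how that verify-then-remove step could look (my assumption, reusing the variables above, not code I have in place yet): re-run rsync with --checksum and --dry-run and only delete the sources if nothing would be re-transferred; alternatively, rsync's --remove-source-files deletes each source file once it has transferred successfully.
# verify: --checksum compares file contents; --dry-run only reports differences
if [ -z "$(rsync -ai --checksum --dry-run "$WORK_DIR"/*.csv "$BACKUP_DIR")" ]; then
    rm "$WORK_DIR"/*.csv
fi
# or, in a single step:
rsync -avP --remove-source-files "$WORK_DIR"/*.csv "$BACKUP_DIR"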
Is there a better way to do this? For example, using "find"?
EDIT: Also, it would be nice to be able to use the '*' wildcard, because not all experiments have the same parameter combinations (i.e. the directories are equally deep but have different folder names). This is the biggest limitation of my current method (above).
Upvotes: 3
Views: 260
Reputation: 33685
If the command line is not too long:
time parallel ./mothballer.sh ::: */*/*/*/*
In mothballer.sh, '${date}/${experiment}/${parameter1}/${parameter2}/${trial}' will arrive merged into $1.
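For example, mothballer.sh could then shrink to something like this (a sketch, assuming parallel is started from the top of the work tree so the glob yields relative paths):
#!/bin/bash
rel=$1   # e.g. date1/exp1/2/16/3, one match of the glob
WORK_DIR=/$WORK_MACHINE/$rel/results
BACKUP_DIR=/$BACKUP_SERVER/$rel/results
mkdir -p "$BACKUP_DIR"
rsync -avP "$WORK_DIR"/*.csv "$BACKUP_DIR"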
If the depth varies (zsh, or bash ≥ 4 with globstar enabled):
shopt -s globstar
time parallel ./mothballer.sh ::: **/results
In mothballer.sh, '${date}/${experiment}/${parameter1}/${parameter2}/${trial}/results' will arrive merged into $1 (note that the trailing results component is now included).
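Since the question mentions find: a find pipeline is another depth-independent option that needs no shell options (a sketch, run from the top of the work tree; -print0 and -0 keep unusual directory names safe):
find . -type d -name results -print0 | parallel -0 ./mothballer.sh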
Upvotes: 2