Reputation: 233
Scenario: an S3 bucket has 1000 files. I have two machines, and each machine has two drives, /dev/sda and /dev/sdb. Constraints: no single drive can hold all 1000 files, and neither machine can hold all 1000 files. Desired outcome: distribute the 1000 files across the 4 drives on the two machines using GNU parallel.
I tried things like:
parallel --xapply --joblog out.txt -S:,R echo {1} {2} ::: "/dev/sda" "/dev/sdb" ::: {0..10}
But I get:
Seq Host Starttime      JobRuntime Send Receive Exitval Signal Command
2   :    1414040436.607 0.037      0    0       0       0      echo /dev/sda 1
4   :    1414040436.615 0.030      0    0       0       0      echo /dev/sda 3
6   :    1414040436.623 0.024      0    0       0       0      echo /dev/sda 5
8   :    1414040436.632 0.015      0    0       0       0      echo /dev/sda 7
10  :    1414040436.640 0.006      0    0       0       0      echo /dev/sda 9
1   R    1414040436.603 0.088      0    0       0       0      echo /dev/sdb 0
3   R    1414040436.611 0.092      0    0       0       0      echo /dev/sdb 2
5   R    1414040436.619 0.095      0    0       0       0      echo /dev/sdb 4
7   R    1414040436.628 0.095      0    0       0       0      echo /dev/sdb 6
9   R    1414040436.636 0.096      0    0       0       0      echo /dev/sdb 8
11  R    1414040436.645 0.094      0    0       0       0      echo /dev/sdb 10
Where 'R' is the remote host's IP. How do I distribute the files (I have all the names in a file) from S3 to the 4 drives? Thank you.
Upvotes: 3
Views: 153
Reputation: 33685
GNU Parallel is good at starting a new job when an old one has finished: it assigns jobs to servers on the fly, not beforehand.
What you are looking for is a way to do this beforehand.
Your --xapply approach seems sound, but you need to force GNU Parallel to distribute the jobs evenly across the hosts. Your current approach depends on how fast each host finishes, and that will not work in general.
So something like:
parallel echo {1}//{2} ::: sda sdb ::: server1 server2 | parallel --colsep '//' --xapply echo copy {3} to {1} on {2} :::: - filenames.txt
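To see what that does before running anything real, here is a quick dry run (a sketch; the tiny test filenames.txt and the -k flag, which only keeps the output in job order for readability, are mine and not part of the answer):

# small test list, for illustration only
printf 'file%s\n' {0..5} > filenames.txt
# first stage emits the 4 drive//server combinations,
# second stage pairs them round-robin with the filenames
parallel -k echo {1}//{2} ::: sda sdb ::: server1 server2 | parallel -k --colsep '//' --xapply echo copy {3} to {1} on {2} :::: - filenames.txt
# copy file0 to sda on server1
# copy file1 to sda on server2
# copy file2 to sdb on server1
# copy file3 to sdb on server2
# copy file4 to sda on server1
# copy file5 to sda on server2

Because --xapply recycles the shorter input source, the 4 drive/server pairs wrap around until every filename is consumed, which is what spreads the 1000 files evenly.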
Or:
parallel --xapply echo copy {3} to {1} on {2} ::: sda sda sdb sdb ::: server1 server2 server1 server2 :::: filenames.txt
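Once the pairing looks right, the echo can be swapped for a real transfer. A minimal sketch of that last form, assuming the drives are mounted at /mnt/sda and /mnt/sdb on each server, the AWS CLI is installed there, the bucket is called my-bucket, and the filenames contain no spaces (all of these are assumptions, not part of the question):

# hypothetical: pull each file from S3 straight onto its assigned drive/server
parallel --xapply ssh {2} "aws s3 cp s3://my-bucket/{3} /mnt/{1}/{3}" ::: sda sda sdb sdb ::: server1 server2 server1 server2 :::: filenames.txt

The --joblog out.txt option from the question can be added back to record which copies succeeded.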
Upvotes: 1