user3289556

Reputation: 175

Parallel execution of Bash commands

I have a Bash script containing a loop. Inside the loop, a command calls another Bash script, which in turn calls Python scripts.

Each of the commands inside the loop can run independently of the others. On an actual dataset, each command takes some time to execute, so I would like to parallelize this part of the script.

I spent a few days going over the options for parallel execution in Bash that also let me choose the number of cores to use, so that I won't flood the server. Of the options I found, GNU xargs -P seemed the most reasonable, since it does not require a specific Bash version and works without installing extra libraries. However, I am having difficulty making it work, even though it seems straightforward.

#!/bin/bash

while getopts i:t: option
do
case "${option}"
in
    i) in_f=${OPTARG};;
    t) n_threads=${OPTARG};;
esac
done    

START=$(date +%s)
class_file=$in_f
classes=( $(awk '{print $1}' ./$class_file))
rm -r tree_matches.txt
n="${#classes[@]}"
for i in $(seq 0  $n);
   do
     for j in $(seq $((i+1)) $((n-1)));
         do
            echo ${classes[i]}"    "${classes[j]} >> tree_matches.txt
         done
   done
col1=( $(awk '{print $1}' ./tree_matches.txt ))
col2=( $(awk '{print $2}' ./tree_matches.txt ))


printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

n_pairs="${#col1[@]}"

END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Exec time $DIFF seconds"

You can ignore the two nested loops at the start; I just pasted the entire script for completeness. The part that is going to be parallelized is this line near the end of the script:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

This should loop over all pairs (1275 in total in my case) and ideally execute myFunction.sh in parallel, with the number of threads given by the variable $n_threads.

However, I am doing something wrong because the iterator k in that line is not indexing my two arrays ${classes[k]} and ${classes[k]}.

The loop keeps iterating 1275 times but it only indexes the first element of both arrays when I echo them. I later changed that line to this for troubleshooting:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" k

It is actually incrementing the value of k each time it loops, however when I change that line to this:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" "$((k))"

it is printing out 0, 1275 times as the value for k. I don't know what I'm doing wrong.

I actually have two arrays of the same size that are the input to the myFunction.sh script. I just want an integer index so that I can index both of them at the same time and call my function with the two corresponding values. Based on your suggestion, I modified my code as follows:

 for x in {0..10};
    do
        printf "%d\0" "$x"; done| xargs -0 -I @@ -P $n_threads sh markerGenes2TreeMatch.sh -1 ${col1[@@]}-2 ${col2[@@]}

however now when I execute the code I get the following error:

@@: syntax error: operand expected (error token is "@@")

I guess the index @@ is still a string. I just want integer indices to be generated as I loop, so that I can execute this command in parallel.

Upvotes: 1

Views: 827

Answers (3)

Ole Tange

Reputation: 33685

With GNU Parallel you could probably do:

classes=( $(awk '{print $1}' ./$class_file))
parallel markerGenes2TreeMatch.sh -1 {=1 'if($arg[1] ge $arg[2]) { skip() }' =} -2 {2} ::: ${classes[@]} ::: ${classes[@]}

or:

parallel --plus markerGenes2TreeMatch.sh -1 {1choose_k} -2 {2choose_k} ::: ${classes[@]} ::: ${classes[@]}

Then you can skip the whole generation of tree_matches.txt and of $col1/$col2.

Use parallel --embed to include GNU Parallel directly in your script, so you do not have external dependencies.

Upvotes: 0

Aaron Digulla

Reputation: 328566

This line isn't working as you think it is:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

What happens is that Bash first expands things like $n_threads and ${classes[k]} into strings and only then calls xargs. Because classes is an indexed array, its subscript is evaluated as an arithmetic expression, and the unset shell variable k evaluates to 0, so ${classes[k]} is always the first element. Writing ${classes[$k]} does not help either: the k that xargs substitutes is never a shell variable, so the shell cannot use it to look up a value in classes.
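
The expansion order can be checked with a small experiment. The following is only a sketch: the three-element array is a hypothetical stand-in for the real classes array, and echo stands in for myFunction.sh. With Bash, the subscript k of an indexed array is evaluated arithmetically and the unset shell variable k counts as 0, so the first line of the pipeline only ever hands the first element to xargs, while the literal placeholder is substituted as expected.

```shell
#!/bin/bash
# Hypothetical three-element array standing in for the real classes array.
classes=(alpha beta gamma)

# The shell expands ${classes[k]} once, before xargs starts. The subscript is
# evaluated arithmetically and the unset shell variable k counts as 0, so
# xargs receives the fixed string "alpha" for every input item.
expanded_early=$(printf '%s\0' 0 1 2 | xargs -0 -I k echo "${classes[k]}")
echo "$expanded_early"

# xargs only substitutes the placeholder where it appears literally in the
# arguments that xargs itself receives.
substituted=$(printf '%s\0' 0 1 2 | xargs -0 -I k echo "index k")
echo "$substituted"
```

The first echo prints the same element three times; the second prints the three indices, which matches the behaviour described in the question.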

Maybe a better approach would be to write the values from classes into a file and use that as input for xargs. You may have to change myFunction.sh to accept a single argument (= one line of input) and take it apart in the script.
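
A minimal sketch of that idea, under stated assumptions: the pair file, its contents, and the inline worker are all hypothetical, and the worker merely echoes its two fields where the real myFunction.sh would be called. Each input line reaches the worker as a single argument, which the worker then splits.

```shell
#!/bin/bash
# Hypothetical pair file: one "first second" pair per line.
pairs_file=$(mktemp)
printf '%s %s\n' A B A C B C > "$pairs_file"

# Stand-in for myFunction.sh: it receives one whole line as $1 and splits it
# on whitespace. Replace the echo with the real work.
worker='set -- $1; echo "first=$1 second=$2"'

n_threads=2   # assumed value; the script in the question reads it from -t

# With -I, every input line is one item, so {} is replaced by the full line.
# The output is sorted only because parallel jobs may finish in any order.
results=$(xargs -P "$n_threads" -I {} sh -c "$worker" _ {} < "$pairs_file" | sort)
echo "$results"

rm -f "$pairs_file"
```

Since the shell never has to index an array here, nothing depends on the xargs placeholder being visible to Bash.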

Upvotes: 0

jhnc

Reputation: 16662

For the line in question:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

${classes[k]} will be expanded by the shell before xargs has a chance to see it (most likely to the first element, because the unset variable k evaluates to 0 in the array subscript).

Perhaps you could reorder to:

for x in {0..1275}; do printf "%s\0" "${classes[$x]}"; done |\
xargs -0 -I @@ -P $n_threads sh myFunction.sh -1 @@ -2 @@
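
If two different values are needed per call, as with the col1 and col2 arrays mentioned in the question, the same idea can be extended by emitting both values per iteration and letting xargs group them with -n 2. This is a sketch under assumptions: the array contents and the value of n_threads are made up, and the echo stands in for `sh myFunction.sh -1 "$1" -2 "$2"`.

```shell
#!/bin/bash
# Hypothetical stand-ins for the col1/col2 arrays from the question.
col1=(A A B)
col2=(B C C)
n_threads=2   # assumed value; the script in the question reads it from -t

# Emit col1[i] and col2[i] as consecutive NUL-terminated items; -n 2 makes
# xargs pass exactly one pair to each invocation as $1 and $2.
# The output is sorted only because parallel jobs may finish in any order.
paired=$(
  for i in "${!col1[@]}"; do
    printf '%s\0%s\0' "${col1[i]}" "${col2[i]}"
  done |
  xargs -0 -n 2 -P "$n_threads" sh -c 'echo "-1 $1 -2 $2"' _ | sort
)
echo "$paired"
```

Here the indexing happens inside the Bash loop, where i is a real shell variable, and xargs only distributes the already-resolved values.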

Upvotes: 1
