Reputation: 119
I'm working on an academic paper on some bioinformatics work (and I'll cite it as asked by the author ;) ) and I need to speed up my bash script. It's basically a bash script which runs a loop to iterate over files and look for strings with awk.
I followed the manual and used parallel -a ./script.sh. I had issues with a variable, so I changed it to -q, and now the script does not seem to start at all, although I get no error message.
I'm probably doing something wrong, but I don't see what. Previously I passed input with ::: because I had an input file, but this script does not take any.
The script:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
bam_directory="../bam/"
for ID_file in ${files_chrM_ID}
do
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target | awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
done
done
and my command:
parallel -q ./script.sh
Upvotes: 1
Views: 1072
Reputation: 33685
GNU Parallel is not magic: You cannot tell it to parallelize any script.
Instead you need to tell it what to parallelize and how.
In general, think of it this way: you have to generate a list of commands that you want run in parallel, and then give this list to GNU Parallel.
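For example, a minimal sketch of that idea (samtools index is just a placeholder command; when given no command argument, GNU Parallel runs each line from stdin as a command):
# build the list of commands, then hand the list to GNU Parallel to run
for f in *.bam; do echo "samtools index $f"; done | parallel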
In the script you have two for loops and a pipe. All three can be parallelized by using GNU Parallel. It is, however, not certain that it will make sense: there is an overhead to parallelizing, and if the current implementation already utilizes the CPU and disk resources optimally, then you will not see a speedup from parallelizing.
A for loop like this
for x in x-value1 x-value2 x-value3 ... x-valueN; do
# do something to $x
done
is parallelized by:
myfunc() {
x="$1"
# do something to $x
}
export -f myfunc
parallel myfunc ::: x-value1 x-value2 x-value3 ... x-valueN
A pipe in the form of A | B | C, where B is slow, is parallelized by:
A | parallel --pipe B | C
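For example, if B is a slow grep (a minimal sketch; access.log and the ERROR pattern are just placeholders):
# stdin is split into blocks on line boundaries; one grep runs per block
cat access.log | parallel --pipe grep ERROR > errors.log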
So start by identifying the bottleneck. For this, top is really useful. If you see a single process running at 100% in top, that is a good candidate for parallelizing.
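For example (assuming the procps-ng top found on most Linux systems):
# sort processes by CPU usage; a single process pinned at 100% is a candidate
top -o %CPU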
If not, then you may be limited by how fast your disk is, and that can rarely be sped up by GNU Parallel.
You have not included test data, so I cannot run your script and identify the bottleneck for you. But I have experience with samtools, and samtools view was always the bottleneck in my scripts. So let us assume that is also the case here.
samtools ... | awk ...
This does not fit the A | B | C template where B is slow, so we cannot use parallel --pipe to speed that up. If, however, awk is the bottleneck, then we can use parallel --pipe.
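If awk did turn out to be the bottleneck, an untested sketch could look like this (note the -q, which keeps GNU Parallel from mangling the quoted awk program, and -k, which keeps the output in input order; each awk instance re-reads $ID_file in its BEGIN block):
samtools view -@ 6 "$bam_file_target" |
parallel -q -k --pipe awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> "$out_file"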
So let us instead look at the two for loops.
It is easy to parallelize the outer loop:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
do_chrM() {
ID_file="$1"
bam_directory="../bam/"
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target | awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
done
}
export -f do_chrM
parallel do_chrM ::: ${files_chrM_ID}
This is great if there are more ${files_chrM_ID} than there are CPU threads. But if that is not the case, we also need to parallelize the inner loop.
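A quick way to check (a sketch, assuming GNU coreutils):
ls concat_chrM_* | wc -l   # number of outer-loop jobs
nproc                      # number of CPU threads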
This is slightly trickier, because we need to export a few variables to make them visible to do_bam, which is called by parallel:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
do_chrM() {
ID_file="$1"
bam_directory="../bam/"
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
# We need to export $sample and $ID_file to make them visible to do_bam()
export sample
export ID_file
echo "$(date +%H:%I:%S) $sample is being treated"
do_bam() {
bam_file_target="$1"
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target |
awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
}
export -f do_bam
parallel do_bam ::: "${bam_directory}"*"${sample}"*".bam"
}
export -f do_chrM
parallel do_chrM ::: ${files_chrM_ID}
This, however, may overload your server: the inner parallel does not communicate with the outer parallel, so if you run this on a 64-core machine you risk running 64*64 jobs in parallel (but only if there are enough files matching concat_chrM_* and "${bam_directory}"*"${sample}"*".bam").
In that case it will make sense to limit the outer parallel to 1 or 2 jobs in parallel:
parallel -j2 do_chrM ::: ${files_chrM_ID}
This will at most run 2*64 jobs in parallel on a 64-core machine.
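One could also cap the inner parallel so that outer times inner never exceeds the thread count (a variation, not shown in the templates above):
parallel -j2 do_chrM ::: ${files_chrM_ID}
# ...and inside do_chrM, cap the inner call correspondingly:
# parallel -j32 do_bam ::: "${bam_directory}"*"${sample}"*".bam"
# 2 outer * 32 inner = at most 64 jobs on a 64-thread machine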
If, however, you want to run 64 jobs in parallel all the time then it becomes quite a bit trickier: It would have been fairly simple if the values of the inner loop did not depend on the outer loop, because then you could simply have done something like:
parallel do_stuff ::: chrM_1 ... chrM_100 ::: bam1.bam ... bam100.bam
which would generate all combinations of chrM_X and bamY.bam and run those in parallel, 64 at a time on a 64-core machine.
But in your case the values in the inner loop do depend on the values in the outer loop. This means you need to compute the values before starting any jobs. This also means you cannot have your script output information in the outer loop.
#!/bin/bash
sam_awk() {
bam_file_target="$1"
sample="$2"
ID_File="$3"
echo "$(date +%H:%I:%S) $ID_file is being treated"
echo "$(date +%H:%I:%S) $sample is being treated"
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target |
awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
}
export -f sam_awk
files_chrM_ID="concat_chrM_*"
bam_directory="../bam/"
for ID_file in ${files_chrM_ID}
do
# Moved to inner
# echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
# Moved to inner
# echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo "$bam_file_target"
echo "$sample"
echo "$ID_File"
done
done | parallel -n3 sam_awk
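Here parallel -n3 consumes the echoed values three at a time, so each sam_awk job receives bam_file_target, sample, and ID_file as $1, $2, and $3.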
Since you have not given us any test data, I cannot test whether these scripts will actually run, so there may be errors in them.
If you have not already done so, read at least chapters 1+2 of "GNU Parallel 2018" (available at http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or as a download at https://doi.org/10.5281/zenodo.1146014)
It should take you less than 20 minutes and your command line will love you for it.
Upvotes: 2