Reputation: 119
I'm working on an academic paper on some bioinformatics work (and I'll cite it as asked by the author ;) ) and I need to speed up my bash script. It's basically a bash script which runs a loop to iterate over files and look for strings with awk.
I followed the manual and used parallel -a ./script.sh. I had issues with a variable, so I changed it to -q, and now the script does not seem to start at all, although I get no error message.
I'm probably doing something wrong, but I don't see what. Previously I passed input with ::: because I had an input file, but this script does not take any.
The script:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
bam_directory="../bam/"
for ID_file in ${files_chrM_ID}
do
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target | awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
done
done
and my command:
parallel -q ./script.sh
Upvotes: 1
Views: 1072
Reputation: 33685
GNU Parallel is not magic: You cannot tell it to parallelize any script.
Instead you need to tell it what to parallelize and how.
In general, think of it this way: you have to generate a list of commands that you want run in parallel, and then give this list to GNU Parallel.
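For example, a minimal sketch of that idea (samtools index is just a placeholder command; when given no command argument, GNU Parallel runs each line from stdin as a command):
# build the list of commands, then hand the list to GNU Parallel to run
for f in *.bam; do echo "samtools index $f"; done | parallel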
In the script you have two for loops and a pipe. All three can be parallelized by using GNU Parallel. It is, however, not certain that it will make sense: there is an overhead to parallelizing, and if the current implementation already utilizes the CPU and disk resources optimally, then you will not see a speedup from parallelizing.
A for loop like this
for x in x-value1 x-value2 x-value3 ... x-valueN; do
# do something to $x
done
is parallelized by:
myfunc() {
x="$1"
# do something to $x
}
export -f myfunc
parallel myfunc ::: x-value1 x-value2 x-value3 ... x-valueN
A pipe in the form of A | B | C, where B is slow, is parallelized by:
A | parallel --pipe B | C
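For example, if B is a slow grep (a minimal sketch; access.log and the ERROR pattern are just placeholders):
# stdin is split into blocks on line boundaries; one grep runs per block
cat access.log | parallel --pipe grep ERROR > errors.log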
So start by identifying the bottleneck. For this, top is really useful. If you see a single process running at 100% in top, that is a good candidate for parallelizing.
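For example (assuming the procps-ng top found on most Linux systems):
# sort processes by CPU usage; a single process pinned at 100% is a candidate
top -o %CPU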
If not, then you may be limited by how fast your disk is, and that can rarely be sped up by GNU Parallel.
You have not included test data, so I cannot run your script and identify the bottleneck for you. But I have experience with samtools, and samtools view was always the bottleneck in my scripts. So let us assume that is also the case here.
samtools ... | awk ...
This does not fit the A | B | C template where B is slow, so we cannot use parallel --pipe to speed that up. If, however, awk is the bottleneck, then we can use parallel --pipe.
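If awk did turn out to be the bottleneck, an untested sketch could look like this (note the -q, which keeps GNU Parallel from mangling the quoted awk program, and -k, which keeps the output in input order; each awk instance re-reads $ID_file in its BEGIN block):
samtools view -@ 6 "$bam_file_target" |
parallel -q -k --pipe awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> "$out_file"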
So let us instead look at the two for loops.
It is easy to parallelize the outer loop:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
do_chrM() {
ID_file="$1"
bam_directory="../bam/"
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target | awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
done
}
export -f do_chrM
parallel do_chrM ::: ${files_chrM_ID}
This is great if there are more ${files_chrM_ID} than there are CPU threads. But if that is not the case, we also need to parallelize the inner loop.
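A quick way to check (a sketch, assuming GNU coreutils):
ls concat_chrM_* | wc -l   # number of outer-loop jobs
nproc                      # number of CPU threads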
This is slightly trickier, because we need to export a few variables to make them visible to do_bam, which is called by parallel:
#!/bin/bash
files_chrM_ID="concat_chrM_*"
do_chrM() {
ID_file="$1"
bam_directory="../bam/"
echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
# We need to export $sample and $ID_file to make them visible to do_bam()
export sample
export ID_file
echo "$(date +%H:%I:%S) $sample is being treated"
do_bam() {
bam_file_target="$1"
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target |
awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
}
export -f do_bam
parallel do_bam ::: "${bam_directory}"*"${sample}"*".bam"
}
export -f do_chrM
parallel do_chrM ::: ${files_chrM_ID}
This, however, may overload your server: the inner parallel does not communicate with the outer parallel, so if you run this on a 64-core machine you risk running 64*64 jobs in parallel (but only if there are enough files matching concat_chrM_* and "${bam_directory}"*"${sample}"*".bam").
In that case it will make sense to limit the outer parallel to 1 or 2 jobs in parallel:
parallel -j2 do_chrM ::: ${files_chrM_ID}
This will at most run 2*64 jobs in parallel on a 64-core machine.
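One could also cap the inner parallel so that outer times inner never exceeds the thread count (a variation, not shown in the templates above):
parallel -j2 do_chrM ::: ${files_chrM_ID}
# ...and inside do_chrM, cap the inner call correspondingly:
# parallel -j32 do_bam ::: "${bam_directory}"*"${sample}"*".bam"
# 2 outer * 32 inner = at most 64 jobs on a 64-thread machine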
If, however, you want to run 64 jobs in parallel all the time then it becomes quite a bit trickier: It would have been fairly simple if the values of the inner loop did not depend on the outer loop, because then you could simply have done something like:
parallel do_stuff ::: chrM_1 ... chrM_100 ::: bam1.bam ... bam100.bam
which would generate all combinations of chrM_X and bamY.bam and run those in parallel, 64 at a time on a 64-core machine.
But in your case the values in the inner loop do depend on the values in the outer loop. This means you need to compute the values before starting any jobs. This also means you cannot have your script output information in the outer loop.
#!/bin/bash
sam_awk() {
bam_file_target="$1"
sample="$2"
ID_File="$3"
echo "$(date +%H:%I:%S) $ID_file is being treated"
echo "$(date +%H:%I:%S) $sample is being treated"
echo $bam_file_target // $sample
out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
echo "$out_file will be created"
echo "samtools and awk starting"
samtools view -@ 6 $bam_file_target |
awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> $out_file
echo "$out_file done."
}
export -f sam_awk
files_chrM_ID="concat_chrM_*"
bam_directory="../bam/"
for ID_file in ${files_chrM_ID}
do
# Moved to inner
# echo "$(date +%H:%I:%S) $ID_file is being treated"
sample=${ID_file: -12}
sample=${sample:0:8}
# Moved to inner
# echo "$(date +%H:%I:%S) $sample is being treated"
for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
do
echo "$bam_file_target"
echo "$sample"
echo "$ID_File"
done
done | parallel -n3 sam_awk
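Here parallel -n3 consumes the echoed values three at a time, so each sam_awk job receives bam_file_target, sample, and ID_file as $1, $2, and $3.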
Since you have not given us any test data, I cannot test whether these scripts will actually run, so there may be errors in them.
If you have not already done so, read at least chapters 1+2 of "GNU Parallel 2018" (available at http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or as a download at https://doi.org/10.5281/zenodo.1146014)
It should take you less than 20 minutes and your command line will love you for it.
Upvotes: 2