martin

Reputation: 1185

Use more than one core in bash

I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an Illumina sequencing file. I have 32 files to grind through, and one file takes about 5 hours to process. I have a CentOS server with 128 cores.

I've found a few solutions, but each one ends up using only one core. The last one seemed to fire off 32 nohups, but it still pushed everything through a single core.

My question is: does anyone have an idea how to use the server's full potential? Basically, every file can be processed independently; there are no relations between them.

This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice found here on Stack Overflow and on the Internet:

#!/bin/bash
FILES=/home/daw/raw/*
count=0

for f in $FILES
do
  base=${f##*/}
  echo "process $f file..."
  nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
  (( count ++ ))
  if (( count = 31 )); then
        wait
        count=0
  fi
done

To explain: FILES is the list of files from the raw folder.

The "core" line that executes nohup: the first path is the path to the tool, the -a path is the path to the file with the adapter patterns to cut, -o saves the output under the same file name as the input with OUT prepended, and the last parameter is the input file to be processed.

Here is the tool's README: https://github.com/vsbuffalo/scythe

Does anybody know how to handle this?

P.S. I also tried moving nohup before count, but it still uses only one core. There are no limitations on the server.

Upvotes: 4

Views: 175

Answers (1)

Mark Setchell

Reputation: 207758

IMHO, the most likely solution is GNU Parallel, so that you can run up to, say, 64 jobs in parallel, something like this:

parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*

This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes. That is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which then takes a further 5 hours after that. Note that I arbitrarily chose 64 jobs here; if you don't specify otherwise, GNU Parallel will run 1 job per CPU core you have.
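
Since you have 32 files and 128 cores, you could equally omit -j and let GNU Parallel start all 32 jobs at once; a minimal sketch with the same paths as above:

parallel /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*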

Useful additional parameters are:

  • parallel --bar ... gives a progress bar
  • parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything (see the sketch after this list)
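
For example, combining those with the command above (same paths; just a sketch):

# preview the 32 commands without running anything
parallel --dry-run -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*

# then run for real with a progress bar
parallel --bar -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*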

If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:

parallel -S server1,server2,server3 ...
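
For instance, assuming scythe and the adapter file are installed at the same paths on every server, a rough sketch using the placeholder host names above would be:

parallel -S server1,server2,server3 --trc OUT{/} /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*

Here --trc is shorthand for --transfer --return --cleanup: each input file is copied to the remote host, the named OUT file is fetched back, and the temporary copies are removed afterwards.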

Upvotes: 3
