SCO

Reputation: 1932

Can't start jobs with GNU Parallel

I'm running a 32-core machine, and I want to parallelize a very simple operation. Given an ip_addresses.txt file such as this:

1.2.3.4
8.8.8.8
120.120.120.120

I'd like to resolve these IPs to their respective ISPs using a script called script.sh. It reads an IP from standard input and prints the IP followed by its ISP. For example, given 1.2.3.4, it works fine:

echo 1.2.3.4 | ./script.sh
1.2.3.4|Google

The ip_addresses.txt file contains several million unique IPs, so I wanted to parallelize the calls to the script. I tried this:

cat ip_addresses.txt | parallel ./script.sh

But there is no output. I'd expect to get:

1.2.3.4|Google
120.120.120.120|Taiwan Academic Network

That way I could redirect the results to a file.

My script is as follows:

#!/bin/bash
while read ip
do
  ret=$(/home/sco/twdir/product/trunk/ext/libmaxminddb-1.0.3/bin/mmdblookup --file /home/sco/twdir/product/trunk/ext/libmaxminddb-1.0.3/GeoIP2-ISP.mmdb --ip $ip isp 2>/dev/null |  grep -v '^$' | grep -v '^  Could not find' | cut -d "\"" -f 2)
  [[ $ret != "" ]] &&  echo -n "$ip|" && echo $ret;
done

What did I miss? I have gone through the tutorials, but I can't sort this out.

Upvotes: 1

Views: 541

Answers (2)

d2207197

Reputation: 1384

Quoting the parallel man page:

For each line of input GNU parallel will execute command with the line as arguments.

So each input line is passed to the script as a command-line argument, not on standard input, like this:

./script.sh 1.2.3.4
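
The effect can be reproduced without GNU Parallel. In the sketch below, read_loop is a hypothetical stand-in for the original script.sh (same while-read structure, with the mmdblookup call replaced by a fixed placeholder label). Standard input is redirected from /dev/null on the assumption that this is what a job sees when the IP arrives as an argument.

```shell
#!/bin/bash
# Hypothetical stand-in for script.sh: same read loop as the original,
# with the mmdblookup call replaced by a fixed placeholder label.
read_loop() {
  local ip
  while IFS= read -r ip; do
    if [[ -n $ip ]]; then
      printf '%s|EXAMPLE-ISP\n' "$ip"
    fi
  done
}

# IP passed as an argument: the loop never looks at "$1",
# and there is nothing to read on stdin (assumed to be /dev/null).
as_argument=$(read_loop 1.2.3.4 < /dev/null)

# IP passed on stdin: the loop reads it and prints a result.
on_stdin=$(printf '1.2.3.4\n' | read_loop)

printf 'argument mode: [%s]\n' "$as_argument"
printf 'stdin mode:    [%s]\n' "$on_stdin"
```

The first call prints nothing between the brackets, which matches the missing output in the question.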

You should rewrite your script to read the IP from its first argument, $1:

#!/bin/bash
ip=$1
ret=$(/home/sco/twdir/product/trunk/ext/libmaxminddb-1.0.3/bin/mmdblookup --file /home/sco/twdir/product/trunk/ext/libmaxminddb-1.0.3/GeoIP2-ISP.mmdb --ip "$ip" isp 2>/dev/null |  grep -v '^$' | grep -v '^  Could not find' | cut -d "\"" -f 2)
[[ $ret != "" ]] &&  echo -n "$ip|" && echo $ret;

Alternatively, keep the script as it is and use the --pipe option of parallel, which feeds the input to the script on standard input:

$ cat ip_addresses.txt | parallel --pipe --block-size 10 ./script.sh
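
A script can also support both invocations by using its arguments when any are given and falling back to standard input otherwise. This is only a sketch: lookup_isp is a hypothetical function standing in for the mmdblookup pipeline, and the label it returns is a placeholder, not real lookup output.

```shell
#!/bin/bash
# Hypothetical stand-in for the mmdblookup pipeline; returns a placeholder label.
lookup_isp() {
  printf 'EXAMPLE-ISP'
}

# Print "ip|isp" for one address, skipping addresses with no result.
resolve_one() {
  local ip=$1 ret
  ret=$(lookup_isp "$ip")
  if [[ -n $ret ]]; then
    printf '%s|%s\n' "$ip" "$ret"
  fi
}

# Use the arguments if any were given, otherwise read one IP per line from stdin.
resolve_all() {
  local ip
  if (( $# > 0 )); then
    for ip in "$@"; do resolve_one "$ip"; done
  else
    while IFS= read -r ip; do resolve_one "$ip"; done
  fi
}

# Demo of both modes.
resolve_all 1.2.3.4 8.8.8.8
printf '120.120.120.120\n' | resolve_all
```

In a real script.sh the two demo lines would be replaced by a single resolve_all "$@", so the same file works with and without --pipe.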

Upvotes: 0

Ole Tange

Reputation: 33685

Your script reads multiple lines from standard input (STDIN). By default, GNU Parallel passes each input line as an argument on the command line instead. To make GNU Parallel send the input on STDIN, use --pipe:

cat ip_addresses.txt | parallel --pipe ./script.sh

This will run one job per core and pass each job 1 MB of data. But looking up addresses is not very CPU-intensive, so you might run 10 jobs per CPU (1000%):

cat ip_addresses.txt | parallel -j 1000% --pipe ./script.sh

That many jobs might hit your file handle limit. If so, nest two levels of parallel:

cat ip_addresses.txt |\
  parallel --pipe --block 50m --round-robin -j100 parallel --pipe -j50 ./script.sh

This will run 100*50 = 5000 jobs in parallel.
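
The job counts above follow from simple arithmetic. In the sketch below, jobs_for_percent is a hypothetical helper (not part of GNU Parallel) that mirrors how a percentage given to -j scales with the CPU count that GNU Parallel detects, and ulimit -n prints the open-file limit of the current shell so you can compare it against the number of jobs you plan to run.

```shell
#!/bin/bash
# Hypothetical helper: number of jobs for "-j PERCENT%" given a CPU count.
jobs_for_percent() {
  local cpus=$1 percent=$2
  echo $(( cpus * percent / 100 ))
}

jobs_for_percent 32 1000   # 32 CPUs at 1000%: 320 jobs
echo $(( 100 * 50 ))       # nested layout: 100 outer x 50 inner = 5000 jobs
ulimit -n                  # open-file limit for the current shell
```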

If you do not want to wait for a full 1 MB block to be processed before you get any output, you can lower the block size to 1k:

cat ip_addresses.txt | parallel -j 1000% --pipe --block-size 1k ./script.sh

cat ip_addresses.txt |\
  parallel --pipe --block 50k --round-robin -j100 parallel --pipe --block 1k -j50 ./script.sh

Upvotes: 1
