Victor

Reputation: 5807

How do I speed this up?

The following code makes a list of names and 'numbers' and gives each person a random age between 15 and 90.

#!/bin/sh

file=$1
n=$2

# if number is zero exit
if [ "$n" -eq "0" ]
then
    exit 0
fi

echo "Generating list of $n people."

for i in `seq 1 $n`;
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####" >> $file
done

echo "List generated."

With it I'm attempting to make a list of 1M names. It's slow, and I expected that, but it was so slow that I lost patience and tried 10K names instead. That was slow too, but it finished in a few seconds.

The reason I'm generating the names is to sort them. What surprised me is that when I sorted the list of 10K names it was instant.

How can this be?

Is there something that's making this go ungodly slow? Both the sorting and the generating access files, so how can the sorting be so much faster? Is the random number math in the list generator what's slowing it down?

Here's my sorting script.

#!/bin/sh
#first argument is list to be sorted, second is output file
tr -s ' ' < $1 | sort -n -k2 > $2

Upvotes: 2

Views: 1921

Answers (7)

TheBonsai

Reputation: 16545

Not a new answer, just new code.

This is, IMHO, a good middle ground between nice and efficient code (as efficient as you can be in Bash; it IS slow, it's a shell...)

for ((i=1;i<=n;i++));
do
  echo "name$i $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

Alternative, not using a classic counter loop

i=1
while ((i<=n)); do
  echo "name$((i++)) $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

Both are about the same speed.

The fixes are the same as mentioned by all the others:

  • do not frequently close and re-open the file
  • use shell arithmetics
  • ah yes, and use QUOTES, but that's for sanity, not for speed

Upvotes: 6

James Thompson

Reputation: 48212

Generating random numbers like this isn't really what the shell was designed for. You'll likely be better off writing something that generates random numbers from a uniform distribution in another language, like Fortran, Perl or C.

In your code, one thing that's going to be very slow is generating the sequence of numbers from 1 to 1,000,000 with seq and expanding them all into the for loop's word list. That's likely very wasteful, but you should profile if you want to be sure. As chaos points out, appending to the file is also likely to be very costly!

In Python, you can do something like this:

#!/usr/bin/python
import random
count = 1

print ' '.join( ['name', 'age'] )
while count <= 1000000:
    age = random.randrange(15, 90)   # random age in 15..89
    name = 'name' + str(count)
    count = count + 1
    print ' '.join( [ name, str(age) ] )

Running that on my laptop takes ~10 seconds. In your script, just expanding the seq from 1 to 1,000,000 takes ~10 seconds; when you add the random number generation, it takes over three minutes on the same machine. I got frustrated just as you did and played around with the script to try to make it faster. Here's my shortened version of your code that I'm playing with:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM >> test.txt
done

Running this takes about 5.3s:

$ time ./test.sh
real    0m5.318s
user    0m1.305s
sys     0m0.675s

Removing the per-iteration file append, and instead redirecting stdout to a single file when the script is invoked, gives the following script:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM
done

Running this takes about half a second:

$ time ./test.sh > test.txt
real    0m0.516s
user    0m0.449s
sys     0m0.067s

The slowness of your program is at least partly due to appending to that file. Curiously, when I tried to swap the seq call with a for loop, I didn't notice any speedup.
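
The exact for-loop variant isn't shown here; as an assumption, it presumably looked something like this sketch, keeping the single redirect at invocation:

for ((x = 1; x <= 10000; x++)); do
   let "NUM=($RANDOM%75)+15"
   echo $NUM
done

Which is consistent with the timings above: at this size the per-line append dominates, not the seq call itself.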

Upvotes: 5

John Kugelman

Reputation: 361977

for i in `seq 1 $n`

Yikes! This is generating 1,000,000 arguments to the for loop. That seq call will take a long, long, long time. Try

for ((i = 1; i <= n; i++))

Notice the lack of dollar signs, by the way. Peculiarly, the var++ syntax requires you to omit the dollar sign from the variable name. Elsewhere you may use or omit them: it could be i <= n or $i <= $n, either one. The way I figure, you should omit dollar signs entirely in let, declare, and for ((x; y; z)) statements. See the ARITHMETIC EVALUATION section of the bash man page for the full explanation.
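
To make that concrete, here is a sketch of the generator loop rewritten with the arithmetic for loop. This assumes bash, since (( )), let and $RANDOM aren't guaranteed by a plain POSIX sh; the single redirect on done follows the other answers:

#!/bin/bash
# Sketch: the original generator with the seq expansion replaced by an
# arithmetic for loop. No dollar signs are needed inside (( )).
file=$1
n=$2

for ((i = 1; i <= n; i++)); do
    echo "name$i $(( (RANDOM % 75) + 15 )) (###)###-####"
done > "$file"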

Upvotes: 5

pgs

Reputation: 13955

Try this for your main loop:

seq 1 $n | while read i
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####"
done > $file

This makes the seq and the loop run in parallel instead of waiting for the seq to finish before the loop starts. It will be faster with multiple cores/CPUs, but slightly slower on a single core.

And I agree with the others here: Does it have to be bash?

Edit: added chaos' suggestion to keep the file open rather than re-opening it for append for each name.

Upvotes: 2

Roger Pate

Reputation:

(I have a feeling you may not like this answer, but you technically didn't specify the answer had to remain in bash! :P)

It's common to rapidly develop something in a prototyping language, and then possibly switch to another language (often C) as needed. Here's a very similar program in Python for you to compare:

#!/usr/bin/python
import sys
import random

def main(args=None):
    args = args or []
    if len(args) == 1:
        # default first parameter
        args = ["-"] + args
    if len(args) != 2:
        sys.stderr.write("error: invalid parameters\n")
        return 1
    n = int(args[1])
    output = sys.stdout if args[0] == "-" else open(args[0], "a")

    for i in xrange(1, n + 1):
        num = random.randint(15, 89)  # age in 15..89, matching the original script
        output.write("name%s %s (###)###-####\n" % (i, num))

    sys.stderr.write("List generated.\n") # see note below

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

Note: using stdout only for "real output" and stderr for status notifications allows this program to be run in parallel with others, piping data directly from the stdout of one to the stdin of another. (It's possible with special files in *nix, but it's just easier if you can use stdout.) Example:

$ ./rand_names.py 1000000 | sort -n -k2 > output_file

And it should be fast enough:

$ time ./rand_names.py 1000000 > /dev/null
List generated.

real    0m16.393s
user    0m15.108s
sys     0m0.171s

Upvotes: 2

John Nilsson

Reputation: 17307

I guess the '>> $file' may be the source of your problem. On my system your script takes about 10 seconds to generate 10,000 names. If I remove the $file argument and instead just write to stdout, capturing the whole thing to a file, it takes under a second.

$ time ./gen1.sh n1.txt 10000
Generating list of 10000 people.
List generated.

real    0m7.552s
user    0m1.355s
sys     0m1.886s

$ time ./gen2.sh 10000 > n2.txt

real    0m0.806s
user    0m0.576s
sys     0m0.140s
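
gen2.sh isn't shown; as an assumption, it is presumably just the original script with the file argument and the per-name append removed, writing to stdout instead:

#!/bin/sh
# Sketch (assumption): gen2.sh writes each line to stdout; the caller
# captures the whole run with one redirect, as in the timing above.
n=$1

for i in `seq 1 $n`; do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####"
done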

Upvotes: 4

chaos

Reputation: 124335

Don't know if it's the whole story, but re-opening the file to append to it for every name can't be helping anything. Doing the whole thing in any context where you can keep an open file handle to write to should help a lot.
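
As a sketch of one way to do that in the shell (assuming bash-style let and $RANDOM as in the question; the descriptor number 3 is arbitrary), open the file once, write every line to the open descriptor, and close it at the end:

# Open $file once on file descriptor 3, write to it inside the loop, close it after.
exec 3> "$file"
for i in `seq 1 $n`; do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####" >&3
done
exec 3>&-

Redirecting the whole loop with done > "$file", as other answers show, has the same effect.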

Upvotes: 3
