Richard Stelling

Reputation: 25665

Improving Shell Script Performance

This shell script extracts a line of data from $2 whenever it contains the pattern $line.

$line is constructed using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match) from the lines in file $1.

#! /bin/sh

clear

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
do
    echo `cat "$2" | grep -m 1 "\b$line\b"`
done

File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).

File $2 has slightly longer lines of text (between 80 and 200 chars) and has 2M+ lines (approx. 200 MB).

The desktops this runs on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.

Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output to another file).

NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system. In addition, the data comes from a third party and is prone to random formatting.

Upvotes: 1

Views: 1835

Answers (5)

John Kugelman

Reputation: 361625

Quick suggestions:

  1. Avoid the useless use of cat and change cat X | grep Y to grep Y X.

  2. You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.

Thus:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | while read line; do
    grep -m 1 "\b$line\b" "$2"
done

Next step:

  1. Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
  2. Replace loop with sed.

No more repeated grep:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns
grep -f patterns "$2"

Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:

grep -f <(grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"

That's great unless you have so many patterns grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns

while [ -s patterns ]; do
    grep -f <(head -n 100 patterns) "$2"
    sed -e '1,100d' -i patterns
done

That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.

Upvotes: 7

Jan

Reputation: 2490

If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. It should look like:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1"

Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so:

grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+" "$1"
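A quick sanity check that the widened character class picks up underscores (the sample address is made up for illustration):

```shell
# Widened class [A-Z0-9._-] now includes "_", so the full address matches.
printf 'john_doe@example.com\n' | grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+"
# → john_doe@example.com
```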

Upvotes: 2

Mikael Auno

Reputation: 9060

As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.

First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU-intensive and you have multiple CPUs). However, there is another very important aspect: the line

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`

may become too long for the shell to handle. Most shells (to my knowledge, at least) limit the length of a command line, or at least of the arguments to a command, and I think this could become a problem for the for loop too.
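To make the limit concrete, here is a small sketch: getconf ARG_MAX reports the kernel's command-line size cap on POSIX systems, while the piped form sidesteps the limit by reading one line at a time (the sample addresses are invented):

```shell
# The backtick version must fit the entire grep output into one command
# line; the kernel caps that size (exact value varies per system).
echo "ARG_MAX here: $(getconf ARG_MAX) bytes"

# The piped form never builds that giant argument list: read consumes
# one line at a time as the producer emits it.
printf 'a@x.com\nb@y.com\n' | while read -r line; do
    echo "matched: $line"
done
```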

Upvotes: 1

Brian

Reputation: 2261

I would take the loop out, since grepping a 2-million-line file 50k times is probably pretty expensive ;)

To take the loop out, first create a file of all your email addresses with your outer grep command. Then use this as a pattern file for your secondary grep via grep -f.
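A minimal sketch of that single-pass approach. The two sample files here are made up so it runs standalone; in the real script they would be "$1" (the address source) and "$2" (the big file). The \b word-boundary anchoring is a GNU grep extension, the same one the original script relies on:

```shell
#!/bin/sh
# Tiny made-up sample files standing in for "$1" and "$2".
printf 'contact alice@example.com today\n' > file1
printf 'row with alice@example.com data\nunrelated line\n' > file2

# Extract every address once, wrap each in \b word boundaries, and
# write one pattern per line.
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" file1 \
    | sed -E 's/.*/\\b&\\b/' > patterns

# A single pass over the big file instead of one grep per address.
grep -f patterns file2
```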

Upvotes: 2

ghostdog74

Reputation: 342373

The problem is that you are piping too many shell commands, as well as making unnecessary use of cat.

One possible solution, using just awk:

awk 'FNR==NR{
    # get all email address from file1
    for(i=1;i<=NF;i++){
        if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
            email[$i]
        }
    }
    next
}
{
 for(i in email) {
    if ($0 ~ i) {
        print 
    }
 }
}' file1 file2

Upvotes: 3
