jksl

Reputation: 323

Very slow loop using grep or fgrep on large datasets

I’m trying to do something pretty simple: for each string in a list, grep for an exact match of that string in the files in a directory:

#try grep each line from the files
for i in $(cat /data/datafile); do 
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC and split the job by files to search, it looks like it could take over a week.

There are similar questions here:

Loop Running VERY Slow :

Very slow foreach loop

Although they are on different platforms, I think possibly if/else might help me, or fgrep, which is potentially faster (but it seems a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

Upvotes: 8

Views: 5351

Answers (5)

Ole Tange

Reputation: 33748

Since you are searching for simple strings (and not regexps), you may want to use comm:

comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt

It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.

On an 8-core machine this takes 100 seconds on these files:

$ wc find_this; cat in_this.* | wc
3637371   4877980 307366868 find_this
16000000 20000000 1025893685

Be sure to have a reasonably new version of sort. It should support --parallel.
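For reference, here is a minimal sketch of the same idea with the sorting parallelised explicitly (this assumes a GNU sort with --parallel support; the paths are the ones from the question and are only illustrative):

# comm requires sorted input; -12 prints only the lines common to both inputs,
# i.e. the query strings that also occur verbatim as lines in the data files.
comm -12 \
    <(sort --parallel=8 /data/datafile) \
    <(sort --parallel=8 /data/filestosearch/*) \
    > /data/output.txt

Note that this only helps if "exact match" really means a whole-line match: comm compares complete lines, not substrings.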

Upvotes: 1

Hari Menon

Reputation: 35495

As Martin has already said in his answer, you should use the -f option instead of looping; it should be considerably faster.

Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.

Other than that, 40 million lines should not be a very big deal for grep if there were only one string to match; it should manage that in a minute or two on any decent machine. I tested it: 2 million lines take 6 s on my laptop, so 40 million lines should take about 2 minutes.

The problem is the fact that there are 20 million strings to be matched. It is probably running out of memory or something similar, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file, for example into chunks of 100,000 patterns each? A sketch of that follows.
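A minimal sketch of that splitting idea, using split from coreutils plus one grep -F -f per chunk (the chunk size and the temporary prefix are just placeholders):

# Split the 20M-line pattern file into 100,000-line pieces named /tmp/patterns.aa, .ab, ...
split -l 100000 /data/datafile /tmp/patterns.

# Run one fixed-string, file-driven grep per chunk.
for chunk in /tmp/patterns.*; do
    grep -F -r -f "$chunk" /data/filestosearch >> /data/output.txt
done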

EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep onto several cores and several machines.
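A hedged sketch of what that could look like for this problem (assuming GNU parallel; --pipepart splits the pattern file on line boundaries and feeds each block to one grep on its standard input, where -f - reads the patterns from stdin):

# One grep -F per ~10 MB block of patterns, run in parallel across the cores.
parallel --pipepart -a /data/datafile --block 10M \
    'grep -F -r -f - /data/filestosearch' >> /data/output.txt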

Upvotes: 2

David W.

Reputation: 107090

Here's one way to speed things up:

while read i
do
    LOOK=$(echo $i)
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile

When you write for i in $(cat /data/datafile), you first spawn another process, and that process must cat out all of those lines before the rest of the script runs. Plus, there's a good chance that you'll overflow the command line and lose some of the entries at the end.

By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn that extra shell. Plus, your script will start reading through the while loop immediately, without first having to cat out the entire /data/datafile.

If the values of $i are directories and you are interested in the files underneath them, I wonder if find might be a bit faster than fgrep -r.

while read i
do
    LOOK=$(echo $i)
    find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile

xargs takes the output of find and passes as many files as possible to each fgrep invocation. xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system) something like this:

find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt

On the Mac it's

find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt

As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.

Another possibility is to switch to Perl or Python which actually will run faster than the shell will, and give you a bit more flexibility.

Upvotes: 1

Martin

Reputation: 38329

Sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)

so grep can already do what your loop is doing, and you can replace the loop with:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt

Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
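If "an exact match for the string" means the pattern must match a whole line (or a whole word) rather than any substring, GNU grep can also enforce that as a small variation on the command above:

# -x restricts matches to whole lines; use -w instead for whole-word matches.
grep -F -x -r -f /data/datafile /data/filestosearch >> /data/output.txt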

Upvotes: 5

Igor Chubin

Reputation: 64623

You can write a Perl/Python script that will do the job for you. It saves all the forks you otherwise need when you do this with external tools.

Another hint: you can combine the strings you are looking for into one regular expression. In that case grep makes only one pass over the files for all the combined patterns.

Example:

Instead of

for i in ABC DEF GHI JKL
do
grep $i file >> results
done

you can do

egrep "ABC|DEF|GHI|JKL" file >> results
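A hedged sketch of building that combined pattern from a file (practical only for a modest number of strings, since the whole alternation has to fit in one pattern, and any regex metacharacters in the strings would need escaping first):

# Join the first 1000 entries of the list with '|' into a single alternation.
PATTERNS=$(head -n 1000 /data/datafile | paste -s -d '|' -)
grep -E "$PATTERNS" file >> results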

Upvotes: 0
