fabioln79

Reputation: 395

Faster grep function for big (27GB) files

I have a 5 MB file containing specific strings, and I need to grep those same strings (along with other information on their lines) out of a big 27 GB file. To speed up the analysis I split the 27 GB file into 1 GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180 KB output file takes 30 hours!

Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?

#!/bin/bash

NR_CPUS=4
count=0


for z in `echo {a..z}` ;
do
 for x in `echo {a..z}` ;
 do
  for y in `echo {a..z}` ;
  do
   for ids in $(cat input.sam|awk '{print $1}');  
   do 
    grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
    let count+=1
    [[ $((count%NR_CPUS)) -eq 0 ]] && wait
   done
  done
 done
done

Upvotes: 10

Views: 14149

Answers (4)

Ole Tange

Reputation: 33685

Using GNU Parallel it would look like this:

awk '{print $1}' input.sam > idsFile.txt
doit() {
   LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

If the order of the lines is not important this will be a bit faster:

parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

Upvotes: 0

dogbane

Reputation: 274532

A few things you can try:

1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.

2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.

3) Use fgrep because you're searching for a fixed string, not a regular expression.

4) Use -f to make grep read patterns from a file, rather than using a loop.

5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.

After making those changes, this is what your script would become:

awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
 for x in {a..z}
 do
  for y in {a..z}
  do
    LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
  done >> output.txt
 done
done

Also, check out GNU Parallel which is designed to help you run jobs in parallel.
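For instance, a minimal sketch of the same job driven by GNU Parallel (assuming it is installed, and assuming the split files can be matched with the glob sample_???; idsFile.txt is the pattern file generated above):

# One fgrep job per sample file; GNU Parallel runs one job per CPU core by default
# and groups each job's output, so lines from different jobs do not interleave.
parallel 'LC_ALL=C fgrep -f idsFile.txt {}' ::: sample_??? | awk '{print $1,$10,$11}' > output.txt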

Upvotes: 16

peteches

Reputation: 3609

OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.

ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real    0m19.250s
user    0m8.578s
sys     0m1.254s


time grep --mmap -e aaa -e bbb test.txt
<output>
real    0m18.087s
user    0m8.709s
sys     0m1.198s

So using the --mmap option shows a modest improvement on a ~2 GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
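For instance, a rough sketch of that combination (one grep over all the split files, reading fixed-string patterns from a file, with --mmap added); the names idsFile.txt and sample_??? are assumptions, and newer versions of GNU grep no longer honour --mmap:

# Single grep invocation: -F for fixed strings, -f to read patterns from a file,
# -h to suppress filename prefixes so the awk field numbers still line up.
awk '{print $1}' input.sam > idsFile.txt
LC_ALL=C grep --mmap -h -F -f idsFile.txt sample_??? | awk '{print $1,$10,$11}' > output.txt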

It should be noted, though, that mmap can be a bit quirky if the source file changes during the search. From man grep:

--mmap

If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.

Upvotes: 0

Brian Agnew

Reputation: 272237

My initial thought is that you're repeatedly spawning grep. Spawning processes is (relatively) very expensive, and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require continual process creation.

For example, in each inner loop you're kicking off cat and awk (you won't need cat since awk can read files itself, and in fact doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for four greps to finish and go around again.
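For illustration, a rough single-pass sketch of that idea using awk instead of Perl (assuming, as the printed columns suggest, that the ID is always the first field of the matching lines, and that the split files match the hypothetical glob sample_???):

# One awk process: load the IDs from input.sam once, then stream every
# sample file, printing fields 1, 10 and 11 of lines whose first field matches.
awk 'NR==FNR {ids[$1]; next} $1 in ids {print $1, $10, $11}' input.sam sample_??? > output.txt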

If you have to use grep, you can use

grep -f filename

to specify the set of patterns to match in that file, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.

Upvotes: 4
