zzapper

Reputation: 5041

Grepping a huge file (80GB) any way to speed it up?

 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded. Is there any alternative to grep? Is there anything about my syntax that can be improved (is egrep or fgrep better)?

The file is actually in a directory which is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?

The grep is grabbing up to 93% CPU.

Upvotes: 160

Views: 119839

Answers (8)

Shailesh

Reputation: 388

Try ripgrep

It gives much better performance than grep.

For example, on a live test (11 GB mailbox archive):

rg (ripgrep)

time rg -c "^From " ~/Documents/archive.mbox
99176
rg -c "^From " ~/Documents/archive.mbox  
1.38s user 5.24s system 62% cpu 10.681 total

vs grep

time grep -c "^From " ~/Documents/archive.mbox
99176
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -c    
125.56s user 6.61s system 98% cpu 2:13.56 total

Note that I've had better rg results than the 10 seconds above (6 seconds best time so far) for the same 11 GB file. grep consistently takes more than 2 minutes.
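
For the original question, a roughly equivalent ripgrep invocation would be something like the following (a sketch; the -F flag assumes the pattern is meant as a literal string rather than a regex):

rg -i -F -C 5 'db_pd.Clients' eightygigsfile.sql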

Upvotes: 4

RARE Kpop Manifesto

Reputation: 2925

Hmm… what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:

rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.

and randomly selected rows at an average rate of 1 in every 3^5 (using rand(), not just NR % 243) in which to place the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 million rows where the regex pattern hits:

rows       = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.


% dtp;  pvE0 < testfile_gigantic_001.txt| 
        mawk2 '
        _^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
                           ?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','     

  in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%            
 out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
  
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372

And mawk2 took just 59 seconds to extract a list of the row ranges it needs; from there it should be relatively trivial (see the sketch below), although some of the ranges may overlap.
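
For example, a minimal sketch of that follow-up step, pulling out the first range printed above with sed and quitting as soon as it has been read:

sed -n '524755459,524755470p;524755470q' testfile_gigantic_001.txt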

At a throughput of 1.3 GiB/s, as reported by pv above, it might even be detrimental to use utilities like parallel to split up the task.

Upvotes: 0

Smita

Reputation: 21

All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.
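
Spelled out with the question's pattern, that looks something like this (the 100 is only a placeholder for whatever maxnum suits your data):

LC_ALL=C fgrep -m 100 'db_pd.Clients' file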

However, sometimes there may be zero or more repeats of the pattern, in which case calculating the maxnum isn't possible. The workaround is to grep for the start and end patterns of the event(s) you are trying to process, and then work on the line numbers between them, like so:

startline=$(grep -n -m 1 "$start_pattern" file | awk -F":" '{print $1}')
endline=$(grep -n -m 1 "$end_pattern" file | awk -F":" '{print $1}')
logs=$(tail -n +"$startline" file | head -n $((endline - startline + 1)))

Then work on this subset of logs!

Upvotes: 2

user584583

Reputation: 1280

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

If you need to search for multiple strings, grep -f strings.txt saves a ton of time (see the sketch below). The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
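
For the multiple-strings case, a sketch (strings.txt is a hypothetical file containing one fixed search string per line):

grep -F -i -C 5 -f strings.txt eightygigsfile.sql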

Upvotes: 1

Steve

Reputation: 54592

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include (combined in the sketch after this list):

  • Dropping the -i flag.
  • Using the -F flag for a fixed string
  • Disabling NLS with LANG=C
  • Setting a max number of matches with the -m flag.
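
Combining those options, a sketch (assuming a case-sensitive, fixed-string match is acceptable; the LANG=C setting is inherited by the grep workers that parallel starts):

< eightygigsfile.sql LANG=C parallel --pipe --block 10M grep -F -C 5 'db_pd.Clients'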

Upvotes: 50

dogbane

Reputation: 274888

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk.
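
For example, a sketch assuming /dev/shm is a tmpfs with enough free RAM to hold the ~80 GB file:

cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql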

Upvotes: 207

BeniBela

Reputation: 16947

Some trivial improvements:

  • Remove the -i option if you can; case-insensitive matching is quite slow.

  • Replace the . with \. (see the example after this list).

    A bare dot is the regex symbol that matches any character, which is also slow.
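
Putting both together, a sketch assuming the case-insensitive match is not actually needed:

grep -A 5 -B 5 'db_pd\.Clients' eightygigsfile.sql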

Upvotes: 10

Eugen Rieck

Reputation: 65342

Two lines of attack:

  • Are you sure you need the -i, or is there any possibility to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start several of them at different offsets (see the sketch below).
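
A rough sketch of the second idea, assuming GNU split and enough spare disk space for the temporary chunks (file names like chunk_00 are simply what split -d produces):

# split into 4 chunks without breaking lines, then grep each chunk in parallel
split -n l/4 -d eightygigsfile.sql chunk_
for f in chunk_0*; do
    grep -A 5 -B 5 'db_pd.Clients' "$f" > "$f.hits" &
done
wait
# note: matches within 5 lines of a chunk boundary may lose part of their context
cat chunk_0*.hits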

Upvotes: 3
