Reputation: 5041
grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded. Is there any alternative to grep? Is there anything about my syntax that can be improved (would egrep or fgrep be better)?
The file is actually in a directory which is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?
The grep is using up to 93% CPU.
Upvotes: 160
Views: 119839
Reputation: 388
Try ripgrep
It gives much better results than grep. For example, in a live test on an 11 GB mailbox archive:
rg (ripgrep):
time rg -c "^From " ~/Documents/archive.mbox
99176
rg -c "^From " ~/Documents/archive.mbox
1.38s user 5.24s system 62% cpu 10.681 total
vs grep
time grep -c "^From " ~/Documents/archive.mbox
99176
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -c
125.56s user 6.61s system 98% cpu 2:13.56 total
Note that I've had even better rg results than 10 s (6 s best time so far) for the same 11 GB file, while grep consistently takes more than 2 minutes.
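Applied to the question's use case, a minimal sketch (assuming ripgrep is installed as rg; -F treats the pattern as a fixed string and -C 5 gives 5 lines of context, matching the original -A 5 -B 5):
rg -i -F -C 5 'db_pd.Clients' eightygigsfile.sql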
Upvotes: 4
Reputation: 2925
Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
Then I randomly selected rows at an average rate of 1 in every 3^5, using rand() rather than just NR % 243, and placed the string db_pd.Clients at a random position in the middle of the existing text, for a total of 2.16 million rows where the regex pattern hits:
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2 took just 59 seconds to extract the list of row ranges it needs. From there it should be relatively trivial (a rough sketch of that step follows below); some of the ranges may overlap. At throughput rates of 1.3 GiB/s, as measured above by pv, it might even be detrimental to use utilities like parallel to split the task.
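A rough illustrative sketch of that follow-up step (not part of the original answer), assuming the start,end pairs above are saved to a hypothetical file ranges.txt, sorted and with any overlaps already merged:
awk -F, '
  NR == FNR { start[++n] = $1; end[n] = $2; next }   # first file: load the row ranges
  {
    while (i < n && FNR > end[i + 1]) i++            # skip ranges we have already passed
    if (i < n && FNR >= start[i + 1] && FNR <= end[i + 1]) print
  }
' ranges.txt testfile_gigantic_001.txt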
Upvotes: 0
Reputation: 21
All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> <fixed_string> <filename>.
However, sometimes there may be zero or more repetitions of the pattern, in which case calculating maxnum isn't possible. The workaround is to use a start and an end pattern for the event(s) you are trying to process, and then work on the line numbers between them, like so:
startline=$(grep -n -m 1 "$start_pattern" file | awk -F: '{print $1}')
endline=$(grep -n -m 1 "$end_pattern" file | awk -F: '{print $1}')
logs=$(tail -n "+$startline" file | head -n $((endline - startline + 1)))
Then work on this subset of logs!
Upvotes: 2
Reputation: 1280
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something I am currently testing; the -j and -n option values seemed to work best for my use case. The -F option to grep also made a big difference.
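A minimal sketch of the multiple-strings case (strings.txt is a hypothetical file containing one search string per line; -F treats each line as a fixed string rather than a regex):
grep -F -f strings.txt eightygigsfile.sql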
Upvotes: 1
Reputation: 54592
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include:
- Dropping the -i flag if you don't really need case-insensitive matching.
- Using the -F flag, since you are searching for a fixed string.
- Setting LANG=C.
- Using the -m flag to stop after a maximum number of matches.
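A sketch combining these options with the --pipe invocation above (assuming the pattern really is a fixed string and case-insensitive matching is not needed):
LANG=C < eightygigsfile.sql parallel --pipe --block 10M grep -F -C 5 'db_pd.Clients'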
Upvotes: 50
Reputation: 274888
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep, because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to a RAM disk.
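A minimal sketch of the RAM-disk idea, assuming a tmpfs mount such as /dev/shm is available and has enough free space to hold the whole file (for an 80 GB file that means a machine with a lot of RAM):
cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql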
Upvotes: 207
Reputation: 16947
Some trivial improvements:
- Remove the -i option if you can; case-insensitive matching is quite slow.
- Replace the . with \. in the pattern; a single dot is the regex symbol that matches any character, which is also slow.
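A sketch of the original command with both changes applied (assuming case-insensitive matching really isn't needed):
grep -A 5 -B 5 'db_pd\.Clients' eightygigsfile.sql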
Upvotes: 10
Reputation: 65342
Two lines of attack:
- Do you really need the -i, or do you have a possibility to get rid of it?
- grep is single-threaded, so you might want to start more of them at different offsets (see the sketch below).
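A rough illustrative sketch of the multiple-offsets idea (not from the original answer): split the file into four byte ranges and run one grep per range in the background. Matches that span a chunk boundary can be missed or have truncated context, so treat the result as an approximation:
size=$(stat -c%s eightygigsfile.sql)
chunk=$(( (size + 3) / 4 ))
for i in 0 1 2 3; do
  tail -c +$(( i * chunk + 1 )) eightygigsfile.sql | head -c "$chunk" \
    | grep -i -A 5 -B 5 'db_pd.Clients' > "part_$i.out" &
done
wait
cat part_*.out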
Upvotes: 3