Reputation: 1
I need to extract the email addresses from a large 190 GB flat file (just an error log, 152,353,216 lines) that I have split into 5 MB files.
The grep command works well, but memory quickly becomes saturated and I end up with errors.
The files have no fixed format, so I have to use a regexp.
grep -r -E -h -o "\b(pattern)\b" /dir/* > outs.txt
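For reference, an email-matching pattern for grep -E might look something like the one below; the exact regexp is not the problem here.
# illustrative pattern and file name only; the real regexp depends on the log contents
grep -E -h -o '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' /dir/one-chunk.txt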
How can I process the files one by one?
Upvotes: 1
Views: 555
Reputation: 1769
Use xargs to execute your grep command on each file separately (rather than on all the files at once):
ls -1 /dir/* | xargs -n 1 -I '{}' grep -r -E -h -o "\b(pattern)\b" '{}' > outs.txt
The -n 1 flag instructs xargs to run one process per file.
The -I '{}' argument instructs xargs to replace '{}' with the name of the file.
In other words, if /dir contains file1, file2, ..., it executes successively:
grep -r -E -h -o "\b(pattern)\b" /dir/file1
grep -r -E -h -o "\b(pattern)\b" /dir/file2
grep -r -E -h -o "\b(pattern)\b" /dir/file3...
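If the file names might contain spaces, a variant that avoids parsing the output of ls could look like this (a sketch, not from the original answer):
# print each path NUL-terminated so unusual file names survive the pipe
printf '%s\0' /dir/* | xargs -0 -n 1 grep -E -h -o "\b(pattern)\b" > outs.txt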
Upvotes: 1
Reputation: 207415
Depending on your data, your disk performance and your CPU, you may get on better with GNU Parallel. If you use the --pipepart option, it will also split your 190 GB file up for you without creating temporary files.
So, I created a 5GB file with 100000000 lines using Perl like this:
perl -E 'for($i=0;$i<100000000;$i++){say "Line $i,field2,field3,junk,junk,junk",int rand 1000000}' > BigBoy.txt
The first 3 lines look like this:
Line 0,field2,field3,junk,junk,junk514649
Line 1,field2,field3,junk,junk,junk257773
Line 2,field2,field3,junk,junk,junk203414
I then timed a grep on that file at 58 seconds; it produced 88 lines of output:
time grep "junk426888$" BigBoy.txt
I then timed GNU Parallel at 11 seconds for the same output:
time parallel -a BigBoy.txt --pipepart --block -1 grep "junk426888$"
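Applied to the original log rather than the test file, the invocation might look roughly like this (a sketch; big.log and the pattern are stand-ins for the real file and email regexp):
# -q stops the shell spawned by parallel from re-interpreting the regexp
parallel -q -a big.log --pipepart --block -1 grep -E -h -o '\b(pattern)\b' > outs.txt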
Upvotes: 2
Reputation: 74595
The simplest (but probably not the quickest) way to process all the files would be to do so one by one, using a loop:
for file in /dir/*; do
    grep -r -E -h -o '\b(pattern)\b' "$file"
done > outs.txt
The overhead of launching all those greps is potentially quite significant, though, so maybe you could use xargs to help:
find /dir/ -maxdepth 1 -type f -print0 |
xargs -0 -n 1000 grep -r -E -h -o '\b(pattern)\b' > outs.txt
This uses find to produce the list of files in /dir and passes them safely to xargs, separated by a null byte \0 (a character guaranteed not to be in a filename). xargs then passes the files to grep in batches of 1000.
(I'm assuming that you have GNU versions of find and xargs here, for find -print0 and xargs -0.)
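Since GNU xargs is assumed anyway, its -P option could also run several of those batches in parallel (a sketch; -P 4 is an arbitrary job count, and output lines from concurrent greps may interleave):
# same pipeline as above, but up to 4 grep batches run at a time
find /dir/ -maxdepth 1 -type f -print0 |
  xargs -0 -n 1000 -P 4 grep -E -h -o '\b(pattern)\b' > outs.txt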
Upvotes: 1