Mary

Reputation: 1

grep: extract a specific pattern from a large 190GB file

I need to extract the email addresses from a large 190GB flat file (an error log with 152,353,216 lines) that I have cut into 5 MB files.

The grep command works well, but memory quickly becomes saturated and I end up getting errors.

The content of the files is not formatted, so I have to use regexp.

grep -r -E -h -o "\b(pattern)\b" /dir/* > outs.txt
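(For illustration only, (pattern) stands for an email-matching regexp here, so the command is roughly of this shape:)

grep -r -E -h -o "\b[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}\b" /dir/* > outs.txt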

How can I process the files one by one?

Upvotes: 1

Views: 555

Answers (3)

Pierre-Olivier Vares

Reputation: 1769

Use xargs to execute your grep command on each file separately (rather than on all the files at once):

ls -1 /dir/ | xargs -n 1 -I '{}' grep -r -E -h -o "\b(pattern)\b" /dir/'{}' > outs.txt

The -n 1 flag instructs xargs to run one grep process per file.

The -I '{}' argument instructs xargs to replace '{}' with the name of the file.

In other words, if /dir contains file1, file2, ..., it successively executes:

grep -r -E -h -o "\b(pattern)\b" /dir/file1
grep -r -E -h -o "\b(pattern)\b" /dir/file2
grep -r -E -h -o "\b(pattern)\b" /dir/file3...
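As a side note, parsing the output of ls can misbehave with unusual filenames; a sketch of the same one-file-at-a-time idea that feeds full, null-delimited paths instead (assuming GNU xargs) would be:

printf '%s\0' /dir/* | xargs -0 -n 1 grep -E -h -o "\b(pattern)\b" > outs.txt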

Upvotes: 1

Mark Setchell

Reputation: 207415

Depending on your data, your disk performance and your CPU, you may get on better with GNU Parallel. If you use the --pipepart option it will also split your 190GB file up for you without creating temporary files.

So, I created a 5GB file with 100000000 lines using Perl like this:

perl -E 'for($i=0;$i<100000000;$i++){say "Line $i,field2,field3,junk,junk,junk",int rand 1000000}' > BigBoy.txt

The first 3 lines look like this:

Line 0,field2,field3,junk,junk,junk514649
Line 1,field2,field3,junk,junk,junk257773
Line 2,field2,field3,junk,junk,junk203414

I then timed a grep on that file; it took 58 seconds and produced 88 lines of output:

time grep "junk426888$" BigBoy.txt

I then timed GNU Parallel at 11 seconds for the same output:

time parallel -a BigBoy.txt --pipepart --block -1 grep "junk426888$"
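Applied to the question's extraction task, the same idea would look roughly like this (a sketch; bigfile.log is a hypothetical name for the original 190GB file, and (pattern) is the same placeholder as above):

parallel -a bigfile.log --pipepart --block -1 grep -E -h -o "\b(pattern)\b" > outs.txt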

Upvotes: 2

Tom Fenech

Reputation: 74595

The simplest (but probably not the quickest) way to process all the files would be to do so one by one, using a loop:

for file in /dir/*; do
  grep -r -E -h -o '\b(pattern)\b' "$file"
done > outs.txt

The overhead of launching all those greps is potentially quite significant, though, so maybe you could use xargs to help:

find /dir/ -maxdepth 1 -type f -print0 |
  xargs -0 -n 1000 grep -r -E -h -o '\b(pattern)\b' > outs.txt

This uses find to produce the list of files in dir and passes them safely to xargs, separated by a null byte \0 (a character guaranteed not to be in a filename). xargs then passes the files to grep in batches of 1000.

(I'm assuming that you have GNU versions of find and xargs here, for find -print0 and xargs -0)
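If your disk can keep up, xargs can also run several of those batches at the same time with -P; a sketch, again assuming GNU xargs and, for example, 4 parallel jobs:

find /dir/ -maxdepth 1 -type f -print0 |
  xargs -0 -n 1000 -P 4 grep -E -h -o '\b(pattern)\b' > outs.txt

Note that with -P the order of the output lines is no longer deterministic, which usually doesn't matter for a plain list of matches.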

Upvotes: 1
