Reputation: 539

Grep to multiple output files

I have one huge file (over 6GB) and about 1000 patterns. I want to extract the lines matching each pattern into a separate file. For example, my patterns are:

1
2

my file:

a|1
b|2
c|3
d|123

As output I would like to have 2 files:

1:

a|1
d|123

2:

b|2
d|123

I can do it by grepping the file multiple times, but that is inefficient for 1000 patterns and a huge file. I also tried something like this:

grep -f pattern_file huge_file

but it produces only 1 output file. I can't sort my huge file because it takes too much time. Maybe AWK could do it?

Upvotes: 4

Answers (5)

Steve Summit

Reputation: 1

I had this need, so I added the capability to my own copy of grep.c that I happened to have lying around. But it just occurred to me: if the primary goal is to avoid multiple passes over a huge input, you could run egrep once on the huge input to search for any of your patterns (which, I know, is not what you want), and redirect its output to an intermediate file, then make multiple passes over that intermediate file, once per individual pattern, redirecting to a different final output file each time.
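The two-stage idea above can be sketched roughly as follows. This assumes the patterns are stored one per line in pattern_file (the name used in the question) and that each pattern is also usable as part of a file name. The intermediate name candidates.txt and the out_ prefix are made up for illustration.

```shell
# Sample data from the question, created in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 1 2 > pattern_file
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file

# Pass 1: a single scan of the big input keeps every line that
# matches at least one pattern (candidates.txt is a hypothetical name).
grep -f pattern_file huge_file > candidates.txt

# Pass 2: one grep per pattern over the (hopefully much smaller) file.
# -e passes the pattern explicitly, so a leading "-" is not read as an option.
while IFS= read -r pattern; do
    grep -e "$pattern" candidates.txt > "out_$pattern"
done < pattern_file
```

This only pays off when the patterns are selective, that is, when candidates.txt ends up far smaller than the original input.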

Upvotes: 0

michael

Reputation: 9819

You can accomplish this (if I understand the problem) using bash "process substitution". For example, consider the following sample data:

$ cal -h
   September 2013     
Su Mo Tu We Th Fr Sa  
 1  2  3  4  5  6  7  
 8  9 10 11 12 13 14  
15 16 17 18 19 20 21  
22 23 24 25 26 27 28  
29 30

Then selected lines can be grepped into different output files in a single command:

$ cal -h \
    | tee >( egrep '1'    > f1.txt ) \
    | tee >( egrep '2'    > f2.txt ) \
    | tee >( egrep 'Sept' > f3.txt )

In this case, each grep is processing the entire data stream (which may or may not be what you want: this may not save a lot of time vs. just running concurrent grep processes):

$ more  f?.txt
::::::::::::::
f1.txt
::::::::::::::
   September 2013     
 1  2  3  4  5  6  7  
 8  9 10 11 12 13 14  
15 16 17 18 19 20 21  
::::::::::::::
f2.txt
::::::::::::::
   September 2013     
 1  2  3  4  5  6  7  
 8  9 10 11 12 13 14  
15 16 17 18 19 20 21  
22 23 24 25 26 27 28  
29 30                 
::::::::::::::
f3.txt
::::::::::::::
   September 2013

Upvotes: 5

Dimitre Radoulov

Reputation: 28010

awk -F\| 'NR == FNR {
  patt[$0]; next
  }
{
  for (p in patt)
    if ($2 ~ p) print > p
  }' patterns huge_file

With some awk implementations you may hit the max number of open files limit. Let me know if that's the case so I can post an alternative solution.

P.S.: This version will keep only one file open at a time:

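awk -F\| 'NR == FNR {
  patt[$0]; next          # first file: remember each pattern
  }
{
  for (p in patt)
    if ($2 ~ p) {
      print >> p          # append, so earlier matches survive reopening
      close(p)            # release the file right after writing
    }
  }' patterns huge_file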
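As a quick check, the first script can be run against the sample data from the question. The scratch directory is only there for illustration, and patterns and huge_file are the file names used in the commands above. Nothing here is tuned for a 6GB input.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 1 2 > patterns
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file

awk -F\| 'NR == FNR {
  patt[$0]; next            # first file: collect the patterns
  }
{
  for (p in patt)
    if ($2 ~ p) print > p   # output file is named after the pattern
  }' patterns huge_file

cat 1    # a|1 and d|123
cat 2    # b|2 and d|123
```

Because ~ performs a regular-expression match on the second field, the pattern 1 also matches 123, which is what the expected output in the question asks for.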

Upvotes: 5

potong

Reputation: 58578

This might work for you (although sed might not be the quickest tool!):

 sed 's,.*,/&/w &_file,' pattern_file > sed_file

Then run this file against the source:

 sed -nf sed_file huge_file

I did a cursory test, and the GNU sed version 4.1.5 I was using opened 1000 files without trouble. However, your Unix system may well have smaller limits.
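To make the generated script concrete, here is what the two steps produce for the sample data in the question. The scratch directory is only for illustration; the file names are the ones used above.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 1 2 > pattern_file
printf '%s\n' 'a|1' 'b|2' 'c|3' 'd|123' > huge_file

# Turn every pattern into a sed command of the form /PATTERN/w PATTERN_file
sed 's,.*,/&/w &_file,' pattern_file > sed_file
cat sed_file
# /1/w 1_file
# /2/w 2_file

# -n suppresses the default output; each w command writes its matching lines.
sed -nf sed_file huge_file
cat 1_file    # a|1 and d|123
cat 2_file    # b|2 and d|123
```

Note that a pattern containing / would have to be escaped before it could serve as a sed address in this way.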

Upvotes: 1

Llamageddon

Reputation: 3526

Grep cannot output matches of different patterns to different files. Tee is able to redirect its input to multiple destinations, but I don't think this is what you want.

Either use multiple grep commands or write a program to do it in Python or whatever other language you fancy.

Upvotes: 0
