Helparod

Reputation: 3

shell: Get line from FILE1 by content in FILE2

I have a file (maillog) like this:

    Feb 22 23:53:39 info postfix[102]: connect from APVLDPDF01[...
    Feb 22 23:53:39 info postfix[101]: BA1D7805A1: client=APVLDPDF01[...
    Feb 22 23:53:39 info postfix[103]: BA1D7805A1: message-id 
    Feb 22 23:53:39 info opendkim[139]: BA1D7805A1: DKIM-Signature field added
    Feb 22 23:53:39 info postfix[763]: ED6F3805B9: to=<[email protected]>, relay...
    Feb 22 23:53:39 info postfix[348]: ED6F3805B9: removed
    Feb 22 23:53:39 info postfix[348]: BA1D7805A1: from=<[email protected]>,...
    Feb 22 23:53:39 info postfix[102]: disconnect from APVLDPDF01...
    Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<[email protected]>,status=sent
    Feb 22 23:53:39 info postfix[348]: 59AE0805B4: removed
    Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<[email protected]>, status=sent
    Feb 22 23:53:41 info postfix[348]: BA1D7805A1: removed

and a second file (mailids) like this:

    6DBDD8039F:
    3B15BC803B:
    BA1D7805A1:
    2BD19803B4:

I want to get an output file that contains something like this:

    Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<[email protected]>, status=sent

I only want the lines whose ID also appears in the second file; in this example, BA1D7805A1: is the only ID from mailids that occurs in maillog. There is one more condition: the line must have the form "ID to=<", that is, only lines that contain both an ID from the second file and "to=<" should be output.

I've found different solutions, but performance is a huge problem. The maillog file is 2 GB, about 10 million lines, and the mailids file has around 32,000 lines.

The process takes too long; I have never seen it finish. I've tried awk and grep, but I haven't found a fast enough way.

Upvotes: 0

Views: 102

Answers (2)

BMW

Reputation: 45353

It is better to add the -w option, so that an ID only matches as a whole word:

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

Here is the command I usually use:

    grep -Fwf mailids maillog | grep 'to=<'
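To see what -w changes, here is a small sketch on made-up data. The scratch directory, the second log line (an ID with an extra leading digit), and the example.com address are assumptions for illustration, not taken from the real log.

```shell
# Hypothetical scratch files; contents are invented for this demo.
tmp=$(mktemp -d)
printf '%s\n' 'BA1D7805A1:' > "$tmp/mailids"
cat > "$tmp/maillog" <<'EOF'
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<user@example.com>, status=sent
Feb 22 23:53:41 info postfix[919]: 0BA1D7805A1: to=<user@example.com>, status=sent
EOF

# Without -w the fixed string also matches inside the longer ID on line 2.
plain=$(grep -c -F -f "$tmp/mailids" "$tmp/maillog")
# With -w the match must not be preceded by a letter, digit, or underscore.
whole=$(grep -c -F -w -f "$tmp/mailids" "$tmp/maillog")

echo "matching lines without -w: $plain, with -w: $whole"
rm -rf "$tmp"
```

Whether such near-miss IDs actually occur in a Postfix log is not something the question tells us, so treat -w as a safety net rather than a required fix.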

If the ID is always in column 6, try this awk one-liner:

    awk 'NR==FNR{a[$1];next} /to=</&&$6 in a ' mailids maillog
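The two-file awk idiom can be tried on a small sample first. The sketch below is a variation, not the exact command above: it assumes the "to=<" part is always the 7th whitespace-separated field, directly after the ID, which matches the "ID to=<" condition in the question. The scratch directory, the array name ids, and the example.com addresses are made up for illustration.

```shell
# Hypothetical scratch files; the addresses are invented placeholders.
tmp=$(mktemp -d)
cat > "$tmp/maillog" <<'EOF'
Feb 22 23:53:39 info postfix[348]: ED6F3805B9: removed
Feb 22 23:53:39 info postfix[348]: BA1D7805A1: from=<sender@example.com>,
Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<other@example.com>,status=sent
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<user@example.com>, status=sent
Feb 22 23:53:41 info postfix[348]: BA1D7805A1: removed
EOF
cat > "$tmp/mailids" <<'EOF'
6DBDD8039F:
BA1D7805A1:
EOF

# While reading the first file (NR==FNR), store each ID as an array key.
# For the second file, print a line only if field 6 is a stored ID
# and field 7 begins with "to=<".
result=$(awk 'NR==FNR { ids[$1]; next }
              ($6 in ids) && index($7, "to=<") == 1' "$tmp/mailids" "$tmp/maillog")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each log line costs one array lookup plus one prefix test, the work per line does not grow with the number of IDs, which is why this approach tends to suit a long ID list.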

Upvotes: 1

Digital Trauma

Reputation: 16016

    grep -F -f mailids maillog | grep 'to=<'

From the grep man page:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
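As a possible variation for a large log, the two filters can be swapped so that the cheap single-string test runs first and the ID lookup only sees the "to=<" lines. The sketch below also sets LC_ALL=C, which often (not always) makes grep faster on plain ASCII logs. Both choices are assumptions worth measuring on the real data; the scratch directory and example.com addresses are made up.

```shell
# Hypothetical scratch files; the addresses are invented placeholders.
tmp=$(mktemp -d)
cat > "$tmp/maillog" <<'EOF'
Feb 22 23:53:39 info postfix[348]: BA1D7805A1: from=<sender@example.com>,
Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<other@example.com>,status=sent
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<user@example.com>, status=sent
Feb 22 23:53:41 info postfix[348]: BA1D7805A1: removed
EOF
cat > "$tmp/mailids" <<'EOF'
6DBDD8039F:
BA1D7805A1:
EOF

# Step 1: keep only lines containing the literal string "to=<".
# Step 2: of those, keep lines containing any ID listed in mailids.
result=$(LC_ALL=C grep -F 'to=<' "$tmp/maillog" | LC_ALL=C grep -F -f "$tmp/mailids")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The output is the same set of lines as the original pipeline; only the order of the two filters differs.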

Upvotes: 2
