BadwolF

Reputation: 9

Grep for a dynamically matched ID in a file and print the other lines carrying both that ID and another pattern

Let's say I have a log file that looks like this:

06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

What I wish to do is: for every ID that appears on a Matched Line, print that ID's Some Data xxyyzz lines (without printing the Matched Line itself).

So in this case the output should be:

06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

The file I am talking about here is huge (~200 GB, millions of records) and sits on a shared server, so I cannot run scripts or commands that take a lot of time.

[EDIT] - Currently I do this with fgrep, printing the unique IDs from the Matched Line entries into one file and the Some Data xxyyzz entries into another; but I am looking for a single-line grep, awk, or sed command (without having to create multiple files to fgrep).

[EDIT 2] - The data above is not in a file; rather, it is the intermediate output of a series of grep and sort commands.

[EDIT 3] - Updated Sample Input (not in order but jumbled):

06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

Upvotes: 1

Views: 2786

Answers (2)

Richard

Reputation: 3100

grep "Matched Line" data.txt  | awk '{print $4}' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
  1. Search for all "Matched Line" entries
  2. Print the 4th field (the ID) of each such line to stdout
  3. For each ID, run grep over data.txt again to find all lines containing it
  4. Filter out the "Matched Line" entries themselves

Output:

06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Alternatively, using bash's process substitution, we can reduce the number of times that the file data.txt has to be read to just two:

grep -f <(grep "Matched Line" data.txt  | awk '{print $4}') data.txt | grep -v "Matched Line"
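Since the IDs are literal strings, a possible refinement (not in the original answer): adding -F (fixed strings) and -w (whole-word match) keeps grep from interpreting an ID as a regular expression or matching it inside a longer token, a cheap safety margin on a 200 GB scan. A minimal sketch against a three-line sample:

```shell
# A small sample of the log: one ID with a Matched Line, one without
cat > data.txt <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# -F treats each ID as a fixed string rather than a regex;
# -w matches whole words only, so no ID can match inside a longer token
grep -Fwf <(grep "Matched Line" data.txt | awk '{print $4}') data.txt \
  | grep -v "Matched Line"
# prints only the Some Data line for 06z07mjBYxFpzs
```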

Upvotes: 1

John1024

Reputation: 113824

Ordered Data

The following just goes through the file once and therefore should be fast:

$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz

In the sample input of the original question, every Some Data line immediately follows its Matched Line. That ordering is what makes this fast, simple solution possible.

How to use in a pipeline

awk works well in pipelines. If the input is not from a file but, as in Edit 2, from a pipeline, then use something like:

cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3

How it works

  • /Matched Line/{id=$4;next;}

    Any time we find a line containing the text Matched Line, we save its ID in the variable id. Since we do not want to print Matched Line, we tell awk to skip the rest of the commands and jump to the next line.

  • id==$4

    Any time that the current line has an ID (field 4) that matches our saved id, we print the line.

    (In awk terminology, id==$4 is a condition: it evaluates to true or false. When the condition is true, the action is performed. In this case, we specified no action so awk performs the default action which is to print the line.)
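To make that default action visible, the same one-liner can be written with the print spelled out explicitly; the two forms are equivalent. A minimal sketch against a three-line sample:

```shell
# Minimal sample: ordered data, each Some Data line follows its Matched Line
cat > file.log <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# Same logic as above, with the implicit default action written out:
# when the condition id==$4 is true, print the whole line ($0)
awk '/Matched Line/ {id=$4; next} id==$4 {print $0}' file.log
# prints only the Some Data line for 06z07mjBYxFpzs
```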

Partially Ordered Data

In Edit 3, the data lines can appear at some random location after the matched line. In that case:

$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Or, in a pipeline:

cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
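If the data could be fully unordered, i.e. a Some Data line might also appear before its Matched Line, a two-pass variant (not in the original answer, and only usable when the input is a file rather than a pipeline) reads the file twice: pass one collects the IDs, pass two prints. On a 200 GB file this doubles the I/O, so the one-pass version above is preferable whenever the ordering holds. A sketch:

```shell
# Sample where the Some Data line precedes its Matched Line
cat > file.log <<'EOF'
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:21.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# Pass 1 (NR==FNR is true only while reading the first copy of the file):
# record every ID seen on a Matched Line.
# Pass 2: print any non-Matched line whose ID was recorded.
awk 'NR==FNR {if (/Matched Line/) id[$4]=1; next}
     !/Matched Line/ && id[$4]' file.log file.log
# prints only the Some Data line for 06z07mjBYxFpzs
```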

Upvotes: 3
