BadwolF

Reputation: 9

Grep for a dynamically matched ID in a file and print the other lines carrying both that ID and another pattern

Let's say I have a log file that looks like this:

06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

What I wish to do is: for every ID that appears on a Matched Line, print that ID's Some Data xxyyzz lines (without printing the Matched Line itself).

So in this case the output should be:

06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

The file I am talking about here is huge (~200 GB, millions of records) and sits on a shared server, so I cannot run scripts or commands that take a lot of time.

[EDIT] - Currently I do this with fgrep, printing the unique IDs from the Matched Line entries into one file and the Some Data xxyyzz entries into another; but I am looking for a single-line grep, awk, or sed command (without having to create multiple files to fgrep).

[EDIT 2] - The data above is not in a file; rather, it is the intermediate output of a series of grep and sort commands.

[EDIT 3] - Updated Sample Input (not in order but jumbled):

06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

Upvotes: 1

Views: 2786

Answers (2)

Richard

Reputation: 3100

grep "Matched Line" data.txt  | awk '{print $4}' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
  1. Search for all "Matched Line" entries
  2. Print the 4th field (the ID) of each such line to stdout
  3. For each ID, run grep over data.txt again to find all lines containing it
  4. Filter out the "Matched Line" entries themselves

Output:

06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Alternatively, using bash's process substitution, we can reduce the number of times that the file data.txt has to be read to just two:

grep -f <(grep "Matched Line" data.txt  | awk '{print $4}') data.txt | grep -v "Matched Line"
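Since the IDs are literal strings, a possible refinement (not in the original answer): adding -F (fixed strings) and -w (whole-word match) keeps grep from interpreting an ID as a regular expression or matching it inside a longer token, a cheap safety margin on a 200 GB scan. A minimal sketch against a three-line sample:

```shell
# A small sample of the log: one ID with a Matched Line, one without
cat > data.txt <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# -F treats each ID as a fixed string rather than a regex;
# -w matches whole words only, so no ID can match inside a longer token
grep -Fwf <(grep "Matched Line" data.txt | awk '{print $4}') data.txt \
  | grep -v "Matched Line"
# prints only the Some Data line for 06z07mjBYxFpzs
```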

Upvotes: 1

John1024

Reputation: 113824

Ordered Data

The following just goes through the file once and therefore should be fast:

$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz

In the sample input of the original question, every Some Data line immediately follows its Matched Line. That ordering is what makes this fast, simple solution possible.

How to use in a pipeline

awk works well in pipelines. If the input is not from a file but, as in Edit 2, from a pipeline, then use something like:

cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3

How it works

  • /Matched Line/{id=$4;next;}

    Any time we find a line containing the text Matched Line, we save its ID in the variable id. Since we do not want to print Matched Line, we tell awk to skip the rest of the commands and jump to the next line.

  • id==$4

    Any time that the current line has an ID (field 4) that matches our saved id, we print the line.

    (In awk terminology, id==$4 is a condition: it evaluates to true or false. When the condition is true, the action is performed. In this case, we specified no action so awk performs the default action which is to print the line.)
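To make that default action visible, the same one-liner can be written with the print spelled out explicitly; the two forms are equivalent. A minimal sketch against a three-line sample:

```shell
# Minimal sample: ordered data, each Some Data line follows its Matched Line
cat > file.log <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# Same logic as above, with the implicit default action written out:
# when the condition id==$4 is true, print the whole line ($0)
awk '/Matched Line/ {id=$4; next} id==$4 {print $0}' file.log
# prints only the Some Data line for 06z07mjBYxFpzs
```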

Partially Ordered Data

In Edit 3, the data lines can appear at some random location after the matched line. In that case:

$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Or, in a pipeline:

cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
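If the data could be fully unordered, i.e. a Some Data line might also appear before its Matched Line, a two-pass variant (not in the original answer, and only usable when the input is a file rather than a pipeline) reads the file twice: pass one collects the IDs, pass two prints. On a 200 GB file this doubles the I/O, so the one-pass version above is preferable whenever the ordering holds. A sketch:

```shell
# Sample where the Some Data line precedes its Matched Line
cat > file.log <<'EOF'
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:21.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# Pass 1 (NR==FNR is true only while reading the first copy of the file):
# record every ID seen on a Matched Line.
# Pass 2: print any non-Matched line whose ID was recorded.
awk 'NR==FNR {if (/Matched Line/) id[$4]=1; next}
     !/Matched Line/ && id[$4]' file.log file.log
# prints only the Some Data line for 06z07mjBYxFpzs
```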

Upvotes: 3
