Reputation: 9
Let's say I have a log file which looks like this:
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
What I wish to do is: whenever a line contains "Matched Line", take the unique id in column 4 (e.g. 06z07mjBYxFpzs) and print the "Some Data xxyyzz" line that carries the same id on the console as the final output. So in this case the output should be:
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
The file I am talking about here is huge (~200 GB, with millions of records) and lives on a shared server, so I cannot run scripts or commands that would take a lot of time.
[EDIT] - Currently doing this with fgrep, by printing the unique ids from the Matched Line entries into one file and the Some Data xxyyzz lines into another; but I am looking for a single-line grep, awk or sed command (without having to create multiple files to fgrep).
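For illustration, the current two-step approach is roughly the following (app.log, ids.txt and somedata.txt are placeholder names):
grep "Matched Line" app.log | awk '{print $4}' | sort -u > ids.txt
grep "Some Data xxyyzz" app.log > somedata.txt
fgrep -f ids.txt somedata.txt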
[EDIT 2] - This output is not in a file; rather, it is an intermediate output of a series of grep and sort commands.
[EDIT 3] - Updated sample input (the lines are jumbled rather than in order):
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Upvotes: 1
Views: 2786
Reputation: 3100
grep "Matched Line" data.txt | awk '{print $4}' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
Output:
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Alternatively, using bash's process substitution, we can reduce the number of times that the file data.txt has to be read to just two:
grep -f <(grep "Matched Line" data.txt | awk '{print $4}') data.txt | grep -v "Matched Line"
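A possible refinement on a file this large: tell grep to treat the ids as fixed strings (-F) and whole words (-w) so they are not interpreted as regular expressions, e.g.:
grep -F -w -f <(grep "Matched Line" data.txt | awk '{print $4}') data.txt | grep -v "Matched Line"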
Upvotes: 1
Reputation: 113824
The following just goes through the file once and therefore should be fast:
$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
In the sample input (original question), all the Some Data lines immediately follow their Matched Line, which is what enables this fast and simple solution.
awk works well in pipelines. If the input is not from a file but, as in Edit 2, from a pipeline, then use something like:
cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3
/Matched Line/{id=$4;next;}
Any time we find a line containing the text Matched Line, we save its ID in the variable id. Since we do not want to print the Matched Line itself, we tell awk to skip the rest of the commands and jump to the next line.
id==$4
Any time the current line has an ID (field 4) that matches our saved id, we print the line.
(In awk terminology, id==$4 is a condition: it evaluates to true or false. When the condition is true, the action is performed. In this case, we specified no action, so awk performs the default action, which is to print the line.)
In Edit 3, the data lines can appear at some random location after the matched line. In that case:
$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Or, in a pipeline:
cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
Upvotes: 3